New pre-built sglang Docker Images for NVIDIA DGX Spark

Similar to the community-friendly prebuilt vLLM images (New pre-built vLLM Docker Images for NVIDIA DGX Spark), I'm going to start maintaining prebuilt SGLang images.

Images

  • scitrera/dgx-spark-sglang:0.5.8-t4

    • SGLang 0.5.8 (with build fixes post-release)
    • PyTorch 2.10.0 (with torchvision + torchaudio)
    • CUDA 13.1.1
    • Transformers 4.57.6
    • Triton 3.6.0
    • NCCL 2.29.3-1
    • FlashInfer 0.6.3
  • scitrera/dgx-spark-sglang:0.5.8-t5

    • Same as above, but with Transformers 5.1.0

Example Usage (SGLang)

docker run \
  --privileged \
  --gpus all \
  -it --rm \
  --network host --ipc=host \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  scitrera/dgx-spark-sglang:0.5.8-t4 \
  sglang serve \
    --model-path Qwen/Qwen2.5-7B-Instruct \
    --mem-fraction-static 0.4
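Once the container is up, SGLang exposes an OpenAI-compatible API. A minimal request sketch in Python (assuming SGLang's default port 30000, since the command above doesn't pass --port; adjust to match your setup):

```python
import json
import urllib.request

# Chat-completions payload for the OpenAI-compatible endpoint.
payload = {
    "model": "Qwen/Qwen2.5-7B-Instruct",
    "messages": [{"role": "user", "content": "Say hello."}],
    "max_tokens": 32,
}
req = urllib.request.Request(
    "http://localhost:30000/v1/chat/completions",  # change port if you pass --port
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
# Uncomment once the server is running:
# print(urllib.request.urlopen(req).read().decode())
```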

Note: These are still experimental builds.

Open for feedback from anyone looking to test out an alternative to vLLM. The two are often very similar, but depending on the workload, SGLang might yield superior performance.


Confirmed able to use NCCL-based distributed torch clustering for Qwen3-Coder-Next on 4x DGX Spark cluster with tensor parallelism.

Benchmark: llama-benchy --base-url "http://10.24.11.13:8000/v1" --pp 512 2048 8192 16384 32768 65535 131072 --tg 32 128 --runs 5 --model "Qwen/Qwen3-Coder-Next" --served-model-name "qwen3-coder-next" (NOTE: llama-benchy requires a minor patch for sglang compatibility – PR sent – FYI @eugr)

| model                 |     test |             t/s |     peak t/s |         ttfr (ms) |      est_ppt (ms) |     e2e_ttft (ms) |
|:----------------------|---------:|----------------:|-------------:|------------------:|------------------:|------------------:|
| Qwen/Qwen3-Coder-Next |    pp512 | 2598.79 ± 59.33 |              |     199.78 ± 4.48 |     197.12 ± 4.48 |     199.83 ± 4.49 |
| Qwen/Qwen3-Coder-Next |     tg32 |    51.75 ± 0.46 | 53.43 ± 0.47 |                   |                   |                   |
| Qwen/Qwen3-Coder-Next |    pp512 | 2626.45 ± 86.05 |              |     197.80 ± 6.23 |     195.14 ± 6.23 |     197.85 ± 6.23 |
| Qwen/Qwen3-Coder-Next |    tg128 |    52.06 ± 0.28 | 53.20 ± 0.40 |                   |                   |                   |
| Qwen/Qwen3-Coder-Next |   pp2048 | 4633.44 ± 89.05 |              |     444.79 ± 8.71 |     442.13 ± 8.71 |     444.84 ± 8.71 |
| Qwen/Qwen3-Coder-Next |     tg32 |    50.43 ± 0.50 | 52.06 ± 0.52 |                   |                   |                   |
| Qwen/Qwen3-Coder-Next |   pp2048 | 4608.33 ± 40.62 |              |     447.11 ± 3.92 |     444.45 ± 3.92 |     447.16 ± 3.92 |
| Qwen/Qwen3-Coder-Next |    tg128 |    51.30 ± 0.12 | 53.00 ± 0.00 |                   |                   |                   |
| Qwen/Qwen3-Coder-Next |   pp8192 | 7226.66 ± 42.36 |              |    1136.28 ± 6.64 |    1133.62 ± 6.64 |    1136.33 ± 6.64 |
| Qwen/Qwen3-Coder-Next |     tg32 |    45.67 ± 0.37 | 47.15 ± 0.39 |                   |                   |                   |
| Qwen/Qwen3-Coder-Next |   pp8192 | 7253.72 ± 55.98 |              |    1132.02 ± 8.75 |    1129.36 ± 8.75 |    1132.07 ± 8.75 |
| Qwen/Qwen3-Coder-Next |    tg128 |    46.54 ± 0.16 | 48.00 ± 0.00 |                   |                   |                   |
| Qwen/Qwen3-Coder-Next |  pp16384 | 6992.62 ± 19.51 |              |    2345.72 ± 6.53 |    2343.06 ± 6.53 |    2345.76 ± 6.53 |
| Qwen/Qwen3-Coder-Next |     tg32 |    41.79 ± 0.12 | 43.15 ± 0.13 |                   |                   |                   |
| Qwen/Qwen3-Coder-Next |  pp16384 | 6935.13 ± 18.04 |              |    2365.11 ± 6.19 |    2362.45 ± 6.19 |    2365.16 ± 6.18 |
| Qwen/Qwen3-Coder-Next |    tg128 |    42.32 ± 0.19 | 43.60 ± 0.49 |                   |                   |                   |
| Qwen/Qwen3-Coder-Next |  pp32768 | 6149.32 ± 15.66 |              |   5331.41 ± 13.61 |   5328.75 ± 13.61 |   5331.46 ± 13.61 |
| Qwen/Qwen3-Coder-Next |     tg32 |    36.09 ± 0.20 | 37.26 ± 0.21 |                   |                   |                   |
| Qwen/Qwen3-Coder-Next |  pp32768 | 6154.14 ± 29.38 |              |   5327.29 ± 25.54 |   5324.63 ± 25.54 |   5327.34 ± 25.54 |
| Qwen/Qwen3-Coder-Next |    tg128 |    36.67 ± 0.10 | 38.00 ± 0.00 |                   |                   |                   |
| Qwen/Qwen3-Coder-Next |  pp65535 | 5014.19 ± 17.97 |              |  13072.73 ± 46.83 |  13070.07 ± 46.83 |  13072.78 ± 46.82 |
| Qwen/Qwen3-Coder-Next |     tg32 |    28.40 ± 0.14 | 29.00 ± 0.00 |                   |                   |                   |
| Qwen/Qwen3-Coder-Next |  pp65535 | 4957.84 ± 17.64 |              |  13221.29 ± 46.95 |  13218.63 ± 46.95 |  13221.36 ± 46.96 |
| Qwen/Qwen3-Coder-Next |    tg128 |    28.53 ± 0.34 | 29.80 ± 0.40 |                   |                   |                   |
| Qwen/Qwen3-Coder-Next | pp131072 | 3625.16 ± 15.60 |              | 36159.50 ± 155.42 | 36156.84 ± 155.42 | 36159.55 ± 155.42 |
| Qwen/Qwen3-Coder-Next |     tg32 |    19.66 ± 0.12 | 20.40 ± 0.49 |                   |                   |                   |
| Qwen/Qwen3-Coder-Next | pp131072 | 3637.82 ± 13.35 |              | 36033.56 ± 131.95 | 36030.90 ± 131.95 | 36033.61 ± 131.96 |
| Qwen/Qwen3-Coder-Next |    tg128 |    19.79 ± 0.18 | 21.00 ± 0.00 |                   |                   |                   |

llama-benchy (0.1.dev77+gae09cab52)
date: 2026-02-15 18:38:02 | latency mode: api

I'll try to put out a tutorial on sglang-based clustering soon… this run used the scitrera/dgx-spark-sglang:0.5.8-t5 image.


Thanks, I’ll have a look later!
EDIT: merged – though via a different PR that did the same thing, just a bit more pythonically.

Yeah, I should've looked at the other PR; I just didn't expect it to be the same thing, I guess. I'm not sure it was "more pythonic" – otherwise they shouldn't have added the walrus operator lol.

technically mine involved 1 less dict key lookup if the if statement passed ;-)

But same difference – it makes no difference here. I just got in the habit of doing that because I had some nightmares doing optimization work back in Python 2.7 (every `self.something` and `obj['sdfs']` means a lookup, and for what I was doing in particular, it mattered A LOT), and the addition of the walrus operator years later seemed like just what was needed to help cut down some of that. So I became a bit of a devotee.


Thanks! I will move this over to GB10 projects

whoops… I forgot to post here!

Released 0.5.9 sglang images

scitrera/dgx-spark-sglang:0.5.9-t4
scitrera/dgx-spark-sglang:0.5.9-t5

  • FlashInfer upgraded to 0.6.4
  • Transformers 4.5.6 (-t4)
  • Transformers from git (a4a176171c47979125025041adc4f8d201aec310) (in-between 5.2.0 and 5.3.0) (-t5)

I'm still very new to this, but I've tried running this with Qwen/Qwen3-Coder-Next; it loads very quickly, gobbles up all 128 GB of my DGX Spark, and crashes.
I've tried setting a small context length and a low mem-fraction-static value, with no effect.

Any tips to help troubleshoot this?

You'll need to at least use the FP8 version to fit on a single Spark.

I’d recommend using sparkrun to help you; Installation | sparkrun

And then you can run on your spark with:

sparkrun run qwen3-coder-next-fp8-sglang --tp 1 -H 127.0.0.1

The -H 127.0.0.1 instructs sparkrun to run locally. The --tp 1 forces running on a single node; the recipe actually defaults to two Sparks. Note that it should just barely fit on a single Spark if you want a large context length, but you should be careful with concurrent requests, and likely nothing else can be running on your Spark at the same time.

It’ll output what it is doing while it goes to run, but you can also get information about the “recipe” by typing:

user@spark-1234:~$ sparkrun show qwen3-coder-next-fp8-sglang --tp 1

Name:         qwen3-coder-next-fp8-sglang
Description:  Qwen3 Coder Next (upstream FP8 quant) -- cluster only
Maintainer:   scitrera.ai <open-source-team@scitrera.com>
Runtime:      sglang
Model:        Qwen/Qwen3-Coder-Next-FP8
Container:    scitrera/dgx-spark-sglang:0.5.9-t5
Nodes:        2 - unlimited
Repository:   sparkrun-transitional
File Path:    /home/drew/.cache/sparkrun/registries/sparkrun-transitional/transitional/recipes/qwen3-coder-next/qwen3-coder-next-fp

Defaults:
  attention_backend: triton
  fp8_gemm_backend: cutlass
  gpu_memory_utilization: 0.8
  host: 0.0.0.0
  max_model_len: 200000
  port: 8000
  served_model_name: qwen3-coder-next
  tensor_parallel: 2
  tool_call_parser: qwen3_coder

Command:
  python3 -m sglang.launch_server \
    --model-path {model} \
    --served-model-name {served_model_name} \
    --context-length {max_model_len} \
    --mem-fraction-static {gpu_memory_utilization} \
    --tp-size {tensor_parallel} \
    --host {host} \
    --port {port} \
    --attention-backend {attention_backend} \
    --fp8-gemm-backend {fp8_gemm_backend} \
    --tool-call-parser {tool_call_parser}

VRAM Estimation:
  Model dtype:      fp8
  Model params:     80,000,000,000
  KV cache dtype:   bfloat16
  Architecture:     48 layers, 2 KV heads, 256 head_dim
  Model weights:    74.51 GB
  KV cache:         18.31 GB (max_model_len=200,000)
  Tensor parallel:  1
  Per-GPU total:    92.82 GB
  DGX Spark fit:    YES

  GPU Memory Budget:
    gpu_memory_utilization: 80%
    Usable GPU memory:     96.8 GB (121 GB x 80%)
    Available for KV:      22.3 GB
    Max context tokens:    243,512
    Context multiplier:    1.2x (vs max_model_len=200,000)
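The VRAM estimation above can be reproduced by hand with the standard sizing arithmetic (a sketch of the math, not necessarily sparkrun's exact code): fp8 weights are one byte per parameter, and the bf16 KV cache costs 2 (K and V) × layers × KV heads × head_dim × 2 bytes per token.

```python
# Reproducing the recipe's VRAM estimate for Qwen/Qwen3-Coder-Next-FP8
# (80B params, fp8 weights; 48 layers, 2 KV heads, head_dim 256, bf16 KV cache).
GIB = 1024 ** 3

params = 80_000_000_000
weights_gb = params * 1 / GIB  # fp8 = 1 byte per parameter

layers, kv_heads, head_dim = 48, 2, 256
kv_bytes_per_token = 2 * layers * kv_heads * head_dim * 2  # K+V, bf16 = 2 bytes
kv_gb = kv_bytes_per_token * 200_000 / GIB                 # max_model_len tokens

print(f"weights:  {weights_gb:.2f} GB")            # -> weights:  74.51 GB
print(f"kv cache: {kv_gb:.2f} GB")                 # -> kv cache: 18.31 GB
print(f"total:    {weights_gb + kv_gb:.2f} GB")    # -> total:    92.82 GB
```

The totals match the estimator's output, which is how you can sanity-check whether a different quant or context length will fit in a Spark's memory budget.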

I’m using scitrera/dgx-spark-sglang:0.5.9-t5 on a single DGX Spark with SGLang, running Qwen3.5 27B int4-autoround successfully. I’ve now purchased a DGX Spark clone (Gigabyte) to create a cluster. What’s the recommended approach for using this container image in a multi-node configuration?

sparkrun. I also recommend it for single nodes (not just multi-node clusters).

The latest version also includes a CLI setup wizard to help you with networking, SSH config, etc. The wizard is new, so feedback is appreciated, but hopefully it will help you do the configuration as well.

Install uv if you don’t have it already: curl -LsSf https://astral.sh/uv/install.sh | sh

Install sparkrun and start the wizard: uvx sparkrun setup

You'll probably want to accept a lot of the defaults / say yes a lot – but you'll have to give it the IPs of your first node and the other node when it asks, e.g. 127.0.0.1,192.168.44.21, where 127.0.0.1 means the current system and 192.168.44.21 is the Ethernet IP of your new 2nd Spark. It may ask you to type in the passwords as part of the setup process; it doesn't save them. (Example IPs written assuming you're operating from Spark #1.)

Then you can use existing “recipes” for sglang models from the preconfigured registries or make your own.

drew@spark-2918:~$ sparkrun list sglang
Name                           Runtime   TP   Nodes   GPU Mem   Model                        Registry               
--------------------------------------------------------------------------------------------------------------------
qwen3-1.7b-sglang              sglang    1    1       0.3       Qwen/Qwen3-1.7B              sparkrun-transitional  
qwen3-coder-next-fp8-sglang    sglang    2    2       0.8       Qwen/Qwen3-Coder-Next-FP8    sparkrun-transitional  
qwen3.5-0.8b-bf16-sglang       sglang    1    1       0.8       Qwen/Qwen3.5-0.8B            sparkrun-transitional  
qwen3.5-122b-a10b-fp8-sglang   sglang    2    2       0.8       Qwen/Qwen3.5-122B-A10B-FP8   sparkrun-transitional  
qwen3.5-27b-fp8-sglang         sglang    1    1       0.8       Qwen/Qwen3.5-27B-FP8         sparkrun-transitional  
qwen3.5-2b-bf16-sglang         sglang    1    1       0.8       Qwen/Qwen3.5-2B              sparkrun-transitional  
qwen3.5-35b-a3b-bf16-sglang    sglang    1    1       0.8       Qwen/Qwen3.5-35B-A3B         sparkrun-transitional  
qwen3.5-35b-a3b-fp8-sglang     sglang    1    1       0.8       Qwen/Qwen3.5-35B-A3B-FP8     sparkrun-transitional  
qwen3.5-4b-bf16-sglang         sglang    1    1       0.8       Qwen/Qwen3.5-4B              sparkrun-transitional  
qwen3.5-9b-bf16-sglang         sglang    1    1       0.8       Qwen/Qwen3.5-9B              sparkrun-transitional  

The registries are all publicly available git repos. Recently, I’ve been working with @eugr and @raphael.amorim on Spark Arena, and since @eugr is the king of DGX Spark vLLM, I’ve been rather vLLM focused lately (i.e. in the past 1-2 weeks), but I do plan to come back to sglang containers and recipes.

You can run an existing recipe easily enough:

Run it with default settings: sparkrun run qwen3.5-35b-a3b-fp8-sglang

Override it to use tensor parallelism across nodes and reduce GPU memory utilization: sparkrun run qwen3.5-35b-a3b-fp8-sglang --tp 2 --gpu-mem 0.5 – which should give you a nice speed boost leveraging both nodes (I reduced the target memory utilization in this example to leave some more RAM free for other things).

You can view the recipe text with:
sparkrun export recipe qwen3.5-35b-a3b-fp8-sglang
–or–
save it to a file with: sparkrun export recipe qwen3.5-35b-a3b-fp8-sglang --save my-recipe.yaml

Then you can edit the defaults to your preferences, save it, and run

sparkrun run ./my-recipe.yaml – it won't require you to override settings at the CLI.

When you make your own recipes, you can also change the model, the base container, etc., so you can pretty much automate running whatever you want to run. Then you could publish your recipes to registries to manage them via git or to share them with others.

You could even install a model as a system service with sparkrun export systemd.

You can also run sparkrun on a Linux/Mac/(Windows via WSL) machine that is not one of your Sparks, and use it to manage/orchestrate your Sparks remotely.

There are fairly complete docs on the website (https://sparkrun.dev) so you can look stuff up there or chat on the forums about it: Sparkrun - central command with tab completion for launching inference on Spark Clusters - #48 by dbsci

Happy Sparking!

P.S. >> I forgot to mention there is also a Claude Code plugin. So once you're set up, you can use the Claude Code plugin to check/start/stop inference jobs via Claude Code. More AI automation is coming soon.


Amazing progress on sparkrun, @dbsci. Really amazing, world-class quality of work and commitment to get all the integration pieces done so quickly and smoothly. Super grateful to have you leading Spark Arena planning and implementation with @eugr and me now. Great teamwork.


Thank you. I'm still waiting for the cable so I can connect the cluster. What's the design difference between sparkrun and spark-vllm-docker? I have used spark-vllm-docker, so I have basically built two images, vllm-node and vllm-node-tf5, and run vLLM with the given recipes. I also download or create recipes to test other HF models. I've never used sparkrun. I'm just impressed with scitrera/dgx-spark-sglang:0.5.9-t5 and would like to know the difference between it and vllm-node-tf5.


Would like to thank you guys for your work; we are all learning a lot from you :)


sparkrun is an orchestration tool – it just coordinates the activities to make it smoother to use everything else.

spark-vllm-docker is a well-maintained system for building the latest version of vLLM for the DGX Spark.

sglang is an alternative/competing inference engine/framework to vLLM.

There is tremendous overlap between vLLM and SGLang, but they do have some differences in how they approach the details – which is why you'll find that the best one to use is task/situation dependent. I find sglang tends to perform really well for heavy prefix matching, but since it's all open source, everyone copies what works well from everyone else – so it can be confusing.

So basically you’ve got vllm and sglang as two top inference frameworks. They do the job of actually running the model and providing the server for your LLM requests. The containers are just packaged installations of those. Because everything is changing so fast in AI/LLMs/etc., there is a reason to use a new version practically daily (depending on your particular needs). (And I should mention that getting these things to run on the DGX Spark in particular can require extra work beyond the vanilla build.)

So spark-vllm-docker is sort of a mix of things – @eugr spends a lot of time and energy to make sure that you don’t have regressions/problems that come up with new versions. We all need the latest version for various reasons, but sometimes that’s a bad idea (stuff breaks). He’s basically trying to manage that problem and make it easy to get the latest useful version. And that’s what you’re building.

scitrera/dgx-spark-sglang:0.5.9-t5 is an image of sglang that I built with somewhat similar principles except that I focused on publishing the prebuilt images over the build toolchain (which is also open source and on github). As part of work with Spark Arena, I’ll be working on more toolchains for sglang as well to basically mirror what @eugr is doing for vllm.

And back to sparkrun, sparkrun is an automation tool to make it easier for you to use all of these things – for people with 1 spark or many. Including running benchmarks so that you can upload them to spark arena and see how your particular configuration/recipe stacks up.

Rambling Complete.

Edit: Not complete! llama.cpp is another top inference framework, and we also publish images for that: ghcr.io/spark-arena/dgx-llama-cpp:latest. llama.cpp is pretty different from the others in that it's basically a reimplementation of everything. It's good for single-node inference and can be a quick onramp to get things moving as well. sparkrun works with that too ;-).


I have one question about this image: the sglang version reports as 0.0.0. Below is my dump:
{
  "container": "sglang_node_tf5",
  "tool": "collect_env",
  "format": "json",
  "data": {
    "python": "3.12.3",
    "python_full": "3.12.3 (main, Jan 22 2026, 20:57:42) [GCC 13.3.0]",
    "platform": "linux",
    "transformers": "5.2.0.dev0",
    "torch": "2.10.0",
    "sglang": "0.0.0",
    "numpy": "2.4.2",
    "torch_version": "2.10.0",
    "torch_cuda_build": "13.1",
    "torch_cudnn": 91900,
    "nvidia_smi_exit": 0,
    "nvidia_smi": "NVIDIA GB10, 580.126.09, [N/A]"
  }
}

Hmm… I don't know if the collect_env tool knows what to do with that or not – or if it's a quirk of how the image is built – but the labels on the container are correct and reflect how it was built:

docker inspect scitrera/dgx-spark-sglang:0.5.9-t5 --format '{{json .Config.Labels}}' | jq
{
  "dev.scitrera.cuda_version": "13.1.1",
  "dev.scitrera.flashinfer_version": "0.6.4",
  "dev.scitrera.nccl_version": "2.29.3-1",
  "dev.scitrera.sglang_version": "0.5.9",
  "dev.scitrera.torch_audio_version": "2.10.0",
  "dev.scitrera.torch_version": "2.10.0",
  "dev.scitrera.torch_vision_version": "0.25.0",
  "dev.scitrera.transformers_version": "5.3.0+git-a4a17617",
  "dev.scitrera.triton_version": "3.6.0",
  "maintainer": "scitrera.ai <open-source-team@scitrera.com>",
  "org.opencontainers.image.ref.name": "ubuntu",
  "org.opencontainers.image.version": "24.04"
}

You can check out the source for dockerfiles, etc.: https://github.com/scitrera/cuda-containers
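For what it's worth, a 0.0.0 version string is often what tooling falls back to when a from-source build is missing its package metadata. Purely an illustration of that common lookup pattern (not necessarily what collect_env actually does):

```python
import importlib.metadata


def detect_version(pkg: str) -> str:
    """Return the installed version, or "0.0.0" when no metadata is found."""
    try:
        return importlib.metadata.version(pkg)
    except importlib.metadata.PackageNotFoundError:
        # Source builds installed without dist metadata land here.
        return "0.0.0"


print(detect_version("this-package-does-not-exist"))  # -> 0.0.0
```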


Thank you so much. I'm having trouble using this image to run MoE models. I can run the dense Qwen3.5-27B-int4-AutoRound, but not Qwen3.5-122B-A10B-int4-AutoRound. What I mean is that the DGX was able to start and generate some tokens, but the output text is garbled. Any idea? Below is the script I modified, based on a script from ChatGPT:

#!/bin/bash

# =========================
# Model & Server Config
# =========================
MODEL="Intel/Qwen3.5-122B-A10B-int4-AutoRound"
SERVED_MODEL_NAME="qwen3.5-122b"
CONTEXT_LENGTH=32768      # safer than 262k
MEM_FRACTION_STATIC=0.7   # max memory fraction
TENSOR_PARALLEL=1         # single GPU
HOST="0.0.0.0"
PORT=8000

# =========================
# Backend & Kernels
# =========================
ATTENTION_BACKEND="triton"   # safer than triton default
FP8_GEMM_BACKEND="triton"    # safer for INT4 MoE
MOE_KERNEL_CONFIG="/workspace/moe_triton_config.json"  # custom kernel config (optional)

# =========================
# Parsers (optional, as before)
# =========================
TOOL_CALL_PARSER="qwen3_coder"
REASONING_PARSER="qwen3"

# =========================
# Launch Server
# =========================
python3 -m sglang.launch_server \
  --model-path ${MODEL} \
  --served-model-name ${SERVED_MODEL_NAME} \
  --context-length ${CONTEXT_LENGTH} \
  --mem-fraction-static ${MEM_FRACTION_STATIC} \
  --tp-size ${TENSOR_PARALLEL} \
  --host ${HOST} \
  --port ${PORT} \
  --enable-metrics \
  --moe-runner-backend flashinfer_cutlass \
  --disable-cuda-graph \
  --disable-radix-cache \
  --kv-cache-dtype bf16 \
  --attention-backend ${ATTENTION_BACKEND} \
  --fp8-gemm-backend ${FP8_GEMM_BACKEND} \
  --tool-call-parser ${TOOL_CALL_PARSER} \
  --reasoning-parser ${REASONING_PARSER} \
  --trust-remote-code

I am working on releasing a newer version of the sglang container, so it might work with that… there were a few Qwen3.5 bugs resolved since 0.5.9 was released.

In the meantime, it’ll be easier on vllm with sparkrun since I know that one works…

sparkrun run @experimental/qwen3.5-122b-a10b-int4-autoround-vllm

The @experimental prefix is because it is from the spark arena experimental registry for now. We’re in the process of unifying recipe registries, but it’s a bit of an effort because we don’t want to do it unless everything is carefully tested and ready. The recipe file is at: recipe-registry/experimental-recipes/eugr-vllm/qwen3.5-122b-a10b-int4-autoround-vllm.yaml at main · spark-arena/recipe-registry · GitHub.

Once tested, I can put out a recipe for that model with sglang as well, but I’ll have to work through a few things before I’m ready to do that.

You'll notice that the recipe file shares a lot in common with your script file – and that's the point. Recipes basically just describe what to run; sparkrun handles the logistics around doing it right – especially relevant when you get into using a cluster with multiple Sparks working together.

@dbsci I just saw scitrera/dgx-spark-sglang:0.5.10rc0 – are you building this image? I am testing it right now. Cheers.

I guess let me know how it goes. I had to retool things to build it – and then didn't quite finish… so I don't really know if it'll work for everything.