Yes. So far, the containers are just fresh builds (mine) of curated version mixes of the official open-source packages. I've gotten a bit bogged down with other work, but I will release a GitHub repo with the build script & Dockerfiles, including "recipe" files that coordinate all of the versions.
I'm planning to keep using my (currently) 4x Spark cluster, so I plan to maintain these builds for a while since they help me try out the latest models. I basically want to enforce stricter versioning than always using nightly builds (plus, e.g., the latest vLLM nightlies don't necessarily vary the PyTorch build, etc.), but I can also accept more risk than NVIDIA does with its monthly container releases. So I see this as a community stopgap/intermediate: more bleeding-edge than the official NVIDIA containers, but somewhat more curated than a random mixture of nightly builds.
Somewhat of an aside, but I'm also planning to release (open source, either BSD or Apache-2.0 licensed) something that's like a mashup of the NVIDIA sync, the DGX dashboard, and the playbooks, to try to make it quick & easy for people to get started and to manage CX7 networking / Spark clusters.
scitrera/dgx-spark-vllm:0.14.0-t5: vLLM 0.14.0-ish (git 63227ac), PyTorch 2.10.0, CUDA 13.1.0, Transformers 5.0.0-rc? (git 0dfb28e1), Triton 3.5.1, NCCL 2.29.2-1; includes a patch for is_deepseek_mla() for the benefit of GLM-4.7-Flash.
FYI: tested GLM-4.7-Flash with ray and -tp 4 on a 4-Spark cluster using the scitrera/dgx-spark-vllm:0.14.0-t5 image.
Brilliant. I hate to burden you with silly questions, but I wonder if you have a quick step-by-step guide to getting GLM-4.7-Flash in particular running using these new images? I also heard a fix was required for that model so that it wouldn't create a 180GB KV cache. Is the fixed version the one that works with this new vLLM image, or the one with the KV-cache misconfiguration?
That fix/workaround is not included; that's a decision still to be made. I haven't tested it yet and it might work fine on Spark, but that fix/adjustment does not currently work properly on B200, which is why it hasn't been merged into vLLM yet. So I guess as long as it works on DGX Spark, then these Spark-specific images should have the patch.
I've updated the -t5 image to include the patch, since it seems to be working properly when tested on my Sparks. Here is the branch of the vLLM fork for completeness: GitHub - scitrera/vllm at v0.14.0+glm4-moe-lite-mla
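For anyone wanting to reproduce the ray + -tp 4 run mentioned above, here is roughly the shape of a launch, as an untested sketch only: the HEAD_IP address, the ray port, and the MODEL_ID placeholder (the GLM-4.7-Flash weights repo) are assumptions you will need to adapt to your own cluster.

```shell
# Head node (run first): start the ray head, then serve once the workers
# have joined; vLLM will wait for enough GPUs before initializing -tp 4
docker run --privileged --gpus all -it --rm --network host --ipc=host \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  scitrera/dgx-spark-vllm:0.14.0-t5 \
  bash -c "ray start --head --port=6379 && \
           vllm serve MODEL_ID --tensor-parallel-size 4 \
             --distributed-executor-backend ray"

# Worker nodes (the other 3 Sparks): join the ray cluster
docker run --privileged --gpus all -it --rm --network host --ipc=host \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  scitrera/dgx-spark-vllm:0.14.0-t5 \
  ray start --address=HEAD_IP:6379 --block
```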
And regarding the guide… I can convert my commands, etc. into a "guide" via ChatGPT, but what's your setup? 1x Spark? 2x Sparks? If 2x or more, do you already have them connected to each other and set up?
Thank you once more; I would really appreciate even a ChatGPT-generated guide to get this running.
Maybe you could make a guide for one Spark. Also, in your testing, what tokens-per-second and memory use do you see running GLM-4.7-Flash with the patch at NVFP4 with the context size set to about 100k? If you have two Sparks, it would also be interesting to understand how this scales across two Sparks.
I appreciate this is three tasks; the guide is the most useful, of course, because right now many of us can't run this at all.
I'll get back to you on the other parts; unfortunately I only have 4x Sparks, which, while it sounds like a lot, isn't if you're using them for work…
First of all, thank you for providing these dedicated vLLM images for DGX Spark. I am currently testing the scitrera/dgx-spark-vllm:0.14.0-t5 image on a DGX Spark GB10 (128GB), but I've encountered significant performance issues that I hope you can help with.
Model: Qwen/Qwen3-VL-32B-Instruct and Qwen/Qwen3-32B.
Image: scitrera/dgx-spark-vllm:0.14.0-t5.
The Issue:
I am seeing extremely low throughput. Whether using the VL or the pure text version of Qwen3-32B, the generation speed is stuck at ~3 tokens/s. For a Blackwell-based system, I was expecting significantly higher performance.
Configuration Used:
VLLM_USE_V1=1
VLLM_V1_ENABLED=1
VLLM_ATTENTION_BACKEND=FLASHINFER
--gpu-memory-utilization 0.90
--max-model-len 32768 (also tried 8192)
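Assembled into a single launch command, the configuration above looks roughly like this (a reconstructed sketch based on the image name used in this thread, not a verified invocation):

```shell
docker run --privileged --gpus all -it --rm --network host --ipc=host \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -e VLLM_USE_V1=1 -e VLLM_V1_ENABLED=1 \
  -e VLLM_ATTENTION_BACKEND=FLASHINFER \
  scitrera/dgx-spark-vllm:0.14.0-t5 \
  vllm serve Qwen/Qwen3-VL-32B-Instruct \
    --gpu-memory-utilization 0.90 --max-model-len 32768
```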
Key Observations:
AttributeError: I had to manually patch the tie_word_embeddings attribute error in /usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/qwen3*.py to get the model to load.
Performance: Logs show Avg prompt throughput: ~9.3 tokens/s and Avg generation throughput: ~3.5 tokens/s.
Fallback? Even with VLLM_USE_V1=1, it seems the system might be falling back to an unoptimized path for the Qwen3 architecture on SM 10.0.
Questions:
Does the current image include pre-compiled kernels for the Qwen3 architecture specifically targeted at SM 10.0?
Are there any specific environment variables or launch flags required to properly engage FlashInfer or CUDA Graphs for Qwen3 on this hardware?
Is Qwen3-VL's MRope operator supported in the optimized path of the V1 engine for Blackwell yet?
Any guidance or recommended launch parameters would be greatly appreciated!
I'll try to dig deeper into optimization later, but:
(1) I couldn't recreate the AttributeError problems. I thought it might have had to do with using the Transformers 5.x container (-t5), and I would have recommended the -t4 (Transformers 4.x) container instead, but it actually worked on both for me.
(2) You don't really need to set VLLM_USE_V1=1 anymore; the V0 engine is pretty much gone at this point.
(3) DGX Spark isn't so great at dense FP16. I tried FLASH_ATTN, TRITON_ATTN, and FLASHINFER, and all gave similar performance of ~3.5 tps. It's possible there are other optimizations that can be tweaked, and maybe I'll look into it further to see if I can get more out of it, but…
(4) Mixture-of-Experts models and quantized models do get better performance on the Spark. I tested Qwen/Qwen3-30B-A3B-Instruct-2507 after trying Qwen3-32B and get 25-30 tps (speed decreases with context size). In a way, that's about the same underlying performance: we reduced the number of active parameters by roughly 10x and got roughly a 10x increase in speed. You should leverage MoE and quantization to get the most out of the Spark.
(5) The compute architecture for the DGX Spark is SM 12.1. It also requires CUDA 13. You can browse around the forums, but basically the transition to Blackwell SM120/121 (SM 10.x is Blackwell for other chips) and the transition to CUDA 13 has been slow. A lot has improved over the past few months on that front, but there is still a ways to go.
Here was my basis for launching:
Qwen3-32B:
docker run --privileged --gpus all -it --rm --network host --ipc=host -v ~/.cache/huggingface:/root/.cache/huggingface scitrera/dgx-spark-vllm:0.14.0-t4 vllm serve Qwen/Qwen3-32B --gpu-memory-utilization 0.75
Qwen3-30B-A3B-Instruct-2507:
docker run --privileged --gpus all -it --rm --network host --ipc=host -v ~/.cache/huggingface:/root/.cache/huggingface scitrera/dgx-spark-vllm:0.14.0-t4 vllm serve Qwen/Qwen3-30B-A3B-Instruct-2507 --gpu-memory-utilization 0.75
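Once either server is up, vLLM exposes an OpenAI-compatible API (port 8000 by default), so a quick smoke test looks like this; the model name in the payload must match what you served:

```shell
curl -s http://localhost:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model": "Qwen/Qwen3-32B",
       "messages": [{"role": "user", "content": "Say hi"}],
       "max_tokens": 32}'
```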
Thank you for your previous guidance! I have an update regarding my testing on the DGX Spark (GB10, 128GB Unified Memory).
I have tested the Qwen3-30B-A3B-Instruct-2507 (MoE architecture with ~3B active parameters) using both the 0.14.0-t4 and 0.14.0-t5 images. Interestingly, the performance is identical on both versions, suggesting that the current bottleneck is likely not version-specific.
Performance Metrics (both t4/t5):
Avg prompt throughput: 9.3 tokens/s
Avg generation throughput: 7.0 tokens/s
While 7 tokens/s is an improvement over the dense 32B/8B models (which were at ~3.5 tokens/s), it still feels like we are not fully utilizing Blackwell's potential, especially since the active parameters for this MoE model are only around 3B. I was hoping to see the 20-30 tokens/s range mentioned in other Spark benchmarks.
Since the throughput is consistent between t4 and t5, could this be an issue with how the Qwen3 MoE operators (specifically MRope or GGUF-based kernels) are being dispatched on SM 10.0?
Are there any hidden environment variables (like forcing VLLM_USE_V1=1 or specific FLASHINFER flags) that are required to bypass the 7 tokens/s ceiling?
In your experience, should I be seeing higher "prompt throughput" (currently 9.3) as well?
I really appreciate the work you've put into these images and would love to hear any tips on how to unlock the "overdrive" mode for this GB10 system.
I successfully deployed Qwen3-Coder-30B-A3B-Instruct-FP8 on a DGX Spark (GB10) using the new vLLM pre-built images. My goal is "vibe coding" (heavy context/RAG usage with tool use).
Current Status: The setup is stable and responsive. I am seeing excellent prefix-caching performance, but the raw generation throughput seems lower than I expected for an "active 2.4B" (A3B) MoE architecture.
Prompt Throughput: ~16,000 tokens/s (Excellent - Prefix Caching is working perfectly)
Generation Throughput: ~30.8 tokens/s (Stable)
Given that this model has only ~2.4B active parameters, I was hoping for higher generation speeds (80-100+ t/s) on Blackwell hardware.
My Configuration: I am running a single-GPU stack with qwen3_coder parser enabled.
version: '3.8'
services:
  vllm-server:
    image: nvcr.io/nvidia/vllm:25.12.post1-py3
    container_name: vllm-Qwen3-Coder-30B-FP8
    restart: unless-stopped
    runtime: nvidia
    ports:
      - "YOUR_DESIRED_PORT:8000"  # Example: 10001:8000
    volumes:
      # Map your local model folder to /model inside the container
      - YOUR_LOCAL_PATH_TO_MODEL_FOLDER:/model
      # Map cache to persist downloads/compiler state
      - YOUR_LOCAL_CACHE_FOLDER:/root/.cache/vllm
    ipc: host
    privileged: true
    shm_size: 64g
    environment:
      # --- BLACKWELL / GB10 OPTIMIZATIONS ---
      # Critical for performance on DGX Spark
      - VLLM_ATTENTION_BACKEND=FLASHINFER
      - VLLM_USE_FLASHINFER_MOE_FP8=1
      - VLLM_FLASHINFER_MOE_BACKEND=latency
      # --------------------------------------
      - NCCL_P2P_DISABLE=0
      - NVIDIA_VISIBLE_DEVICES=all
      - NVIDIA_DRIVER_CAPABILITIES=all
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    command:
      - vllm
      - serve
      - /model
      - --dtype
      - auto
      - --tensor-parallel-size
      - "1"
      - --max-model-len
      - "131072"
      - --gpu-memory-utilization
      - "0.80"
      - --max-num-seqs
      - "64"
      - --enable-chunked-prefill
      - --enable-auto-tool-choice
      - --tool-call-parser
      - qwen3_coder  # Specific parser for Qwen Coder models
      - --served-model-name
      - qwen-coder  # Short alias
      - Qwen/Qwen3-Coder-30B-A3B-Instruct-FP8  # Full name for compatibility
    logging:
      driver: "json-file"
      options:
        max-size: "50m"
        max-file: "5"
    # networks:
    #   - YOUR_CUSTOM_NETWORK
An AI also converted it into a docker run command, but I have not tested that.
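For reference, an untested sketch of roughly the same configuration as the compose file as a single docker run (port and paths are the same placeholders as above; verify flags before relying on it):

```shell
docker run -d --name vllm-Qwen3-Coder-30B-FP8 --restart unless-stopped \
  --runtime nvidia --gpus all --privileged --ipc=host --shm-size 64g \
  -p YOUR_DESIRED_PORT:8000 \
  -v YOUR_LOCAL_PATH_TO_MODEL_FOLDER:/model \
  -v YOUR_LOCAL_CACHE_FOLDER:/root/.cache/vllm \
  -e VLLM_ATTENTION_BACKEND=FLASHINFER \
  -e VLLM_USE_FLASHINFER_MOE_FP8=1 \
  -e VLLM_FLASHINFER_MOE_BACKEND=latency \
  -e NCCL_P2P_DISABLE=0 \
  --log-driver json-file --log-opt max-size=50m --log-opt max-file=5 \
  nvcr.io/nvidia/vllm:25.12.post1-py3 \
  vllm serve /model --dtype auto --tensor-parallel-size 1 \
    --max-model-len 131072 --gpu-memory-utilization 0.80 --max-num-seqs 64 \
    --enable-chunked-prefill --enable-auto-tool-choice \
    --tool-call-parser qwen3_coder \
    --served-model-name qwen-coder Qwen/Qwen3-Coder-30B-A3B-Instruct-FP8
```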
Hey, that is cool. Could you provide the Dockerfile used to build that image too? I want to build an image for xTTS for the DGX Spark, and that needs torchaudio and AArch64 wheels for CUDA 13.
Coming soon. It's shocking how much time everything takes when doing 500x tasks and not leaving it 100% to Claude. (Leaving everything to Claude feels wonderful until you have to redo at least half the work…) Still trying to tweak my AI development process/workflow.
But you can build on top of this image: scitrera/dgx-spark-pytorch-dev:2.10.0-cu131
That's the PyTorch base, so it'll have the PyTorch and torchaudio parts handled for you. And it's a development image on top of nvidia/cuda:13.1.0-devel-ubuntu24.04, so all of the usual build tooling is already installed & ready.
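As a starting point, a hedged sketch of what building on that base might look like; the xTTS package name (coqui-tts) and its compatibility with this PyTorch/CUDA combination are assumptions, so substitute whatever install steps your project actually needs:

```shell
# Write a minimal Dockerfile on top of the DGX Spark PyTorch dev base
# (PyTorch + torchaudio preinstalled per the post above)
cat > Dockerfile <<'EOF'
FROM scitrera/dgx-spark-pytorch-dev:2.10.0-cu131
# Assumption: xTTS via the coqui-tts package; pin versions as needed
RUN pip install --no-cache-dir coqui-tts
EOF

docker build -t my-xtts-spark .
```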
First, I wanted to say massive thanks to everyone contributing their skills and time to the development of better inferencing builds for our Sparks; you guys deserve free NVIDIA shares :).
Do we have a rough guide on which vLLM images/builds are the fastest/best to use at the moment for AWQ and NVFP4 quants?
I'm super impressed by the speedup of the GitHub - christopherowen/spark-vllm-mxfp4-docker build for gpt-oss-120b; it would be amazing if we could get a similar performance boost with NVFP4 quants.