New pre-built vLLM Docker Images for NVIDIA DGX Spark

Quickly publishing initial vLLM Docker images optimized for NVIDIA DGX Spark (Blackwell-ready, NCCL + PyTorch rebuilt).

Available images (so far):

  • scitrera/dgx-spark-vllm:0.13.0-t4 — vLLM 0.13.0, PyTorch 2.9.1, CUDA 13.0.2, Transformers 4.57.5, Triton 3.5.1, NCCL 2.28.9-1
  • scitrera/dgx-spark-vllm:0.14.0rc2-t4 — vLLM 0.14.0rc2, PyTorch 2.10.0-rc6, CUDA 13.1.0, Transformers 4.57.5, Triton 3.5.1, NCCL 2.28.9-1
  • scitrera/dgx-spark-vllm:0.14.0rc2-t5 — vLLM 0.14.0rc2, PyTorch 2.10.0-rc6, CUDA 13.1.0, Transformers 5.0.0rc3, Triton 3.5.1, NCCL 2.28.9-1

All of the images include Ray for multi-node / cluster deployments. Additional Transformers 5 variants will be added soon to enable use with GLM-4.6V.


Example usage

docker run \
  --privileged \
  --gpus all \
  -it --rm \
  --network host --ipc=host \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  scitrera/dgx-spark-vllm:0.13.0-t4 \
  vllm serve \
    Qwen/Qwen3-1.7B \
    --gpu-memory-utilization 0.7
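
As a rough illustration of what `--gpu-memory-utilization 0.7` implies on a 128 GB unified-memory Spark, here is a back-of-envelope sketch. All figures are my assumptions (Qwen3-1.7B in bf16 at 2 bytes/param), and vLLM's actual memory accounting is more involved than this:

```python
# Back-of-envelope memory budget for the example above (all figures assumed).
total_mem_gb = 128          # DGX Spark unified memory
gpu_mem_util = 0.7          # --gpu-memory-utilization flag
params_b = 1.7              # Qwen3-1.7B, assumed bf16 (2 bytes/param)

budget_gb = total_mem_gb * gpu_mem_util       # what vLLM may claim
weights_gb = params_b * 2                     # bf16 weights, roughly
kv_cache_gb = budget_gb - weights_gb          # rough KV-cache headroom

print(f"budget ~{budget_gb:.1f} GB, weights ~{weights_gb:.1f} GB, "
      f"KV cache ~{kv_cache_gb:.1f} GB")
```

In other words, a small model leaves most of the claimed budget for KV cache, which is why a conservative utilization value still works fine here.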

Tag semantics

  • -t4 → Transformers 4.x

    • Example: 0.13.0-t4 = vLLM 0.13.0 + Transformers 4.57.5
  • -t5 → Transformers 5.x (pre-release)
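
The naming convention above is regular enough to parse mechanically; here is a small sketch (the `parse_tag` helper is my own, not part of the images, and it only handles the `<vllm>-tN` tags described here):

```python
def parse_tag(tag: str) -> tuple[str, int]:
    """Split a scitrera/dgx-spark-vllm tag into (vllm_version, transformers_major).

    Follows the convention described above: '<vllm>-t4' -> Transformers 4.x,
    '<vllm>-t5' -> Transformers 5.x (pre-release).
    """
    vllm_version, _, t_suffix = tag.rpartition("-t")
    return vllm_version, int(t_suffix)

print(parse_tag("0.13.0-t4"))      # ('0.13.0', 4)
print(parse_tag("0.14.0rc2-t5"))   # ('0.14.0rc2', 5)
```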


Inspecting package versions

Major component versions are embedded as Docker labels:

docker inspect scitrera/dgx-spark-vllm:0.14.0rc2-t4 \
  --format '{{json .Config.Labels}}' | jq

Example output:

{
  "dev.scitrera.cuda_version": "13.1.0",
  "dev.scitrera.flashinfer_version": "0.6.1",
  "dev.scitrera.nccl_version": "2.28.9-1",
  "dev.scitrera.torch_version": "2.10.0-rc6",
  "dev.scitrera.transformers_version": "4.57.5",
  "dev.scitrera.triton_version": "3.5.1",
  "dev.scitrera.vllm_version": "0.14.0rc2"
}
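
For scripting against these labels, the JSON is easy to post-process; a sketch using the sample output above (hardcoded here for illustration, where normally you would pipe it from `docker inspect`):

```python
import json

# Sample labels as shown above (normally: docker inspect ... | jq)
labels_json = """{
  "dev.scitrera.cuda_version": "13.1.0",
  "dev.scitrera.flashinfer_version": "0.6.1",
  "dev.scitrera.nccl_version": "2.28.9-1",
  "dev.scitrera.torch_version": "2.10.0-rc6",
  "dev.scitrera.transformers_version": "4.57.5",
  "dev.scitrera.triton_version": "3.5.1",
  "dev.scitrera.vllm_version": "0.14.0rc2"
}"""

PREFIX = "dev.scitrera."
versions = {k.removeprefix(PREFIX): v for k, v in json.loads(labels_json).items()}
print(versions["vllm_version"])   # 0.14.0rc2
print(versions["torch_version"])  # 2.10.0-rc6
```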

Notes

  • Updated NCCL (newer than the version bundled with stock PyTorch 2.9.1)
  • PyTorch, Triton, and vLLM are rebuilt accordingly
  • These images are early / experimental

For faster iteration on vLLM, I’d recommend @eugr’s repo:
šŸ‘‰ https://github.com/eugr/spark-vllm-docker.

Long-term maintenance, support, and feedback plans are still TBD.



Thank you for the information. Just want to be clear, are these images created by you and/or the community?

Yes. So far, the containers are just fresh builds (that I made) of curated version mixes of the official open-source packages. I've gotten a bit bogged down with other work, but I will release a GitHub repo with the build scripts & Dockerfiles, including ā€œrecipeā€ files that coordinate all of the versions.

I'm planning to keep using my (currently) 4x Spark cluster, so I plan to maintain these builds for a while, since they help me try out the latest models. I basically want to enforce stricter versioning than always using nightly builds (and, e.g., the latest vLLM nightlies don't necessarily vary the PyTorch build). But I can also accept more risk than NVIDIA does with its monthly container releases. So I see this as a community stopgap/intermediate: more bleeding-edge than the official NVIDIA containers, but somewhat more curated than a random mixture of nightly builds.

Somewhat of an aside, but I’m also planning to release (open source either BSD or Apache2 license) something that’s like a mashup of the nvidia sync, the DGX dashboard, and the playbooks to try to make it quick & easy for people to get started and to manage CX7 networking / spark clusters.


I definitely need to try out these docker images when I get time, since I mostly use llama-server.

I think vLLM had its v0.14 release today (not -rc anymore)

I'm planning to release a vLLM 0.14.0 version, maybe tomorrow? I'm waiting for the official release of the next PyTorch version.

  • scitrera/dgx-spark-pytorch-dev:2.10.0-cu131 — PyTorch 2.10.0 (with corresponding torchvision and torchaudio), CUDA 13.1.0, NCCL 2.29.2-1

  • scitrera/dgx-spark-vllm:0.14.0-t4 — vLLM 0.14.0ish (git 9b693d0), PyTorch 2.10.0, CUDA 13.1.0, Transformers 4.57.6, Triton 3.5.1, NCCL 2.29.2-1

  • scitrera/dgx-spark-vllm:0.14.0-t5 — vLLM 0.14.0ish (git 63227ac), PyTorch 2.10.0, CUDA 13.1.0, Transformers 5.0.0-rc? (git 0dfb28e1), Triton 3.5.1, NCCL 2.29.2-1; Patch for is_deepseek_mla() included for the benefit of GLM-4.7-Flash.

FYI. Tested GLM-4.7-Flash using ray and -tp4 on 4 spark cluster using scitrera/dgx-spark-vllm:0.14.0-t5 image.


Brilliant, I hate to burden you with silly questions, but I wonder if you have a quick step-by-step guide to getting GLM-4.7-Flash in particular running using these new images? I also heard there was a fix required for that model so that it wouldn't create a 180 GB KV cache. Is the fixed version the one that works with this new vLLM image, or the one with the KV cache misconfiguration?


That fix/workaround is not included, so there's a decision to make. I haven't tested it yet, and it might work fine on Spark, but that fix/adjustment does not currently work properly on B200, which is why it hasn't been merged into vLLM yet. I guess as long as it works on DGX Spark, then these Spark-specific images should carry the patch.

I've updated the -t5 image to include the patch, since it seems to be working properly when tested on my Sparks. Here is the branch of my vLLM fork for completeness: GitHub - scitrera/vllm at v0.14.0+glm4-moe-lite-mla

And regarding a guide… I can convert my commands, etc. into a ā€œguideā€ via ChatGPT, but what's your setup? 1x Spark? 2x Sparks? If 2x or more, do you already have them connected to each other and set up?

Thank you once more, and I would really appreciate even a ChatGPT-generated guide to get this running.

Maybe you could make a guide for 1 Spark, and, from your testing, what tokens per second and memory use do you see running GLM-4.7-Flash with the patch at NVFP4 with context size set to about 100k? If you have two Sparks, it would also be interesting to understand how this scales across two Sparks.

I appreciate this is three tasks; the guide is the most useful, of course, because right now many of us can't run this at all.

Thank you!

Here is a link to a ā€œguideā€ (draft/first pass): [Request] GLM-4.7-Flash AWQ/NVFP4 Instructions

I'll get back to you on the other parts; unfortunately I only have 4x Sparks, which, while that sounds like a lot, isn't if you're using them for work…

First of all, thank you for providing these dedicated vLLM images for DGX Spark. I am currently testing the scitrera/dgx-spark-vllm:0.14.0-t5 image on a DGX Spark GB10 (128GB), but I’ve encountered significant performance issues that I hope you can help with.

System Specs:

  • Hardware: NVIDIA DGX Spark (Grace-Blackwell GB10), 128GB Unified Memory.

  • Model: Qwen/Qwen3-VL-32B-Instruct and Qwen/Qwen3-32B.

  • Image: scitrera/dgx-spark-vllm:0.14.0-t5.

The Issue:
I am seeing extremely low throughput. Whether using the VL or the pure text version of Qwen3-32B, the generation speed is stuck at ~3 tokens/s. For a Blackwell-based system, I was expecting significantly higher performance.

Configuration Used:

  • VLLM_USE_V1=1

  • VLLM_V1_ENABLED=1

  • VLLM_ATTENTION_BACKEND=FLASHINFER

  • --gpu-memory-utilization 0.90

  • --max-model-len 32768 (also tried 8192)

Key Observations:

  1. AttributeError: I had to manually patch the tie_word_embeddings attribute error in /usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/qwen3*.py to get the model to load.

  2. Performance: Logs show Avg prompt throughput: ~9.3 tokens/s and Avg generation throughput: ~3.5 tokens/s.

  3. Fallback? Even with VLLM_USE_V1=1, it seems the system might be falling back to an unoptimized path for the Qwen3 architecture on SM 10.0.

Questions:

  1. Does the current image include pre-compiled kernels for the Qwen3 architecture specifically targeted at SM 10.0?

  2. Are there any specific environment variables or launch flags required to properly engage FlashInfer or CUDA Graphs for Qwen3 on this hardware?

  3. Is Qwen3-VL’s MRope operator supported in the optimized path of the V1 engine for Blackwell yet?

Any guidance or recommended launch parameters would be greatly appreciated!

I’ll try to dig deeper later on optimization, but:

(1) I couldn't recreate the AttributeError problems. I thought it might have had to do with using the Transformers 5.x container (-t5), and I would've recommended you use the -t4 (Transformers 4.x) container instead, but it actually worked on both for me.

(2) You don't really need to set VLLM_USE_V1=1 anymore; the V0 engine is pretty much gone at this point.

(3) DGX Spark isn't great at dense FP16. I tried FLASH_ATTN, TRITON_ATTN, and FLASHINFER, and all gave similar performance of ~3.5 tps. It's possible there are other optimizations that can be tweaked, and maybe I'll look into it further to see if I can get more out of it, but…

(4) Mixture-of-Experts models and quantized models do get better performance on the Spark. I tested Qwen/Qwen3-30B-A3B-Instruct-2507 after trying Qwen3-32B and get 25-30 tps (speed decreases with context size). In a way, that's proportionally the same performance: we reduced the number of active parameters by ~10x and got a ~10x increase in speed. You should leverage MoE and quantization to get the most out of the Spark.

(5) The compute architecture for the DGX Spark is SM 12.1, and it requires CUDA 13. You can browse around the forums, but basically the transitions to Blackwell SM120/121 (SM 10.x is Blackwell for other chips) and to CUDA 13 have been slow. A lot has improved over the past few months on that front, but there is still a ways to go.
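
Point (4) is essentially memory-bandwidth arithmetic: single-stream decode is bandwidth-bound, so tokens/s scales with how many weight bytes must be read per generated token. A back-of-envelope sketch (the ~273 GB/s figure is NVIDIA's quoted DGX Spark memory bandwidth; these are idealized ceilings, not benchmarks):

```python
def decode_tps_upper_bound(active_params_b: float, bytes_per_param: float,
                           bandwidth_gbps: float = 273.0) -> float:
    """Idealized decode tokens/s ceiling: bandwidth / weight bytes read per token."""
    return bandwidth_gbps / (active_params_b * bytes_per_param)

# Dense Qwen3-32B in bf16: all 32B params read for every token
print(f"dense 32B bf16: ~{decode_tps_upper_bound(32, 2):.1f} tps ceiling")
# Qwen3-30B-A3B MoE: only ~3B active params read per token
print(f"MoE ~3B active: ~{decode_tps_upper_bound(3, 2):.1f} tps ceiling")
```

The ceilings (~4.3 tps dense, ~45 tps MoE) line up reasonably well with the observed ~3.5 and ~25-30 tps, which supports the bandwidth-bound explanation.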

Here was my basis for launching:

Qwen3 32B

docker run \
  --privileged \
  --gpus all \
  -it --rm \
  --network host --ipc=host \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  scitrera/dgx-spark-vllm:0.14.0-t4 \
  vllm serve \
    Qwen/Qwen3-32B \
    --gpu-memory-utilization 0.75

Qwen3 30B-A3B-Instruct-2507

docker run \
  --privileged \
  --gpus all \
  -it --rm \
  --network host --ipc=host \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  scitrera/dgx-spark-vllm:0.14.0-t4 \
  vllm serve \
    Qwen/Qwen3-30B-A3B-Instruct-2507 \
    --gpu-memory-utilization 0.75

Thank you for your previous guidance! I have an update regarding my testing on the DGX Spark (GB10, 128GB Unified Memory).

I have tested the Qwen3-30B-A3B-Instruct-2507 (MoE architecture with ~3B active parameters) using both the 0.14.0-t4 and 0.14.0-t5 images. Interestingly, the performance is identical on both versions, suggesting that the current bottleneck is likely not version-specific.

Performance Metrics (both t4/t5):

  • Avg prompt throughput: 9.3 tokens/s

  • Avg generation throughput: 7.0 tokens/s

While 7 tokens/s is an improvement over the Dense 32B/8B models (which were at ~3.5 tokens/s), it still feels like we are not fully utilizing the Blackwell’s potential, especially since the active parameters for this MoE model are only around 3B. I was hoping to see the 20-30 tokens/s range mentioned in other Spark benchmarks.

My Setup & Command:

docker run --privileged --gpus all -it --rm --network host --ipc=host \
  -v ~/ai-stack/llm/llm_server/model_cache:/models \
  scitrera/dgx-spark-vllm:0.14.0-t5 \
  vllm serve /models/Qwen/Qwen3-30B-A3B-Instruct-2507 --gpu-memory-utilization 0.75

Observations & Questions:

  1. Since the throughput is consistent between t4 and t5, could this be an issue with how the Qwen3 MoE operators (specifically MRope or GGUF-based kernels) are being dispatched on SM 10.0?

  2. Are there any hidden environment variables (like forcing VLLM_USE_V1=1 or specific FLASHINFER flags) that are required to bypass the 7 tokens/s ceiling?

  3. In your experience, should I be seeing higher ā€œPrompt Throughputā€ (currently 9.3) as well?

I really appreciate the work you’ve put into these images and would love to hear any tips on how to unlock the ā€œoverdriveā€ mode for this GB10 system.

Thank you!

Hi everyone,

I successfully deployed Qwen3-Coder-30B-A3B-Instruct-FP8 on a DGX Spark (GB10) using the new vLLM pre-built images. My goal is ā€œVibe Codingā€ (heavy context/RAG usage with Tool Use).

Current Status: The setup is stable and responsive. I am seeing excellent Prefix Caching performance, but the raw generation throughput seems lower than I expected for an ā€œActive 2.4Bā€ (A3B) MoE architecture.

  • Prompt Throughput: ~16,000 tokens/s (Excellent - Prefix Caching is working perfectly)
  • Generation Throughput: ~30.8 tokens/s (Stable)

Given that this model has only ~2.4B active parameters, I was hoping for higher generation speeds (80-100+ t/s) on Blackwell hardware.

My Configuration: I am running a single-GPU stack with qwen3_coder parser enabled.

version: '3.8'

services:
  vllm-server:
    image: nvcr.io/nvidia/vllm:25.12.post1-py3
    container_name: vllm-Qwen3-Coder-30B-FP8
    restart: unless-stopped
    runtime: nvidia
    ports:
      - "YOUR_DESIRED_PORT:8000"  # Example: 10001:8000
    volumes:
      # Map your local model folder to /model inside container
      - YOUR_LOCAL_PATH_TO_MODEL_FOLDER:/model
      # Map cache to persist downloads/compiler states
      - YOUR_LOCAL_CACHE_FOLDER:/root/.cache/vllm
    ipc: host
    privileged: true
    shm_size: 64g
    environment:
      # --- BLACKWELL / GB10 OPTIMIZATIONS ---
      # Critical for performance on DGX Spark
      - VLLM_ATTENTION_BACKEND=FLASHINFER
      - VLLM_USE_FLASHINFER_MOE_FP8=1
      - VLLM_FLASHINFER_MOE_BACKEND=latency
      # --------------------------------------
      - NCCL_P2P_DISABLE=0
      - NVIDIA_VISIBLE_DEVICES=all
      - NVIDIA_DRIVER_CAPABILITIES=all

    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]

    command: 
      - vllm
      - serve
      - /model
      - --dtype
      - auto
      - --tensor-parallel-size
      - "1"
      - --max-model-len
      - "131072"
      - --gpu-memory-utilization
      - "0.80"
      - --max-num-seqs
      - "64"
      - --enable-chunked-prefill
      - --enable-auto-tool-choice
      - --tool-call-parser
      - qwen3_coder             # Specific parser for Qwen Coder models
      - --served-model-name
      - qwen-coder              # Short alias
      - Qwen/Qwen3-Coder-30B-A3B-Instruct-FP8  # Full name for compatibility
      
    logging:
      driver: "json-file"
      options:
        max-size: "50m"
        max-file: "5"
        
    # networks:
    #   - YOUR_CUSTOM_NETWORK

Or here is the same configuration converted by AI into a docker run command (I have not tested this form):

docker run -d \
  --name vllm-Qwen3-Coder-30B-FP8 \
  --restart unless-stopped \
  --runtime nvidia \
  --gpus all \
  --ipc=host \
  --privileged \
  --shm-size 64g \
  -p YOUR_DESIRED_PORT:8000 \
  -v YOUR_LOCAL_PATH_TO_MODEL_FOLDER:/model \
  -v YOUR_LOCAL_CACHE_FOLDER:/root/.cache/vllm \
  -e VLLM_ATTENTION_BACKEND=FLASHINFER \
  -e VLLM_USE_FLASHINFER_MOE_FP8=1 \
  -e VLLM_FLASHINFER_MOE_BACKEND=latency \
  -e NCCL_P2P_DISABLE=0 \
  -e NVIDIA_VISIBLE_DEVICES=all \
  -e NVIDIA_DRIVER_CAPABILITIES=all \
  --log-driver json-file \
  --log-opt max-size=50m \
  --log-opt max-file=5 \
  nvcr.io/nvidia/vllm:25.12.post1-py3 \
  vllm serve /model \
  --dtype auto \
  --tensor-parallel-size 1 \
  --max-model-len 131072 \
  --gpu-memory-utilization 0.80 \
  --max-num-seqs 64 \
  --enable-chunked-prefill \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --served-model-name qwen-coder \
  Qwen/Qwen3-Coder-30B-A3B-Instruct-FP8

Latest images to coincide with the vLLM 0.14.1 release:

  • scitrera/dgx-spark-vllm:0.14.1-t4 — vLLM 0.14.1, PyTorch 2.10.0 (includes torchvision and torchaudio), CUDA 13.1.0, Transformers 4.57.6, Triton 3.5.1 (3.6.0 not yet compatible), NCCL 2.29.2-1, FlashInfer 0.6.2

  • scitrera/dgx-spark-vllm:0.14.1-t5 — vLLM 0.14.1, PyTorch 2.10.0 (includes torchvision and torchaudio), CUDA 13.1.0, Transformers 5.0.0, Triton 3.5.1 (3.6.0 not yet compatible), NCCL 2.29.2-1, FlashInfer 0.6.2


Notice: Updated scitrera/dgx-spark-vllm:0.14.1-t5 to include Transformers v5 release version


Hey, that is cool. Could you provide the Dockerfile that builds that image too? I want to build an image for xTTS for the DGX Spark, and that needs torchaudio and AArch64 wheels for CUDA 13.

Coming soon. It's shocking how much time everything takes when juggling 500 tasks and not leaving things 100% to Claude. (Leaving everything to Claude feels wonderful until you have to redo at least half the work…) Still trying to tweak my AI development process/workflow.

But you can build on top of this image: scitrera/dgx-spark-pytorch-dev:2.10.0-cu131

That's the PyTorch base image, so it'll have the PyTorch and torchaudio parts handled for you. And it's a development image on top of nvidia/cuda:13.1.0-devel-ubuntu24.04, so all of the usual build tooling is already installed & ready.


First, I wanted to say massive thanks to everyone contributing their skills and time to the development of better inference builds on our Sparks; you guys deserve free NVIDIA shares :).
Do we have a rough guide on which vLLM images/builds are the fastest/best to use at the moment for AWQ and NVFP4 quants?
I'm super impressed by the speedup of GitHub - christopherowen/spark-vllm-mxfp4-docker for gpt-oss-120b; it would be amazing if we could get a similar performance boost with NVFP4 quants.

You can use my Docker build for that. It includes torchaudio: GitHub - eugr/spark-vllm-docker: Docker configuration for running VLLM on dual DGX Sparks

1 Like