Some new development work for Qwen3 on the Spark

https://github.com/seli-equinix/docker-swarm-stacks/tree/main/asus-dgx-spark/vllm-docker

I came across this. The developer is interested in feedback. He says:

I’ve got vLLM fully working on the NVIDIA DGX Spark (latest OS update) with the new GB10 Blackwell GPU (SM121). I have it running v0.14.0rc1 code as of this morning. It is a complete implementation with all the SM121-specific fixes needed to run on this hardware.
🎯 What’s Working
  • Full 256K context length (the model’s max capacity)

  • ~45 tok/s on Qwen3-Next-80B-A3B-FP8

  • FP8 quantization with the Triton MoE backend

  • Blackwell-class detection (is_blackwell_class() for SM10x/SM11x/SM12x)

  • Proper backend fallbacks (TRTLLM → CUTLASS → Triton)

🔧 Key Technical Changes
The GB10 is SM121 (major=12), different from B100/B200 which are SM100/SM103 (major=10). This required:

  1. New is_blackwell_class() method - Unified detection for all Blackwell variants

  2. TRITON_ATTN backend - FlashInfer TRTLLM doesn’t support SM121 yet

  3. Correct backend gating - TRTLLM/CUTLASS MLA restricted to SM100 only

  4. KV cache layout fix - HND layout for SM121 like SM100
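
If you want to confirm which Blackwell variant your box reports before building, the compute capability can be queried straight from the driver (assuming a driver recent enough to expose the compute_cap field):

# GB10 should report 12.1; B100/B200 report 10.x
nvidia-smi --query-gpu=name,compute_cap --format=csv,noheader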

📦 Pre-built Docker Image (Easiest)

# Pull the image
docker pull hellohal2064/vllm-dgx-spark-gb10:latest

# Run
docker run -d --name vllm-server --gpus all -p 8000:8000 \
  -v /path/to/models:/models:ro \
  -e MODEL_PATH=/models/Qwen3-Next-80B-A3B-FP8 \
  -e ATTENTION_BACKEND=TRITON_ATTN \
  -e MAX_MODEL_LEN=262144 \
  -e GPU_MEMORY_UTIL=0.85 \
  hellohal2064/vllm-dgx-spark-gb10:latest
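
Once the container is up (docker logs -f vllm-server should show the server listening), a quick smoke test against vLLM’s OpenAI-compatible API; the model name should match whatever /v1/models returns, which by default is the MODEL_PATH value:

# List the served model (doubles as a health check)
curl http://localhost:8000/v1/models

# Short chat completion round-trip
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "/models/Qwen3-Next-80B-A3B-FP8",
    "messages": [{"role": "user", "content": "Say hello."}],
    "max_tokens": 32
  }'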

🛠️ Build From Source

# Clone Docker setup
git clone https://github.com/seli-equinix/docker-swarm-stacks.git
cd docker-swarm-stacks/asus-dgx-spark/vllm-docker

# Clone vLLM with SM121 support
git clone https://github.com/seli-equinix/vllm.git
cd vllm && git checkout feature/sm121-gb10-support && cd ..

# Build
docker build -t vllm-gb10:latest .
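
Assuming the locally built image’s entrypoint honors the same environment variables as the pre-built one (it is built from the same setup), running it only changes the tag:

# Run the locally built image
docker run -d --name vllm-server --gpus all -p 8000:8000 \
  -v /path/to/models:/models:ro \
  -e MODEL_PATH=/models/Qwen3-Next-80B-A3B-FP8 \
  -e ATTENTION_BACKEND=TRITON_ATTN \
  -e MAX_MODEL_LEN=262144 \
  -e GPU_MEMORY_UTIL=0.85 \
  vllm-gb10:latest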

⚙️ Environment Variables
Variable            Description                     Recommended
MODEL_PATH          Model path in container         /models/YourModel
ATTENTION_BACKEND   Must be TRITON_ATTN for GB10    TRITON_ATTN
MAX_MODEL_LEN       Context length (up to 256K)     262144
GPU_MEMORY_UTIL     GPU memory fraction             0.85
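
To check that MAX_MODEL_LEN actually took effect, recent vLLM versions report it in the model card; a quick sanity check, assuming the server from above is running:

# Should print the configured context length, e.g. 262144
curl -s http://localhost:8000/v1/models | python3 -m json.tool | grep max_model_len
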
📊 Hardware Specs (DGX Spark)
Component            Spec
GPU                  NVIDIA GB10 (SM121 Blackwell)
Compute Capability   12.1
Memory               128GB unified (CPU+GPU shared)
CPU                  ARM64 NVIDIA Grace (20 cores)
CUDA                 13.1+ required
🔗 Links
Docker Image: hellohal2064/vllm-dgx-spark-gb10:latest
Docker Setup & README: https://github.com/seli-equinix/docker-swarm-stacks/tree/main/asus-dgx-spark/vllm-docker
vLLM Fork (SM121 branch): https://github.com/seli-equinix/vllm/tree/feature/sm121-gb10-support
Upstream PR #31740: https://github.com/vllm-project/vllm/pull/31740
⚠️ Known Limitations

  • FlashInfer TRTLLM attention not supported on SM121 (uses Triton)

  • MoE configs not tuned for GB10 yet (works with defaults)

  • DeepGEMM not supported on SM121

🙏 Looking For

  • Testers with DGX Spark hardware - Please try it and report issues!

  • Review on PR #31740 - Would appreciate maintainer feedback

  • MoE tuning help - Anyone interested in generating GB10-optimized configs?

Happy to answer questions. I am working on building the MoE config for the GB10/SM121 Spark.
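
For anyone interested in the MoE tuning, vLLM ships a fused-MoE kernel tuning script in its source tree that generates the per-GPU config JSONs; the exact path and flags vary across versions, so treat this as a sketch rather than a recipe:

# From a vLLM source checkout, run on the Spark itself;
# writes a tuned fused-MoE config JSON for the local GPU
python benchmarks/kernels/benchmark_moe.py \
  --model /models/Qwen3-Next-80B-A3B-FP8 \
  --dtype fp8_w8a8 \
  --tune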


404 - the repo links return a 404 for me.

v0.14.0rc1 or the 0.15 release candidate? The 0.14 release came out a few weeks ago.

Looks promising though! Did you try Qwen3 Coder 30B by any chance?

Sounds very interesting, but it seems that repo is not available anymore.

I think it is this one:

https://github.com/seli-equinix/vllm/tree/feature/sm121-gb10-support

and

https://github.com/seli-equinix/docker-swarm-stacks/tree/main/asus-dgx-spark/vllm-docker

Hello - I am the one working on getting this into the main vLLM build. I updated it yesterday; it should all be there. feat: Add SM121/GB10 (DGX Spark) Blackwell-class GPU support by seli-equinix · Pull Request #31740 · vllm-project/vllm · GitHub - feature/sm121-gb10-support is the correct fork. I just pushed a new container to Docker Hub; it is running the latest 0.16 vLLM, and I have gotten it to load the Qwen models much faster. I now have it running with YaRN and a functioning 1M context window. I also updated all the docs on Docker Hub.
