https://github.com/seli-equinix/docker-swarm-stacks/tree/main/asus-dgx-spark/vllm-docker
I came across this. The developer is interested in feedback. He says:
I've got vLLM fully working on the NVIDIA DGX Spark (latest OS update) with the new GB10 Blackwell GPU (SM121). I have it running the v0.14.0rc1 code as of this morning. It is a complete implementation with all the SM121-specific fixes needed to run on this hardware.
🎯 What’s Working
✅ Full 256K context length (model’s max capacity)
✅ ~45 tok/s on Qwen3-Next-80B-A3B-FP8
✅ FP8 quantization with Triton MoE backend
✅ Blackwell-class detection (is_blackwell_class() for SM10x/SM11x/SM12x)
✅ Proper backend fallbacks (TRTLLM → CUTLASS → Triton)
🔧 Key Technical Changes
The GB10 is SM121 (major=12), different from B100/B200 which are SM100/SM103 (major=10). This required:
- New `is_blackwell_class()` method - unified detection for all Blackwell variants
- `TRITON_ATTN` backend - FlashInfer TRTLLM doesn't support SM121 yet
- Correct backend gating - TRTLLM/CUTLASS MLA restricted to SM100 only
- KV cache layout fix - HND layout for SM121, matching SM100
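The detection and gating described above can be sketched roughly as follows. This is a minimal illustration based on the post's description, not the actual vLLM code; the function signature and capability tuples are assumptions:

```python
# Sketch of Blackwell-class detection and the attention-backend gating
# described in the post. Illustrative only -- not the vLLM implementation.

def is_blackwell_class(compute_capability):
    """True for any Blackwell variant: SM10x (B100/B200), SM11x, SM12x (GB10)."""
    major, _minor = compute_capability
    return major in (10, 11, 12)

def select_attention_backend(compute_capability):
    """TRTLLM/CUTLASS MLA are restricted to SM100-class (major == 10);
    SM121 (GB10) falls through to the Triton attention backend."""
    major, _minor = compute_capability
    if major == 10:
        return "TRTLLM"          # SM100/SM103: FlashInfer TRTLLM path
    return "TRITON_ATTN"         # SM121 (GB10): TRTLLM not supported yet

print(select_attention_backend((12, 1)))  # GB10 -> TRITON_ATTN
```

In this sketch the fallback chain is collapsed to its endpoints; the real code also tries CUTLASS between TRTLLM and Triton.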
📦 Pre-built Docker Image (Easiest)
```shell
# Pull the image
docker pull hellohal2064/vllm-dgx-spark-gb10:latest

# Run
docker run -d --name vllm-server --gpus all -p 8000:8000 \
  -v /path/to/models:/models:ro \
  -e MODEL_PATH=/models/Qwen3-Next-80B-A3B-FP8 \
  -e ATTENTION_BACKEND=TRITON_ATTN \
  -e MAX_MODEL_LEN=262144 \
  -e GPU_MEMORY_UTIL=0.85 \
  hellohal2064/vllm-dgx-spark-gb10:latest
```
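Once the container is up, vLLM serves an OpenAI-compatible API on port 8000, so a quick smoke test might look like the following. The model name in the request body is an assumption — use whatever `/v1/models` reports for your container:

```shell
# List the served model(s) on vLLM's OpenAI-compatible endpoint
curl http://localhost:8000/v1/models

# Send a small test request. The "model" value below is an assumption --
# substitute the id returned by /v1/models.
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "/models/Qwen3-Next-80B-A3B-FP8",
        "messages": [{"role": "user", "content": "Say hello."}],
        "max_tokens": 32
      }'
```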
🛠️ Build From Source
```shell
# Clone Docker setup
git clone https://github.com/seli-equinix/docker-swarm-stacks.git
cd docker-swarm-stacks/asus-dgx-spark/vllm-docker

# Clone vLLM with SM121 support
git clone https://github.com/seli-equinix/vllm.git
cd vllm && git checkout feature/sm121-gb10-support && cd ..

# Build
docker build -t vllm-gb10:latest .
```
⚙️ Environment Variables
| Variable | Description | Recommended |
|---|---|---|
| MODEL_PATH | Model path in container | /models/YourModel |
| ATTENTION_BACKEND | Must be TRITON_ATTN for GB10 | TRITON_ATTN |
| MAX_MODEL_LEN | Context length (up to 256K) | 262144 |
| GPU_MEMORY_UTIL | GPU memory fraction | 0.85 |
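Because the 128 GB is a unified pool shared with the CPU, GPU_MEMORY_UTIL caps how much of it vLLM will claim. A back-of-envelope budget, where the ~80 GB weight figure is an illustrative assumption (roughly 1 byte per parameter for an 80B model in FP8, ignoring overheads):

```python
# Rough memory budget for GPU_MEMORY_UTIL=0.85 on the 128 GB unified pool.
total_gb = 128
gpu_mem_util = 0.85

vllm_budget_gb = total_gb * gpu_mem_util       # what vLLM may allocate
weights_gb = 80                                # ~80B params at FP8 (assumption)
kv_and_overhead_gb = vllm_budget_gb - weights_gb

print(f"vLLM budget: {vllm_budget_gb:.1f} GB")                       # 108.8 GB
print(f"Left for KV cache + activations: {kv_and_overhead_gb:.1f} GB")  # 28.8 GB
```

The remaining headroom below 1.0 leaves memory for the OS and the Grace CPU side of the shared pool.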
📊 Hardware Specs (DGX Spark)
| Component | Spec |
|---|---|
| GPU | NVIDIA GB10 (SM121 Blackwell) |
| Compute Capability | 12.1 |
| Memory | 128GB unified (CPU+GPU shared) |
| CPU | ARM64 NVIDIA Grace (20 cores) |
| CUDA | 13.1+ required |
🔗 Links
- Docker Image: hellohal2064/vllm-dgx-spark-gb10:latest
- Docker Setup & README: https://github.com/seli-equinix/docker-swarm-stacks/tree/main/asus-dgx-spark/vllm-docker
- vLLM Fork (SM121 branch): https://github.com/seli-equinix/vllm/tree/feature/sm121-gb10-support
- Upstream PR #31740: https://github.com/vllm-project/vllm/pull/31740
⚠️ Known Limitations
- FlashInfer TRTLLM attention not supported on SM121 (uses Triton)
- MoE configs not tuned for GB10 yet (works with defaults)
- DeepGEMM not supported on SM121
🙏 Looking For
- Testers with DGX Spark hardware - please try it and report issues!
- Review on PR #31740 - would appreciate maintainer feedback
- MoE tuning help - anyone interested in generating GB10-optimized configs?
Happy to answer questions. I'm working on building the MoE config for the GB10/SM121 Spark.