Absolutely, here’s the full build recipe.
The image is built in two stages: compile FlashInfer for SM121 (~2-3 hours, the painful part), then install vLLM + patches on top (~15 min).
Stage 1: Base image (FlashInfer + vLLM + SM121 patches)
Clone JungkwanBan’s spark_vllm_docker — this handles the FlashInfer compilation and core SM121 fixes:

    git clone https://github.com/JungkwanBan/spark_vllm_docker.git
    cd spark_vllm_docker
    docker buildx build \
      --build-arg VLLM_VERSION="vllm==0.17.0rc1.dev216+ga3189a08b.cu130" \
      --build-arg FLASHINFER_REF=v0.6.1 \
      --build-arg CUTLASS_REF=main \
      --build-arg TRANSFORMERS_VER=5.2.0 \
      --build-arg BUILD_JOBS=16 \
      -t vllm-spark:base \
      --load .

That gives you vLLM 0.17.0rc1 with FlashInfer compiled for SM121, plus JungkwanBan’s patches (Blackwell class detection, TRITON_PTXAS_PATH, nogds force, AOT cache fix, fastsafetensors sort fix, MoE configs for E=256/E=512).
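
If you want to sanity-check the base image before layering anything on top, one option is to confirm that PyTorch inside the container sees the GB10 as compute capability 12.1 (SM121). This is only a sketch and not part of either repo: `sm_tag` and `check_gpu` are hypothetical helper names, and it assumes `torch` is importable in the container.

```python
def sm_tag(major: int, minor: int) -> str:
    """Format a CUDA compute capability such as (12, 1) as 'sm_121'."""
    return f"sm_{major}{minor}"


def check_gpu(expected: str = "sm_121") -> bool:
    """Return True if GPU 0 reports the expected compute capability."""
    import torch  # imported lazily so sm_tag() works without torch installed

    if not torch.cuda.is_available():
        print("CUDA is not available in this environment")
        return False
    major, minor = torch.cuda.get_device_capability(0)
    found = sm_tag(major, minor)
    print(f"GPU 0 reports {found}, expected {expected}")
    return found == expected
```

Calling `check_gpu()` from a Python shell inside the running container should be enough. It only tells you what the driver reports, not whether FlashInfer was actually compiled for that target.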
Stage 2: MTP patches
The base doesn’t include the MTP speculative decoding fixes for Qwen3.5. Our image adds ~12 inline Python patches on top for:
- MTP draft model weight remapping (Qwen3.5 MTP layers)
- MTP quant exclusion (draft model must stay BF16 even with FP8 KV cache)
- mRoPE position handling fixes (full buffer, narrow, dynamic)
- GDN NaN guard + Triton allocator fix for SM121
- FlashInfer autotune cache persistence (saves profiling so reboots are faster)
- Spec decode negative counter guard
These are all Python-level patches (no recompilation needed) applied via `python3 -c` in the Dockerfile. Each one is idempotent: it checks for a marker string before applying.
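
To illustrate the marker-string idea, here is a minimal sketch of what an idempotent patch could look like. It is not copied from the Dockerfile: `apply_patch`, `MARKER`, and the choice to append the marker at the end of the file are all assumptions made for the example.

```python
from pathlib import Path

MARKER = "# PATCHED: example-marker"  # hypothetical marker string


def apply_patch(path: Path, old: str, new: str, marker: str = MARKER) -> bool:
    """Replace `old` with `new` in `path` once; return True if the file changed."""
    text = path.read_text()
    if marker in text:
        return False  # already patched, so a second run is a no-op
    if old not in text:
        # Fail loudly instead of silently skipping when upstream code has moved.
        raise ValueError(f"pattern not found in {path}")
    # Replace only the first occurrence, then tag the file with the marker.
    patched = text.replace(old, new, 1) + f"\n{marker}\n"
    path.write_text(patched)
    return True
```

Because the marker check comes first, rerunning the same build step leaves an already-patched file untouched, which is what makes rebuilds safe.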
Full build recipe, Dockerfile, and all patches are on GitHub: buildsparklabs/vllm-gb10-mtp (pre-built vLLM for NVIDIA GB10 (DGX Spark) with MTP speculative decoding + FP8 KV cache, SM121 patches, FlashInfer compiled for SM121).
Component versions:
| Component | Version |
|---|---|
| Base | nvcr.io/nvidia/pytorch:26.01-py3 |
| FlashInfer | v0.6.1 (compiled for SM121) |
| CUTLASS | main |
| vLLM | 0.17.0rc1, commit a3189a08b |
| transformers | 5.2.0 |
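
If you want to confirm that a finished image matches the table above, something like the following could be run inside the container. Treat it as a sketch: the distribution names (in particular `flashinfer-python`) are my assumption about how the packages are registered, and the comparison is a loose prefix match so that a dev build string such as `0.17.0rc1.dev216+ga3189a08b.cu130` still counts as `0.17.0rc1`.

```python
from importlib import metadata

# Expected versions from the table above; distribution names are assumptions.
EXPECTED = {
    "vllm": "0.17.0rc1",
    "flashinfer-python": "0.6.1",
    "transformers": "5.2.0",
}


def installed_versions(names):
    """Map each distribution name to its installed version, or None if missing."""
    found = {}
    for name in names:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            found[name] = None
    return found


def mismatches(expected, found):
    """Return names that are missing or whose version lacks the expected prefix."""
    return [
        name
        for name, want in expected.items()
        if found.get(name) is None or not found[name].startswith(want)
    ]
```

For example, `mismatches(EXPECTED, installed_versions(EXPECTED))` returns an empty list when everything lines up.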
Heads up: Build must happen on the Spark itself (aarch64). FlashInfer compilation is the bottleneck at ~2-3 hours. The ccache mount in the Dockerfile helps a lot on rebuilds. And you need an NGC account to pull the base PyTorch image.
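
Since a failed start wastes a lot of time here, a small pre-flight check before kicking off the build may be worth it. The sketch below is an assumption on my part, not something from the repo: `preflight` is a hypothetical helper that only checks the host architecture and that `docker` is on the PATH. It does not verify your NGC login.

```python
import platform
import shutil


def preflight(machine=None, docker_path=None):
    """Return a list of problems that would block the build; empty means OK."""
    # Arguments exist for testing; by default the real host is inspected.
    machine = machine or platform.machine()
    if docker_path is None:
        docker_path = shutil.which("docker")
    problems = []
    if machine != "aarch64":
        problems.append(f"host architecture is {machine}, expected aarch64")
    if not docker_path:
        problems.append("docker not found on PATH")
    return problems
```

Running `preflight()` on the Spark should return an empty list. Anything else is worth fixing before starting a multi-hour compile.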
Happy to answer questions if anything is unclear or if you hit issues.