Does Qwen3.5-35B-A3B on GB10 leave a lot of performance on the table?

Absolutely — here’s the full build recipe.

The image is built in two stages: compile FlashInfer for SM121 (~2-3 hours, the painful part), then install vLLM + patches on top (~15 min).

Stage 1: Base image (FlashInfer + vLLM + SM121 patches)

Clone JungkwanBan’s spark_vllm_docker — this handles the FlashInfer compilation and core SM121 fixes:

git clone https://github.com/JungkwanBan/spark_vllm_docker.git
cd spark_vllm_docker

docker buildx build \
    --build-arg VLLM_VERSION="vllm==0.17.0rc1.dev216+ga3189a08b.cu130" \
    --build-arg FLASHINFER_REF=v0.6.1 \
    --build-arg CUTLASS_REF=main \
    --build-arg TRANSFORMERS_VER=5.2.0 \
    --build-arg BUILD_JOBS=16 \
    -t vllm-spark:base \
    --load .

That gives you vLLM 0.17.0rc1 with FlashInfer compiled for SM121, plus JungkwanBan’s patches (Blackwell class detection, TRITON_PTXAS_PATH, nogds force, AOT cache fix, fastsafetensors sort fix, MoE configs for E=256/E=512).

Stage 2: MTP patches

The base doesn’t include the MTP speculative decoding fixes for Qwen3.5. Our image adds ~12 inline Python patches on top for:

  • MTP draft model weight remapping (Qwen3.5 MTP layers)
  • MTP quant exclusion (draft model must stay BF16 even with FP8 KV cache)
  • mRoPE position handling fixes (full buffer, narrow, dynamic)
  • GDN NaN guard + Triton allocator fix for SM121
  • FlashInfer autotune cache persistence (saves profiling so reboots are faster)
  • Spec decode negative counter guard

These are all Python-level patches (no recompilation needed) applied via python3 -c in the Dockerfile. Each one is idempotent: it checks for a marker string before applying, so rerunning the build never patches the same file twice.
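For anyone writing their own patches in the same style, the marker-check pattern can be sketched roughly as below. This is only an illustration, not one of the actual patches from the repo: the helper name apply_patch, the marker text, and the old/new snippets are all hypothetical.

```python
from pathlib import Path


def apply_patch(path: Path, marker: str, old: str, new: str) -> bool:
    """Replace `old` with `new` in the file at `path`, at most once.

    Hypothetical helper: returns True if the file was modified, False if
    the marker was already present (patch applied on an earlier run).
    """
    text = path.read_text()
    if marker in text:
        return False  # already patched, leave the file untouched
    if old not in text:
        # Fail loudly so an upstream code change doesn't silently skip the patch.
        raise RuntimeError(f"patch target not found in {path}")
    # Tag the replacement with the marker so the next run can detect it.
    patched = text.replace(old, f"{new}  # {marker}", 1)
    path.write_text(patched)
    return True
```

The first call rewrites the file and returns True; any later call with the same marker returns False and changes nothing, which is what makes a rebuild safe. Raising when the target string is missing is a design choice in this sketch, so that a vLLM version bump that moves the code shows up as a build failure rather than an unpatched image.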

Full build recipe, Dockerfile, and all patches are on GitHub: buildsparklabs/vllm-gb10-mtp (pre-built vLLM for NVIDIA GB10 / DGX Spark with MTP speculative decoding + FP8 KV cache, SM121 patches, and FlashInfer compiled for SM121).

Component versions:

Component     Version
Base          nvcr.io/nvidia/pytorch:26.01-py3
FlashInfer    v0.6.1 (compiled for SM121)
CUTLASS       main
vLLM          0.17.0rc1, commit a3189a08b
transformers  5.2.0
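
If you want to confirm that a finished image matches these versions, something like the following could be run inside the container. It is a rough sketch under assumptions: the distribution names (in particular flashinfer-python) and the exact version strings a source build reports may differ in your image, so treat the EXPECTED table as a placeholder to adjust.

```python
from importlib.metadata import PackageNotFoundError, version

# Assumed distribution names and version prefixes; adjust to your image.
EXPECTED = {
    "vllm": "0.17.0rc1",
    "flashinfer-python": "0.6.1",
    "transformers": "5.2.0",
}


def check_versions(expected: dict[str, str]) -> dict[str, str]:
    """Map each distribution name to 'ok', 'missing', or a mismatch note."""
    report = {}
    for name, prefix in expected.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            report[name] = "missing"
            continue
        # Prefix match, because dev builds carry suffixes such as ".dev216+g...".
        report[name] = "ok" if installed.startswith(prefix) else f"mismatch: {installed}"
    return report


if __name__ == "__main__":
    for name, status in check_versions(EXPECTED).items():
        print(f"{name}: {status}")
```

The check uses a prefix match rather than equality so that the vLLM dev build string from the build command still counts as 0.17.0rc1.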

Heads up: Build must happen on the Spark itself (aarch64). FlashInfer compilation is the bottleneck at ~2-3 hours. The ccache mount in the Dockerfile helps a lot on rebuilds. And you need an NGC account to pull the base PyTorch image.

Happy to answer questions if anything is unclear or if you hit issues.