Does Qwen3.5-35B-A3B on GB10 leave a lot of performance on the table?

buildspark · March 16, 2026, 12:27pm

Absolutely — here’s the full build recipe.

The image is built in two stages: compile FlashInfer for SM121 (~2-3 hours, the painful part), then install vLLM + patches on top (~15 min).

Stage 1: Base image (FlashInfer + vLLM + SM121 patches)

Clone JungkwanBan’s spark_vllm_docker — this handles the FlashInfer compilation and core SM121 fixes:

```bash
git clone https://github.com/JungkwanBan/spark_vllm_docker.git
cd spark_vllm_docker

docker buildx build \
    --build-arg VLLM_VERSION="vllm==0.17.0rc1.dev216+ga3189a08b.cu130" \
    --build-arg FLASHINFER_REF=v0.6.1 \
    --build-arg CUTLASS_REF=main \
    --build-arg TRANSFORMERS_VER=5.2.0 \
    --build-arg BUILD_JOBS=16 \
    -t vllm-spark:base \
    --load .
```

That gives you vLLM 0.17.0rc1 with FlashInfer compiled for SM121, plus JungkwanBan’s patches (Blackwell class detection, TRITON_PTXAS_PATH, nogds force, AOT cache fix, fastsafetensors sort fix, MoE configs for E=256/E=512).
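
If you want a quick sanity check that the finished image actually sees the GPU as SM121, something like the following should do. This is a minimal sketch, assuming PyTorch is importable inside the container and that GB10 reports compute capability 12.1 (which is what SM121 denotes). The helper names `sm_name` and `detect_sm` are illustrative and not part of vLLM or the repo.

```python
from typing import Optional, Tuple


def sm_name(capability: Tuple[int, int]) -> str:
    """Format a (major, minor) compute capability as an SM tag, e.g. (12, 1) -> 'SM121'."""
    major, minor = capability
    return f"SM{major}{minor}"


def detect_sm() -> Optional[str]:
    """Return the SM tag of GPU 0, or None if PyTorch or CUDA is unavailable."""
    try:
        import torch  # imported lazily so the helper above works without PyTorch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return sm_name(torch.cuda.get_device_capability(0))


if __name__ == "__main__":
    found = detect_sm()
    print(f"detected: {found}")
    if found != "SM121":
        print("warning: expected SM121 on GB10")
```

Run it with python3 inside the container (with the GPU exposed to Docker). A result other than SM121 usually means the container cannot see the GPU at all.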

Stage 2: MTP patches

The base doesn’t include the MTP speculative decoding fixes for Qwen3.5. Our image adds ~12 inline Python patches on top for:

MTP draft model weight remapping (Qwen3.5 MTP layers)
MTP quant exclusion (draft model must stay BF16 even with FP8 KV cache)
mRoPE position handling fixes (full buffer, narrow, dynamic)
GDN NaN guard + Triton allocator fix for SM121
FlashInfer autotune cache persistence (saves profiling results so startup after a reboot is faster)
Spec decode negative counter guard

These are all Python-level patches (no recompilation needed), applied via python3 -c in the Dockerfile. Each one is idempotent: it checks for a marker string before applying.
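
To illustrate the marker-string idea, here is a minimal sketch of an idempotent patcher. It is not the code from the repo: the function name `apply_patch`, its parameters, and the choice to append the marker as a trailing comment are all assumptions made for the example.

```python
from pathlib import Path


def apply_patch(path: str, marker: str, old: str, new: str) -> str:
    """Replace the first occurrence of `old` with `new`, tagged with `marker`.

    Returns "skipped" if the marker is already in the file (patch applied earlier),
    "applied" on success, and raises if the text to replace cannot be found.
    """
    target = Path(path)
    source = target.read_text()

    # Idempotency check: a previous run left the marker behind, so do nothing.
    if marker in source:
        return "skipped"

    # Fail loudly if upstream changed and the anchor text no longer exists.
    if old not in source:
        raise RuntimeError(f"anchor text not found in {path}")

    # Append the marker as a comment (hypothetical convention for this sketch).
    patched = source.replace(old, f"{new}  # {marker}", 1)
    target.write_text(patched)
    return "applied"
```

Running the same call twice returns "applied" the first time and "skipped" the second, so rebuilding on top of an already patched layer does not stack the same change twice. Raising when the anchor text is missing makes a silent no-op after an upstream vLLM change less likely.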

Full build recipe, Dockerfile, and all patches are on GitHub: buildsparklabs/vllm-gb10-mtp (pre-built vLLM for NVIDIA GB10 / DGX Spark with MTP speculative decoding + FP8 KV cache, SM121 patches, and FlashInfer compiled for SM121).

Component versions:

Component	Version
Base	`nvcr.io/nvidia/pytorch:26.01-py3`
FlashInfer	v0.6.1 (compiled for SM121)
CUTLASS	main
vLLM	0.17.0rc1, commit `a3189a08b`
transformers	5.2.0

Heads up: Build must happen on the Spark itself (aarch64). FlashInfer compilation is the bottleneck at ~2-3 hours. The ccache mount in the Dockerfile helps a lot on rebuilds. And you need an NGC account to pull the base PyTorch image.
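
Since the FlashInfer step is so long, it can be worth confirming the host architecture before kicking off the build. A minimal sketch using only the Python standard library; the function name `is_supported_build_host` is made up for this example.

```python
import platform


def is_supported_build_host(machine: str) -> bool:
    """Return True if the reported machine type is aarch64 (what the Spark reports on Linux)."""
    return machine.lower() == "aarch64"


if __name__ == "__main__":
    machine = platform.machine()
    if is_supported_build_host(machine):
        print(f"{machine}: OK to build here")
    else:
        print(f"{machine}: not aarch64, build on the Spark instead")
```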

Happy to answer questions if anything is unclear or if you hit issues.
