⚡ Update: v2 (post #71) achieves 51 tok/s. v2.1 (post #104) adds a quick-start script. See those posts for the latest setup.
Been chasing every last token/second out of Qwen3.5-122B-A10B on a single DGX Spark for the past few weeks. Not sure if anyone else is still optimizing this model on Spark, but figured I’d share what I found in case it saves someone a few weekends.
The short version: managed to get from 28.3 to 38.4 tok/s with no quality loss. Not exactly setting the world on fire, but it’s honest work.
What actually helped
| Step | tok/s | Gain |
|---|---|---|
| Baseline (vLLM 0.19 + Intel AutoRound INT4 + FlashInfer) | 28.3 | — |
| + Hybrid INT4+FP8 for shared expert dense layers | 30.8 | +8.8% |
| + MTP-1 speculative decoding (95% acceptance rate) | 38.4 | +25% |
The hybrid approach replaces the shared expert BF16 weights with FP8 from Qwen’s official FP8 checkpoint. Required a small patch to vLLM’s INC quantization config (~95 lines) to properly dispatch FP8 layers through CUTLASS instead of dropping them into UnquantizedLinearMethod (which was the default behavior — a bug, essentially).
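To make the dispatch problem concrete, here is a toy sketch of the decision the patch has to get right. This is not the actual patch and not vLLM's API: the function name, the returned labels, and the `*.shared_expert.*` name pattern are all hypothetical, chosen only for illustration.

```python
from fnmatch import fnmatch

# Hypothetical name pattern for shared-expert layers; real checkpoints may differ.
FP8_LAYER_PATTERNS = ("*.shared_expert.*",)


def pick_linear_method(layer_name, int4_layers, fp8_patterns=FP8_LAYER_PATTERNS):
    """Return a label for the kernel path a linear layer should take.

    Illustrative only: the labels stand in for whatever quantization
    method objects the real config would return.
    """
    if layer_name in int4_layers:
        return "int4"
    # Without this branch, FP8 layers that are absent from the INT4 config
    # fall through to the unquantized path (the behavior described above).
    if any(fnmatch(layer_name, pattern) for pattern in fp8_patterns):
        return "fp8"
    return "unquantized"


int4 = {"model.layers.0.mlp.experts.3.up_proj"}
print(pick_linear_method("model.layers.0.mlp.experts.3.up_proj", int4))
print(pick_linear_method("model.layers.0.mlp.shared_expert.gate_proj", int4))
print(pick_linear_method("lm_head", int4))
```

The point of the sketch is only the order of checks: layers the INT4 config does not know about need an explicit FP8 branch before the unquantized fallback.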
The MTP part was a surprise. Intel AutoRound includes the MTP head weights (model_extra_tensors.safetensors, 4.8 GB) and references them in the index — so for vanilla Intel AutoRound, just pass --speculative-config '{"method":"mtp","num_speculative_tokens":1}' and you’re done. If you built a hybrid checkpoint, the MTP file and mappings aren’t carried over — use add-mtp-weights.py from the repo to add them back. Either way, you get 95% acceptance rate despite all the reported DeltaNet rollback issues (#36331, #36872). Turns out those bugs were caused by corrupted MTP weights in NVFP4 quantizations, not a fundamental architecture problem.
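A quick back-of-envelope check on what that acceptance rate buys. The sketch assumes "95% acceptance" means each draft token is accepted with probability 0.95, and that the 30.8 and 38.4 tok/s figures were measured on comparable workloads. Both are assumptions, and the helper name is made up for this example.

```python
def expected_tokens_per_step(acceptance_rate, num_draft_tokens=1):
    """Expected tokens emitted per target-model step.

    Assumes each draft token is accepted independently with the same
    probability and that drafting stops at the first rejection.
    """
    # One token is always produced; draft token i only counts if all
    # draft tokens before it were accepted too.
    return 1.0 + sum(acceptance_rate ** i for i in range(1, num_draft_tokens + 1))


tokens_per_step = expected_tokens_per_step(0.95, num_draft_tokens=1)
observed_speedup = 38.4 / 30.8  # Hybrid+MTP vs Hybrid, from the table above

# If a step yields `tokens_per_step` tokens but throughput only rises by
# `observed_speedup`, each MTP step must cost this much more than a plain step.
step_cost_ratio = tokens_per_step / observed_speedup

print(f"expected tokens per step: {tokens_per_step:.2f}")
print(f"observed speedup:         {observed_speedup:.3f}x")
print(f"implied step cost ratio:  {step_cost_ratio:.2f}x")
```

Under those assumptions, each step yields about 1.95 tokens while throughput rises about 1.25x, so an MTP step costs roughly 1.56x a plain decode step. That gap is the overhead of running the MTP head and verifying the draft token.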
A note on Intel AutoRound INT4 quality
Let’s be honest: Intel/Qwen3.5-122B-A10B-int4-AutoRound is not perfect. It was quantized with default AutoRound parameters (iters=200, nsamples=128, seqlen=2048), which is… conservative, to put it politely. The model works, it’s the best publicly available INT4 option for this architecture, and we should be grateful it exists. But if someone with serious compute were to re-quantize with nsamples=256 and more calibration iterations, I’d expect a meaningful quality improvement: lower perplexity, better coherence, fewer quantization artifacts. The speed would stay the same, but the answers should get noticeably better. Hint hint, Intel.
What didn’t help (so you don’t waste your time)
- FP8 KV cache: +0.2 tok/s (noise)
- NVFP4 (RedHatAI): 16.6 tok/s — slower than INT4 because FP4 CUTLASS kernels don’t work on SM121 yet
- Triton native SM121 kernels replacing Marlin: 0% difference — it’s all memory-bandwidth bound
- vLLM PR cherry-picks (#38990, #37700): 0% on v0.19.1
- Rewriting Marlin for SM121: pointless, since SM121 uses the same mma.sync as SM80, with no new tensor core instructions
That last one was a painful lesson. SM121 is Blackwell in name but Ampere in ISA (for tensor cores, at least). The 3.65x speedups people report are on datacenter Blackwell (SM100/SM103) with native FP4 CUTLASS. Not us.
38.4 tok/s is likely the memory bandwidth ceiling for this model on a single Spark. The evidence: swapping kernel implementations (Marlin PTX vs Triton native) made zero difference, so the GPU is just waiting for LPDDR5x at 273 GB/s. One petaflop of compute, patiently twiddling its thumbs while memory delivers data through a garden hose. The most expensive paperweight-that-could-be-faster-if-only-it-had-faster-RAM in my office.
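The bandwidth argument can be turned into a rough number. The sketch below assumes decoding is fully bandwidth-bound and that the full 273 GB/s peak is actually reached. Both are simplifications, so the results are upper bounds on how much data each decode step can be reading, not measurements. The function name is made up for this example.

```python
PEAK_BANDWIDTH_GB_S = 273.0  # LPDDR5x figure quoted in the post


def implied_gb_per_step(tok_per_s, tokens_per_step=1.0):
    """Upper bound on GB read per decode step at peak memory bandwidth."""
    steps_per_s = tok_per_s / tokens_per_step
    return PEAK_BANDWIDTH_GB_S / steps_per_s


baseline_gb = implied_gb_per_step(28.3)  # INT4 baseline, no speculation
hybrid_gb = implied_gb_per_step(30.8)    # hybrid INT4+FP8, no speculation

print(f"baseline: at most {baseline_gb:.1f} GB per step")
print(f"hybrid:   at most {hybrid_gb:.1f} GB per step")
```

With the numbers from the post, this gives roughly 9.6 GB per step for the baseline and 8.9 GB for the hybrid. Read the other way around: at a fixed bandwidth, the only ways to go faster are to read fewer bytes per step or to get more tokens out of each step. The hybrid quant does the first and MTP does the second.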
Benchmark details (Run 2, warm cache)
| Test | Baseline | Hybrid | Hybrid+MTP |
|---|---|---|---|
| Q&A (256 tok) | 28.3 | 30.8 | 37.8 |
| Code (512 tok) | 28.3 | 30.8 | 39.1 |
| JSON (1024 tok) | 28.4 | 30.9 | 39.0 |
| Math (64 tok) | 27.3 | 29.7 | 36.3 |
| Long Code (2048 tok) | 28.3 | 31.0 | 39.9 |
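For anyone who wants per-test gains rather than a single headline number, here is a small script over the table above. The values are copied from the table; the helper name is just for this example.

```python
# (baseline, hybrid, hybrid+MTP) in tok/s, copied from the table above
RESULTS = {
    "Q&A (256 tok)": (28.3, 30.8, 37.8),
    "Code (512 tok)": (28.3, 30.8, 39.1),
    "JSON (1024 tok)": (28.4, 30.9, 39.0),
    "Math (64 tok)": (27.3, 29.7, 36.3),
    "Long Code (2048 tok)": (28.3, 31.0, 39.9),
}


def pct_gain(old, new):
    """Percentage throughput gain of `new` over `old`."""
    return (new / old - 1.0) * 100.0


# Gain of the full Hybrid+MTP stack over the baseline, per test
mtp_gains = {name: pct_gain(base, mtp) for name, (base, _, mtp) in RESULTS.items()}

for name, (base, hybrid, mtp) in RESULTS.items():
    print(f"{name:22s} hybrid {pct_gain(base, hybrid):+5.1f}%   "
          f"hybrid+MTP {pct_gain(base, mtp):+5.1f}%")
```

In these numbers, the shortest test (Math, 64 tokens) shows the smallest total gain and the longest (Long Code, 2048 tokens) the largest.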
All patches, Dockerfile, benchmark script, and a step-by-step guide are here:
Would love to hear if anyone has found other approaches or managed to go higher. Speculative decoding with more tokens (MTP-2, MTP-3) could theoretically push further, but Qwen3.5 only ships with 1 MTP layer.
This is my first post here, but I’ve been reading this forum religiously for months. Huge thanks to everyone who shares their findings — the hybrid quant pioneers, the NVFP4 explorers, the llama.cpp benchmarkers, and everyone debugging SM121 quirks in the trenches. You’ve all saved me countless hours. Figured it was time to give something back.