Qwen3.5-122B-A10B on single Spark: up to 51 tok/s (v2.1 — patches + quick-start + benchmark)

⚡ Update: v2 (post #71) achieves 51 tok/s. v2.1 (post #104) adds a quick-start script. See those posts for the latest setup.

Been chasing every last token/second out of Qwen3.5-122B-A10B on a single DGX Spark for the past few weeks. Not sure if anyone else is still optimizing this model on Spark, but figured I’d share what I found in case it saves someone a few weekends.

The short version: managed to get from 28.3 to 38.4 tok/s with no quality loss. Not exactly setting the world on fire, but it’s honest work.

What actually helped

| Step | tok/s | Gain |
|---|---|---|
| Baseline (vLLM 0.19 + Intel AutoRound INT4 + FlashInfer) | 28.3 | |
| + Hybrid INT4+FP8 for shared expert dense layers | 30.8 | +8.8% |
| + MTP-1 speculative decoding (95% acceptance rate) | 38.4 | +25% |

The hybrid approach replaces the shared expert BF16 weights with FP8 from Qwen’s official FP8 checkpoint. Required a small patch to vLLM’s INC quantization config (~95 lines) to properly dispatch FP8 layers through CUTLASS instead of dropping them into UnquantizedLinearMethod (which was the default behavior — a bug, essentially).

The MTP part was a surprise. Intel AutoRound includes the MTP head weights (model_extra_tensors.safetensors, 4.8 GB) and references them in the index — so for vanilla Intel AutoRound, just pass --speculative-config '{"method":"mtp","num_speculative_tokens":1}' and you’re done. If you built a hybrid checkpoint, the MTP file and mappings aren’t carried over — use add-mtp-weights.py from the repo to add them back. Either way, you get 95% acceptance rate despite all the reported DeltaNet rollback issues (#36331, #36872). Turns out those bugs were caused by corrupted MTP weights in NVFP4 quantizations, not a fundamental architecture problem.
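For anyone scripting the launch, here is a minimal Python sketch that builds the command line with the speculative config quoted above. It assumes the standard `vllm serve` entry point and the Hugging Face model id of the Intel checkpoint; the helper name `build_serve_command` is made up for illustration, and every other flag you normally pass (attention backend, memory settings and so on) is left out.

```python
import json
import shlex


def build_serve_command(model: str, num_speculative_tokens: int = 1) -> list[str]:
    """Return an argv list for `vllm serve` with MTP speculative decoding enabled.

    Only the --speculative-config value comes from the post; the rest of a real
    launch command (backend, memory limits, ...) is intentionally omitted.
    """
    spec_config = {"method": "mtp", "num_speculative_tokens": num_speculative_tokens}
    return [
        "vllm", "serve", model,
        # Compact separators reproduce the JSON exactly as written in the post.
        "--speculative-config", json.dumps(spec_config, separators=(",", ":")),
    ]


cmd = build_serve_command("Intel/Qwen3.5-122B-A10B-int4-AutoRound")
# shlex.join quotes the JSON argument so the line can be pasted into a shell.
print(shlex.join(cmd))
```

Generating the JSON with `json.dumps` instead of typing it by hand avoids the quoting mistakes that creep in when the snippet is copied through a forum editor.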

A note on Intel AutoRound INT4 quality

Let’s be honest — Intel/Qwen3.5-122B-A10B-int4-AutoRound is not perfect. It was quantized with default AutoRound parameters (iters=200, nsamples=128, seqlen=2048) which is… conservative, to put it politely. The model works, it’s the best publicly available INT4 option for this architecture, and we should be grateful it exists. But if someone with serious compute were to re-quantize with nsamples=256 and more calibration iterations, the quality improvement would be significant — lower perplexity, better coherence, fewer quantization artifacts. The speed would stay the same, but the answers would get noticeably better. Hint hint, Intel.

What didn’t help (so you don’t waste your time)

  • FP8 KV cache: +0.2 tok/s (noise)
  • NVFP4 (RedHatAI): 16.6 tok/s — slower than INT4 because FP4 CUTLASS kernels don’t work on SM121 yet
  • Triton native SM121 kernels replacing Marlin: 0% difference — it’s all memory-bandwidth bound
  • vLLM PR cherry-picks (#38990, #37700): 0% on v0.19.1
  • Rewriting Marlin for SM121: pointless — SM121 uses the same mma.sync as SM80, no new tensor core instructions

That last one was a painful lesson. SM121 is Blackwell in name but Ampere in ISA (for tensor cores, at least). The 3.65x speedups people report are on datacenter Blackwell (SM100/SM103) with native FP4 CUTLASS. Not us.

38.4 tok/s is likely the memory bandwidth ceiling for this model on a single Spark. We proved it by swapping kernel implementations (Marlin PTX vs Triton native) with zero difference — the GPU is just waiting for LPDDR5x at 273 GB/s. One petaflop of compute, patiently twiddling its thumbs while memory delivers data through a garden hose. The most expensive paperweight-that-could-be-faster-if-only-it-had-faster-RAM in my office.
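The bandwidth argument can be put into numbers with a back-of-the-envelope sketch. It assumes decode is purely memory-bound, so steps per second is roughly bandwidth divided by bytes streamed per step. The 273 GB/s and 28.3 tok/s figures are from this post; the "10B active parameters at exactly 4 bits" case is an idealized assumption that ignores quantization scales, the non-INT4 layers, KV cache reads and any overhead.

```python
BANDWIDTH_GB_S = 273.0  # LPDDR5x bandwidth quoted in the post


def decode_ceiling_steps_s(bandwidth_gb_s: float, gb_read_per_step: float) -> float:
    """Upper bound on decode steps/s if each step must stream gb_read_per_step."""
    return bandwidth_gb_s / gb_read_per_step


def implied_gb_per_step(bandwidth_gb_s: float, steps_per_s: float) -> float:
    """Memory traffic per step implied by a measured step rate (memory-bound case)."""
    return bandwidth_gb_s / steps_per_s


# Measured baseline without MTP: one token per step at 28.3 tok/s.
print(f"implied traffic: {implied_gb_per_step(BANDWIDTH_GB_S, 28.3):.2f} GB/step")

# Idealized assumption: 10B active params * 0.5 byte (4-bit) = 5 GB per step.
print(f"idealized ceiling: {decode_ceiling_steps_s(BANDWIDTH_GB_S, 5.0):.1f} steps/s")
```

The gap between the idealized ceiling and the measured rate is the traffic the 4-bit estimate leaves out. Speculative decoding does not raise this ceiling; it only gets more than one token out of each step.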

Benchmark details (Run 2, warm cache)

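| Test | Baseline (tok/s) | Hybrid (tok/s) | Hybrid+MTP (tok/s) |
|---|---|---|---|
| Q&A (256 tok) | 28.3 | 30.8 | 37.8 |
| Code (512 tok) | 28.3 | 30.8 | 39.1 |
| JSON (1024 tok) | 28.4 | 30.9 | 39.0 |
| Math (64 tok) | 27.3 | 29.7 | 36.3 |
| Long Code (2048 tok) | 28.3 | 31.0 | 39.9 |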

All patches, Dockerfile, benchmark script, and a step-by-step guide are here:

GitHub: albond/DGX_Spark_Qwen3.5-122B-A10B-AR-INT4 (Qwen3.5-122B-A10B on DGX Spark: 28.3 → 51 tok/s, +80%)

Would love to hear if anyone has found other approaches or managed to go higher. Speculative decoding with more tokens (MTP-2, MTP-3) could theoretically push further, but Qwen3.5 only ships with 1 MTP layer.


This is my first post here, but I’ve been reading this forum religiously for months. Huge thanks to everyone who shares their findings — the hybrid quant pioneers, the NVFP4 explorers, the llama.cpp benchmarkers, and everyone debugging SM121 quirks in the trenches. You’ve all saved me countless hours. Figured it was time to give something back.

Since the patches are Python-only, we can probably integrate this as a mod, @eugr? :)

Nice, thank you for the effort! Will give this a try, as this model is usually my daily driver.

Did you come across errors like this?

WARNING: unexpected unmatched FP8 tensor model.language_model.layers.0.linear_attn.in_proj_a.weight
WARNING: unexpected unmatched FP8 tensor model.language_model.layers.0.linear_attn.in_proj_b.weight
WARNING: unexpected unmatched FP8 tensor model.language_model.layers.0.linear_attn.in_proj_qkv.weight
WARNING: unexpected unmatched FP8 tensor model.language_model.layers.0.linear_attn.in_proj_qkv.weight_scale_inv
WARNING: unexpected unmatched FP8 tensor model.language_model.layers.0.linear_attn.in_proj_z.weight
WARNING: unexpected unmatched FP8 tensor model.language_model.layers.0.linear_attn.in_proj_z.weight_scale_inv

WARNING: proceeding despite 408 unexpected unmatched FP8 tensors because --force was provided

Thanks for sharing. I’ll give it a try and report back.

Working mainly with the 35B version, but would love to switch to 122B, if only the speed would increase. Your solution looks promising.

If I understand your investigation correctly, to be able to benefit from MTP, we only have to build a new model version with:

python patches/02-mtp-speculative/add-mtp-weights.py \
    --source "$INTEL_DIR" \
    --target ~/models/qwen35-122b-hybrid-int4fp8

and everyone who simply added that line to the launch script before was just getting a placebo effect?

Yes, this is expected. The 408 unmatched FP8 tensors are mostly linear_attn projections (DeltaNet layers — 36 out of 48 layers in Qwen3.5) plus some attention norms and gates. They exist in the Qwen FP8 checkpoint but don’t have matching counterparts in the Intel AutoRound INT4 checkpoint because the naming conventions differ.
The script only replaces shared_expert dense layers (144 tensors) with FP8 — everything else stays in its original format (BF16/INT4). The --force flag is the right call here.
I verified output quality after building the checkpoint — math, code, Bayesian reasoning, language — no degradation compared to the pure INT4 baseline. The 408 skipped tensors are not a problem.
PS: Trust me, I’ve seen worse — when I tried to fix Gemma 4’s heterogeneous head_dim (256/512) to make FlashAttention work, I got garbage output and wasted time.
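The selection behaviour described above (swap only the shared_expert tensors, warn about FP8 tensors with no counterpart, continue only with --force) can be sketched roughly like this. It is an illustration of the idea, not the code of the repo's build script: the function name, the substring match on `shared_expert`, and the shared-expert tensor name in the example are assumptions.

```python
def plan_hybrid_swap(fp8_names, target_names, marker="shared_expert", force=False):
    """Split FP8 tensor names into 'replace' and 'unmatched' lists.

    replace   : FP8 tensors whose name contains `marker` (assumed selection rule)
    unmatched : other FP8 tensors with no same-named tensor in the target
                checkpoint; they are skipped, not copied
    """
    target = set(target_names)
    replace, unmatched = [], []
    for name in fp8_names:
        if marker in name:
            replace.append(name)
        elif name not in target:
            unmatched.append(name)
    if unmatched and not force:
        raise ValueError(
            f"{len(unmatched)} unexpected unmatched FP8 tensors; use force=True to proceed"
        )
    return replace, unmatched


fp8_names = [
    # Hypothetical shared-expert tensor name, for illustration only.
    "model.language_model.layers.0.mlp.shared_expert.gate_proj.weight",
    # Name taken from the warning log above.
    "model.language_model.layers.0.linear_attn.in_proj_a.weight",
]
replace, unmatched = plan_hybrid_swap(fp8_names, target_names=[], force=True)
print(len(replace), "to replace,", len(unmatched), "skipped")
```

Under this reading, an unmatched tensor is just a skipped one: the target checkpoint keeps whatever it already had for that layer, which is why the warnings are harmless.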

Yes, MTP is the bigger win, and it works independently of the hybrid patch.
I haven’t actually benchmarked MTP on the plain Intel AutoRound INT4 checkpoint without the hybrid patch.
What I did test end-to-end:

  1. Baseline INT4 + FlashInfer: 28.3 tok/s (verified)
  2. Hybrid INT4+FP8 + FlashInfer: 30.8 tok/s (verified)
  3. Hybrid INT4+FP8 + FlashInfer + MTP-1: 38.4 tok/s (verified)

If you want to skip the hybrid step and just try MTP on vanilla Intel AutoRound, it should work: the MTP weights are architecture-level, not quantization-dependent. But I can’t guarantee the exact number until someone tests it.

The whole point of this optimization work was to get Claude-level intelligence without Claude-level costs. Qwen3.5-122B scores 42 on the Artificial Analysis Intelligence Index — one point below Claude 4.5 Sonnet (43), and beats it on IFBench (76% vs 57%) and Humanity’s Last Exam (23% vs 17%).
That’s why every tok/s matters here — I’m not optimizing a benchmark toy, I’m trying to preserve that intelligence while making it actually usable for daily work. 38.4 tok/s of near-Claude reasoning, fully local, no API bills. The Spark paid for itself in about two months of not paying for cloud API tokens.
I also have Gemma 4 31B-IT at ~10 tok/s — same quality scores but 3.8x slower because 31B dense active params vs 10B MoE. On LPDDR5x, MoE architecture is the only way to run 100B+ class models at interactive speeds. If I could 3D-print a TPU at home, maybe Gemma 4 would win. But I can’t, so here we are.

Hybrid INT4+FP8: detected 144 FP8 dense layers (block_size=[128, 128])

Sounds good :D currently evaluating, as I am running this on the community docker, which is only vLLM 0.18.1rc1.dev41. But seems to work, thank you!

── Run 1/2 ──────────────────────────────────────
[Q&A] 256 tokens in 6.62s = 38.6 tok/s (prompt: 23)

[Code] 512 tokens in 12.77s = 40.0 tok/s (prompt: 30)
[JSON] 1024 tokens in 25.86s = 39.5 tok/s (prompt: 48)
[Math] 64 tokens in 1.74s = 36.7 tok/s (prompt: 29)

llama-benchy doesn’t see any increase in performance though? Might be related to how it is testing?

Ignore the model name :)

| model | test | t/s | peak t/s | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |
|---|---|---|---|---|---|---|
| qwen-coder-next | pp512 @ d2048 | 2089.65 ± 5.17 | | 1227.46 ± 3.03 | 1225.57 ± 3.03 | 1227.54 ± 3.03 |
| qwen-coder-next | tg32 @ d2048 | 21.05 ± 0.56 | 22.33 ± 0.55 | | | |
| qwen-coder-next | pp512 @ d2048 | 2080.09 ± 9.13 | | 1233.11 ± 5.40 | 1231.22 ± 5.40 | 1233.18 ± 5.40 |
| qwen-coder-next | tg128 @ d2048 | 20.57 ± 0.06 | 22.00 ± 0.00 | | | |
| qwen-coder-next | pp2048 @ d2048 | 2286.21 ± 1.46 | | 1793.94 ± 1.15 | 1792.05 ± 1.15 | 1794.00 ± 1.14 |
| qwen-coder-next | tg32 @ d2048 | 20.46 ± 0.01 | 21.74 ± 0.01 | | | |
| qwen-coder-next | pp2048 @ d2048 | 2219.17 ± 64.54 | | 1849.64 ± 53.74 | 1847.75 ± 53.74 | 1849.72 ± 53.74 |
| qwen-coder-next | tg128 @ d2048 | 20.43 ± 0.07 | 21.00 ± 0.00 | | | |
| qwen-coder-next | pp8192 @ d2048 | 2283.25 ± 1.21 | | 4487.17 ± 2.38 | 4485.28 ± 2.38 | 4487.21 ± 2.39 |
| qwen-coder-next | tg32 @ d2048 | 20.80 ± 0.57 | 22.10 ± 0.61 | | | |
| qwen-coder-next | pp8192 @ d2048 | 2279.22 ± 1.96 | | 4495.09 ± 3.86 | 4493.20 ± 3.86 | 4495.13 ± 3.86 |
| qwen-coder-next | tg128 @ d2048 | 20.13 ± 0.06 | 21.00 ± 0.00 | | | |
| qwen-coder-next | pp512 @ d12000 | 2209.33 ± 2.70 | | 5665.60 ± 6.92 | 5663.71 ± 6.92 | 5665.64 ± 6.92 |
| qwen-coder-next | tg32 @ d12000 | 20.59 ± 0.48 | 21.88 ± 0.51 | | | |
| qwen-coder-next | pp512 @ d12000 | 2207.36 ± 2.13 | | 5670.66 ± 5.47 | 5668.77 ± 5.47 | 5670.70 ± 5.47 |
| qwen-coder-next | tg128 @ d12000 | 19.86 ± 0.09 | 21.00 ± 0.00 | | | |
| qwen-coder-next | pp2048 @ d12000 | 2210.20 ± 1.59 | | 6358.32 ± 4.56 | 6356.43 ± 4.56 | 6358.36 ± 4.57 |
| qwen-coder-next | tg32 @ d12000 | 20.10 ± 0.04 | 21.36 ± 0.05 | | | |
| qwen-coder-next | pp2048 @ d12000 | 2200.47 ± 10.52 | | 6386.37 ± 30.76 | 6384.48 ± 30.76 | 6386.44 ± 30.75 |
| qwen-coder-next | tg128 @ d12000 | 19.97 ± 0.07 | 21.00 ± 0.00 | | | |
| qwen-coder-next | pp8192 @ d12000 | 2153.08 ± 1.11 | | 9380.57 ± 4.86 | 9378.68 ± 4.86 | 9380.63 ± 4.84 |
| qwen-coder-next | tg32 @ d12000 | 20.95 ± 0.22 | 22.26 ± 0.23 | | | |
| qwen-coder-next | pp8192 @ d12000 | 2153.40 ± 0.88 | | 9379.14 ± 3.84 | 9377.25 ± 3.84 | 9379.24 ± 3.84 |
| qwen-coder-next | tg128 @ d12000 | 19.39 ± 0.02 | 20.00 ± 0.00 | | | |

The script doesn’t seem to be doing anything for me:

Found 785 MTP tensors in source index
Added 0 MTP tensor mappings to index
Total tensors: 112901
Done. MTP speculative decoding is now available.

About the llama-benchy numbers — the difference is real, just measured differently.

Think of it this way: without MTP, the model does 1 decode step = 1 token. With MTP, the model does 1 decode step but produces ~2 tokens (1 regular + 1 speculative, 95% accepted).

llama-benchy measures decode steps per second — how fast the model runs forward passes. That’s ~20 steps/sec, and each step is actually a tiny bit slower now because of the MTP head overhead. So llama-benchy sees no improvement or even a slight slowdown.

bench_qwen35.sh and real chat measure what you actually get — tokens out divided by wall-clock time. 20 steps/sec × ~1.95 accepted tokens per step = ~39 tok/s. That’s the number you feel when using the model.

Both are correct:

  • ~20 tok/s = how fast the engine runs (decode steps)
  • ~38-40 tok/s = how fast you get your answer (effective throughput)

I see the same thing in my daily use — same prompt that used to take 26 seconds now finishes in 17.
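The step-rate versus effective-throughput relationship above works out as follows in a small Python sketch. For MTP-1 the expected tokens per step is simply 1 + acceptance rate. The generalization to more draft tokens assumes each draft token is accepted independently with the same probability and that the first rejection ends the chain; that part is an assumption, not something measured in this thread.

```python
def expected_tokens_per_step(acceptance_rate: float, num_speculative_tokens: int = 1) -> float:
    """Expected tokens emitted per decode step.

    One token always comes from the regular forward pass; draft token k only
    counts if all drafts up to k were accepted (probability acceptance_rate**k).
    """
    return 1.0 + sum(acceptance_rate ** k for k in range(1, num_speculative_tokens + 1))


def effective_tok_s(steps_per_s: float, acceptance_rate: float,
                    num_speculative_tokens: int = 1) -> float:
    """Tokens per wall-clock second given the engine's decode step rate."""
    return steps_per_s * expected_tokens_per_step(acceptance_rate, num_speculative_tokens)


# Numbers from the explanation above: ~20 steps/s, 95% acceptance, MTP-1.
print(f"{effective_tok_s(20.0, 0.95):.1f} tok/s")
```

With 20 steps/s and 95% acceptance this reproduces the ~39 tok/s figure, which is why a tool that counts decode steps and a script that counts output tokens can both be right.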

That’s actually fine — the original Intel AutoRound checkpoint already has MTP tensor mappings in its index. The script found them all present, so nothing to add.

Just make sure the actual weights file exists in your checkpoint directory:

ls -lh /path/to/your/checkpoint/model_extra_tensors.safetensors

If it’s there (about 4.8 GB), you’re all set. Add --speculative-config '{"method":"mtp","num_speculative_tokens":1}' to your launch command and enjoy ~30+ tok/s.
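The same check can be done from Python, including a look at whether the index actually points at the extra shard. This is a sketch under assumptions: it expects the usual Hugging Face index file name `model.safetensors.index.json` with a `weight_map` of tensor name to shard file, and the helper name `check_mtp_files` is made up. It does not know which tensor names belong to the MTP head; it only counts entries mapped to `model_extra_tensors.safetensors`.

```python
import json
from pathlib import Path


def check_mtp_files(checkpoint_dir, shard="model_extra_tensors.safetensors",
                    index_name="model.safetensors.index.json"):
    """Report whether the extra-tensors shard exists and how many index entries use it."""
    ckpt = Path(checkpoint_dir)
    shard_path = ckpt / shard
    index_path = ckpt / index_name

    mapped = 0
    if index_path.is_file():
        weight_map = json.loads(index_path.read_text()).get("weight_map", {})
        # Count tensors that the index maps to the extra shard.
        mapped = sum(1 for filename in weight_map.values() if filename == shard)

    return {
        "shard_exists": shard_path.is_file(),
        "shard_gb": shard_path.stat().st_size / 1e9 if shard_path.is_file() else 0.0,
        "index_exists": index_path.is_file(),
        "tensors_mapped_to_shard": mapped,
    }


if __name__ == "__main__":
    # Placeholder path, as in the ls command above.
    print(check_mtp_files("/path/to/your/checkpoint"))
```

If the shard exists but no index entries point at it, that would match the hybrid-checkpoint case where the mappings were not carried over.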

The hybrid step (my patches) is a separate optimization on top — MTP works independently.

Why run the script if the index already points to the weights?
I’m asking because you wrote:

The MTP part was a surprise. Intel AutoRound actually includes the MTP head weights (model_extra_tensors.safetensors, 4.8 GB) but doesn’t reference them in the model index

So I was expecting the index to be updated.

Hmm, good catch — need to correct that. The original Intel AutoRound checkpoint does have MTP in the index. The issue only shows up with the hybrid checkpoint from build-hybrid-checkpoint.py which doesn’t carry over MTP mappings. That’s what the script fixes.
So if you’re on vanilla Intel AutoRound, just add --speculative-config '{"method":"mtp","num_speculative_tokens":1}' and you’re done.

Tested it just now:

── Run 2/2 (warm cache) ──────────────────────
[Q&A] 256 tokens in 7.08s = 36.1 tok/s

[Code] 502 tokens in 13.41s = 37.4 tok/s
[JSON] 1024 tokens in 28.05s = 36.5 tok/s
[Math] 64 tokens in 1.86s = 34.4 tok/s
[LongCode] 2048 tokens in 54.01s = 37.9 tok/s

  • Baseline (INT4 + FlashInfer): 28.3 tok/s
  • INT4 + MTP only: 36.5 tok/s (+29%)
  • Hybrid + MTP: 38.4 tok/s (+36%)

MTP alone is the biggest win. Hybrid adds ~2 tok/s on top.
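For clarity on which percentages are relative to what, here is the arithmetic as a short Python sketch using only the three averages listed above. The gains for MTP-only and Hybrid+MTP are both measured against the baseline; the last line shows what hybrid adds on top of MTP-only.

```python
def gain_pct(new: float, base: float) -> float:
    """Relative speedup of `new` over `base`, in percent."""
    return (new / base - 1.0) * 100.0


results = {"baseline": 28.3, "int4_mtp": 36.5, "hybrid_mtp": 38.4}  # tok/s, from the post

for name in ("int4_mtp", "hybrid_mtp"):
    print(f"{name}: +{gain_pct(results[name], results['baseline']):.0f}% vs baseline")

delta = results["hybrid_mtp"] - results["int4_mtp"]
extra = gain_pct(results["hybrid_mtp"], results["int4_mtp"])
print(f"hybrid on top of MTP: +{delta:.1f} tok/s (+{extra:.1f}%)")
```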

Thanks for checking :)
MTP doesn’t work with the PyTorch backend, so I was hoping that it was because there was something wrong with the model, but now I’m going to have to try with Ray to see if I get the same kind of gains.

What error do you get with MTP? And which vLLM version / attention backend are you on? On 0.19 with FlashInfer it works out of the box, but there were several MTP-related bugs in earlier versions for Qwen3.5 (#36843, #36917).

This is only when using a cluster with eugr’s docker. The connection to the 2nd node never happens. The script gives up after 10 minutes.
No problem with Ray using the unmodified model, and up to 56 t/s average generation throughput in the logs when using OWUI, with 85%-100% acceptance rate.

The latest is v0.19 btw

The cached TF5 version is 0.18 and the main build is currently broken according to eugr. So I did not dare to go to 0.19 :D

Maybe you caught it at a bad time? This is what I’m running right now:

[utils.py:299]        █     █     █▄   ▄█
[utils.py:299]  ▄▄ ▄█ █     █     █ ▀▄▀ █  version 0.19.1rc1.dev15+g50cd5674b.d20260403
[utils.py:299]   █▄█▀ █     █     █     █  model   Intel/Qwen3.5-122B-A10B-int4-AutoRound
[utils.py:299]    ▀▀  ▀▀▀▀▀ ▀▀▀▀▀ ▀     ▀