Qwen3.5-122B-A10B on single Spark: up to 51 tok/s (v2.1 — patches + quick-start + benchmark)

Albond · April 5, 2026, 8:02am

⚡ Update: v2 (post #71) achieves 51 tok/s. v2.1 (post #104) adds a quick-start script. See those posts for the latest setup.

Been chasing every last token/second out of Qwen3.5-122B-A10B on a single DGX Spark for the past few weeks. Not sure if anyone else is still optimizing this model on Spark, but figured I’d share what I found in case it saves someone a few weekends.

The short version: managed to get from 28.3 to 38.4 tok/s with no quality loss. Not exactly setting the world on fire, but it’s honest work.

What actually helped

Step	tok/s	Gain
Baseline (vLLM 0.19 + Intel AutoRound INT4 + FlashInfer)	28.3	—
+ Hybrid INT4+FP8 for shared expert dense layers	30.8	+8.8%
+ MTP-1 speculative decoding (95% acceptance rate)	38.4	+25%

The hybrid approach replaces the shared expert BF16 weights with FP8 from Qwen’s official FP8 checkpoint. Required a small patch to vLLM’s INC quantization config (~95 lines) to properly dispatch FP8 layers through CUTLASS instead of dropping them into UnquantizedLinearMethod (which was the default behavior — a bug, essentially).

The MTP part was a surprise. Intel AutoRound includes the MTP head weights (model_extra_tensors.safetensors, 4.8 GB) and references them in the index — so for vanilla Intel AutoRound, just pass --speculative-config '{"method":"mtp","num_speculative_tokens":1}' and you’re done. If you built a hybrid checkpoint, the MTP file and mappings aren’t carried over — use add-mtp-weights.py from the repo to add them back. Either way, you get 95% acceptance rate despite all the reported DeltaNet rollback issues (#36331, #36872). Turns out those bugs were caused by corrupted MTP weights in NVFP4 quantizations, not a fundamental architecture problem.
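
The JSON argument is easy to mangle when copied from a web page, since smart quotes will break it. Below is a minimal Python sketch that builds the flag with shell-safe quoting. The helper name `speculative_config_arg` is hypothetical; only the flag name and the two JSON keys come from the post.

```python
import json
import shlex


def speculative_config_arg(num_speculative_tokens: int = 1) -> str:
    """Return the --speculative-config flag with a shell-safe JSON payload."""
    config = {"method": "mtp", "num_speculative_tokens": num_speculative_tokens}
    # Compact separators keep the payload identical to the one quoted in the post.
    payload = json.dumps(config, separators=(",", ":"))
    return "--speculative-config " + shlex.quote(payload)


print(speculative_config_arg())
# --speculative-config '{"method":"mtp","num_speculative_tokens":1}'
```

Append the printed string to whatever launch command you already use.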

A note on Intel AutoRound INT4 quality

Let’s be honest — Intel/Qwen3.5-122B-A10B-int4-AutoRound is not perfect. It was quantized with default AutoRound parameters (iters=200, nsamples=128, seqlen=2048) which is… conservative, to put it politely. The model works, it’s the best publicly available INT4 option for this architecture, and we should be grateful it exists. But if someone with serious compute were to re-quantize with nsamples=256 and more calibration iterations, the quality improvement would be significant — lower perplexity, better coherence, fewer quantization artifacts. The speed would stay the same, but the answers would get noticeably better. Hint hint, Intel.

What didn’t help (so you don’t waste your time)

FP8 KV cache: +0.2 tok/s (noise)
NVFP4 (RedHatAI): 16.6 tok/s — slower than INT4 because FP4 CUTLASS kernels don’t work on SM121 yet
Triton native SM121 kernels replacing Marlin: 0% difference — it’s all memory-bandwidth bound
vLLM PR cherry-picks (#38990, #37700): 0% on v0.19.1
Rewriting Marlin for SM121: pointless — SM121 uses the same mma.sync as SM80, no new tensor core instructions

That last one was a painful lesson. SM121 is Blackwell in name but Ampere in ISA (for tensor cores, at least). The 3.65x speedups people report are on datacenter Blackwell (SM100/SM103) with native FP4 CUTLASS. Not us.

38.4 tok/s is likely the memory bandwidth ceiling for this model on a single Spark. We proved it by swapping kernel implementations (Marlin PTX vs Triton native) with zero difference — the GPU is just waiting for LPDDR5x at 273 GB/s. One petaflop of compute, patiently twiddling its thumbs while memory delivers data through a garden hose. The most expensive paperweight-that-could-be-faster-if-only-it-had-faster-RAM in my office.
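
The bandwidth argument can be put in back-of-envelope form. This is a rough sketch, assuming decode is purely bandwidth-bound and that the 273 GB/s peak is actually reached (in practice it is not). The function names are hypothetical, and the only inputs are numbers from this post.

```python
def max_decode_steps_per_s(bandwidth_gb_s: float, gb_read_per_step: float) -> float:
    """Upper bound on decode steps/s if every step must stream gb_read_per_step."""
    return bandwidth_gb_s / gb_read_per_step


def implied_gb_per_step(bandwidth_gb_s: float, steps_per_s: float) -> float:
    """Most data a step could be streaming at a measured step rate."""
    return bandwidth_gb_s / steps_per_s


# Baseline without MTP: one step yields one token, so 28.3 tok/s is 28.3 steps/s.
print(f"{implied_gb_per_step(273, 28.3):.2f} GB per step at most")
```

Swapping kernels changes neither term in this ratio, which is consistent with Marlin and Triton landing on the same number.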

Benchmark details (Run 2, warm cache)

Test	Baseline	Hybrid	Hybrid+MTP
Q&A (256 tok)	28.3	30.8	37.8
Code (512 tok)	28.3	30.8	39.1
JSON (1024 tok)	28.4	30.9	39.0
Math (64 tok)	27.3	29.7	36.3
Long Code (2048 tok)	28.3	31.0	39.9

All patches, Dockerfile, benchmark script, and a step-by-step guide are here:

GitHub - albond/DGX_Spark_Qwen3.5-122B-A10B-AR-INT4: Qwen3.5-122B-A10B on DGX Spark: 28.3 → 51 tok/s (+80%) · GitHub

Would love to hear if anyone has found other approaches or managed to go higher. Speculative decoding with more tokens (MTP-2, MTP-3) could theoretically push further, but Qwen3.5 only ships with 1 MTP layer.

This is my first post here, but I’ve been reading this forum religiously for months. Huge thanks to everyone who shares their findings — the hybrid quant pioneers, the NVFP4 explorers, the llama.cpp benchmarkers, and everyone debugging SM121 quirks in the trenches. You’ve all saved me countless hours. Figured it was time to give something back.

norman.2 · April 5, 2026, 10:00am

Since the patches are Python-only, we can probably integrate this as a mod, @eugr? :)

Nice, thank you for the effort! Will give this a try, as this model is usually my daily driver.

norman.2 · April 5, 2026, 11:00am

Did you come across errors like this?

WARNING: unexpected unmatched FP8 tensor model.language_model.layers.0.linear_attn.in_proj_a.weight
WARNING: unexpected unmatched FP8 tensor model.language_model.layers.0.linear_attn.in_proj_b.weight
WARNING: unexpected unmatched FP8 tensor model.language_model.layers.0.linear_attn.in_proj_qkv.weight
WARNING: unexpected unmatched FP8 tensor model.language_model.layers.0.linear_attn.in_proj_qkv.weight_scale_inv
WARNING: unexpected unmatched FP8 tensor model.language_model.layers.0.linear_attn.in_proj_z.weight
WARNING: unexpected unmatched FP8 tensor model.language_model.layers.0.linear_attn.in_proj_z.weight_scale_inv

WARNING: proceeding despite 408 unexpected unmatched FP8 tensors because --force was provided

Brush · April 5, 2026, 11:00am

Thanks for sharing. I’ll give it a try … and feedback

Working mainly with the 35B version, but would love to switch to 122B, if only the speed would increase. Your solution looks promising.

AoE · April 5, 2026, 11:01am

If I understand your investigation correctly, to be able to benefit from MTP, we only have to build a new model version with:

python patches/02-mtp-speculative/add-mtp-weights.py \
    --source "$INTEL_DIR" \
    --target ~/models/qwen35-122b-hybrid-int4fp8

and everyone who simply added the line to the launch script before that only got a placebo effect?

Albond · April 5, 2026, 11:13am

Yes, this is expected. The 408 unmatched FP8 tensors are mostly linear_attn projections (DeltaNet layers — 36 out of 48 layers in Qwen3.5) plus some attention norms and gates. They exist in the Qwen FP8 checkpoint but don’t have matching counterparts in the Intel AutoRound INT4 checkpoint because the naming conventions differ.
The script only replaces shared_expert dense layers (144 tensors) with FP8 — everything else stays in its original format (BF16/INT4). The --force flag is the right call here.
I verified output quality after building the checkpoint — math, code, Bayesian reasoning, language — no degradation compared to the pure INT4 baseline. The 408 skipped tensors are not a problem.
PS: Trust me, I’ve seen worse — when I tried to fix Gemma 4’s heterogeneous head_dim (256/512) to make FlashAttention work, I got garbage output and wasted time.
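
The selection rule described above (replace only shared_expert tensors, skip everything else) can be sketched as a simple name filter. This is an illustration under assumptions, not the code in build-hybrid-checkpoint.py: the function name is hypothetical, and so is the shared_expert tensor name in the example. The linear_attn name is taken from the warnings earlier in the thread.

```python
def partition_fp8_tensors(fp8_names, marker="shared_expert"):
    """Split FP8 tensor names into those to replace and those to skip."""
    replaced = [name for name in fp8_names if marker in name]
    skipped = [name for name in fp8_names if marker not in name]
    return replaced, skipped


names = [
    # Hypothetical name, shown only to illustrate a match.
    "model.language_model.layers.0.mlp.shared_expert.gate_proj.weight",
    # Name from the warning output: a DeltaNet projection that is skipped.
    "model.language_model.layers.0.linear_attn.in_proj_qkv.weight",
]
replaced, skipped = partition_fp8_tensors(names)
print(len(replaced), "replaced,", len(skipped), "skipped")
```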

Albond · April 5, 2026, 11:14am

Yes, MTP is the bigger win, and it works independently of the hybrid patch.
I haven’t actually benchmarked MTP on the plain Intel AutoRound INT4 checkpoint without the hybrid patch.
What I did test end-to-end:

Baseline INT4 + FlashInfer: 28.3 tok/s (verified)
Hybrid INT4+FP8 + FlashInfer: 30.8 tok/s (verified)
Hybrid INT4+FP8 + FlashInfer + MTP-1: 38.4 tok/s (verified)
If you want to skip the hybrid step and just try MTP on vanilla Intel AutoRound, it should work — the MTP weights are architecture-level, not quantization-dependent. But I can’t guarantee the exact number until someone tests it.

Albond · April 5, 2026, 11:28am

The whole point of this optimization work was to get Claude-level intelligence without Claude-level costs. Qwen3.5-122B scores 42 on the Artificial Analysis Intelligence Index — one point below Claude 4.5 Sonnet (43), and beats it on IFBench (76% vs 57%) and Humanity’s Last Exam (23% vs 17%).
That’s why every tok/s matters here — I’m not optimizing a benchmark toy, I’m trying to preserve that intelligence while making it actually usable for daily work. 38.4 tok/s of near-Claude reasoning, fully local, no API bills. The Spark paid for itself in about two months of not paying for cloud API tokens.
I also have Gemma 4 31B-IT at ~10 tok/s — same quality scores but 3.8x slower because 31B dense active params vs 10B MoE. On LPDDR5x, MoE architecture is the only way to run 100B+ class models at interactive speeds. If I could 3D-print a TPU at home, maybe Gemma 4 would win. But I can’t, so here we are.

norman.2 · April 5, 2026, 11:49am

Hybrid INT4+FP8: detected 144 FP8 dense layers (block_size=[128, 128])

Sounds good :D currently evaluating, as I am running this on the community docker, which is only vLLM 0.18.1rc1.dev41. But seems to work, thank you!

── Run 1/2 ──────────────────────────────────────
[Q&A] 256 tokens in 6.62s = 38.6 tok/s (prompt: 23)
[Code] 512 tokens in 12.77s = 40.0 tok/s (prompt: 30)
[JSON] 1024 tokens in 25.86s = 39.5 tok/s (prompt: 48)
[Math] 64 tokens in 1.74s = 36.7 tok/s (prompt: 29)

llama-benchy doesn’t see any increase in performance, though. Might that be related to how it tests?

Ignore the model name :)

model	test	t/s	peak t/s	ttfr (ms)	est_ppt (ms)	e2e_ttft (ms)
qwen-coder-next	pp512 @ d2048	2089.65 ± 5.17		1227.46 ± 3.03	1225.57 ± 3.03	1227.54 ± 3.03
qwen-coder-next	tg32 @ d2048	21.05 ± 0.56	22.33 ± 0.55
qwen-coder-next	pp512 @ d2048	2080.09 ± 9.13		1233.11 ± 5.40	1231.22 ± 5.40	1233.18 ± 5.40
qwen-coder-next	tg128 @ d2048	20.57 ± 0.06	22.00 ± 0.00
qwen-coder-next	pp2048 @ d2048	2286.21 ± 1.46		1793.94 ± 1.15	1792.05 ± 1.15	1794.00 ± 1.14
qwen-coder-next	tg32 @ d2048	20.46 ± 0.01	21.74 ± 0.01
qwen-coder-next	pp2048 @ d2048	2219.17 ± 64.54		1849.64 ± 53.74	1847.75 ± 53.74	1849.72 ± 53.74
qwen-coder-next	tg128 @ d2048	20.43 ± 0.07	21.00 ± 0.00
qwen-coder-next	pp8192 @ d2048	2283.25 ± 1.21		4487.17 ± 2.38	4485.28 ± 2.38	4487.21 ± 2.39
qwen-coder-next	tg32 @ d2048	20.80 ± 0.57	22.10 ± 0.61
qwen-coder-next	pp8192 @ d2048	2279.22 ± 1.96		4495.09 ± 3.86	4493.20 ± 3.86	4495.13 ± 3.86
qwen-coder-next	tg128 @ d2048	20.13 ± 0.06	21.00 ± 0.00
qwen-coder-next	pp512 @ d12000	2209.33 ± 2.70		5665.60 ± 6.92	5663.71 ± 6.92	5665.64 ± 6.92
qwen-coder-next	tg32 @ d12000	20.59 ± 0.48	21.88 ± 0.51
qwen-coder-next	pp512 @ d12000	2207.36 ± 2.13		5670.66 ± 5.47	5668.77 ± 5.47	5670.70 ± 5.47
qwen-coder-next	tg128 @ d12000	19.86 ± 0.09	21.00 ± 0.00
qwen-coder-next	pp2048 @ d12000	2210.20 ± 1.59		6358.32 ± 4.56	6356.43 ± 4.56	6358.36 ± 4.57
qwen-coder-next	tg32 @ d12000	20.10 ± 0.04	21.36 ± 0.05
qwen-coder-next	pp2048 @ d12000	2200.47 ± 10.52		6386.37 ± 30.76	6384.48 ± 30.76	6386.44 ± 30.75
qwen-coder-next	tg128 @ d12000	19.97 ± 0.07	21.00 ± 0.00
qwen-coder-next	pp8192 @ d12000	2153.08 ± 1.11		9380.57 ± 4.86	9378.68 ± 4.86	9380.63 ± 4.84
qwen-coder-next	tg32 @ d12000	20.95 ± 0.22	22.26 ± 0.23
qwen-coder-next	pp8192 @ d12000	2153.40 ± 0.88		9379.14 ± 3.84	9377.25 ± 3.84	9379.24 ± 3.84
qwen-coder-next	tg128 @ d12000	19.39 ± 0.02	20.00 ± 0.00

AoE · April 5, 2026, 12:08pm

The script doesn’t seem to be doing anything for me:

Found 785 MTP tensors in source index
Added 0 MTP tensor mappings to index
Total tensors: 112901
Done. MTP speculative decoding is now available.

Albond · April 5, 2026, 12:09pm

About the llama-benchy numbers — the difference is real, just measured differently.

Think of it this way: without MTP, the model does 1 decode step = 1 token. With MTP, the model does 1 decode step but produces ~2 tokens (1 regular + 1 speculative, 95% accepted).

llama-benchy measures decode steps per second — how fast the model runs forward passes. That’s ~20 steps/sec, and each step is actually a tiny bit slower now because of the MTP head overhead. So llama-benchy sees no improvement or even a slight slowdown.

bench_qwen35.sh and real chat measure what you actually get — tokens out divided by wall-clock time. 20 steps/sec × ~1.95 accepted tokens per step = ~39 tok/s. That’s the number you feel when using the model.

Both are correct:

~20 tok/s = how fast the engine runs (decode steps)
~38-40 tok/s = how fast you get your answer (effective throughput)

I see the same thing in my daily use — same prompt that used to take 26 seconds now finishes in 17.
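
The arithmetic above can be written out as a small model. This is a sketch, assuming each draft token is accepted independently with the same probability and that a rejection discards every later draft; the function names are hypothetical. With one speculative token it reduces to 1 + acceptance rate, which matches the ~1.95 figure.

```python
def expected_tokens_per_step(acceptance_rate: float, num_speculative_tokens: int = 1) -> float:
    """Expected tokens per decode step: 1 regular token plus accepted drafts."""
    # Draft i only counts if drafts 1..i were all accepted, hence the powers.
    return sum(acceptance_rate ** i for i in range(num_speculative_tokens + 1))


def effective_tok_s(steps_per_s: float, acceptance_rate: float,
                    num_speculative_tokens: int = 1) -> float:
    """Wall-clock tokens/s given the engine's decode step rate."""
    return steps_per_s * expected_tokens_per_step(acceptance_rate, num_speculative_tokens)


# ~20 steps/s from llama-benchy, 95% acceptance, MTP-1
print(f"{effective_tok_s(20, 0.95):.1f} tok/s")
```

The same function shows why step-rate benchmarks and chat-style benchmarks disagree: the first reports `steps_per_s`, the second reports the product.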

Albond · April 5, 2026, 12:12pm

That’s actually fine — the original Intel AutoRound checkpoint already has MTP tensor mappings in its index. The script found them all present, so nothing to add.

Just make sure the actual weights file exists in your checkpoint directory:

ls -lh /path/to/your/checkpoint/model_extra_tensors.safetensors

If it’s there (about 4.8 GB), you’re all set. Add --speculative-config '{"method":"mtp","num_speculative_tokens":1}' to your launch command and enjoy ~30+ tok/s.

The hybrid step (my patches) is a separate optimization on top — MTP works independently.
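
The same check can cover both the index and the weights file in one go. This is a rough sketch, assuming the checkpoint uses the usual model.safetensors.index.json layout with a weight_map, and that MTP tensor names contain the substring "mtp". Both are assumptions, and `check_mtp_index` is a hypothetical helper, not part of the repo.

```python
import json
from pathlib import Path


def check_mtp_index(checkpoint_dir, marker="mtp"):
    """Count MTP entries in the index and list referenced files that are missing."""
    root = Path(checkpoint_dir)
    index = json.loads((root / "model.safetensors.index.json").read_text())
    weight_map = index["weight_map"]  # tensor name -> shard file name
    mtp_names = [name for name in weight_map if marker in name]
    shards = sorted({weight_map[name] for name in mtp_names})
    missing = [shard for shard in shards if not (root / shard).exists()]
    return {"mtp_tensors": len(mtp_names), "shards": shards, "missing": missing}


# Usage (path is a placeholder):
# print(check_mtp_index("/path/to/your/checkpoint"))
```

A result with zero MTP tensors would correspond to the hybrid-checkpoint case that add-mtp-weights.py is meant to fix; a non-empty `missing` list means the index points at a file that is not on disk.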

AoE · April 5, 2026, 12:16pm

Why run the script if the index already points to the weights?
I’m asking because you wrote:

The MTP part was a surprise. Intel AutoRound actually includes the MTP head weights (model_extra_tensors.safetensors, 4.8 GB) but doesn’t reference them in the model index

So I was expecting the index to be updated.

Albond · April 5, 2026, 12:43pm

Hmm, good catch — need to correct that. The original Intel AutoRound checkpoint does have MTP in the index. The issue only shows up with the hybrid checkpoint from build-hybrid-checkpoint.py which doesn’t carry over MTP mappings. That’s what the script fixes.
So if you’re on vanilla Intel AutoRound, just add --speculative-config '{"method":"mtp","num_speculative_tokens":1}' and you’re done.

Tested it just now:

── Run 2/2 (warm cache) ──────────────────────
[Q&A] 256 tokens in 7.08s = 36.1 tok/s
[Code] 502 tokens in 13.41s = 37.4 tok/s
[JSON] 1024 tokens in 28.05s = 36.5 tok/s
[Math] 64 tokens in 1.86s = 34.4 tok/s
[LongCode] 2048 tokens in 54.01s = 37.9 tok/s

Baseline (INT4 + FlashInfer): 28.3 tok/s
INT4 + MTP only: 36.5 tok/s (+29%)
Hybrid + MTP: 38.4 tok/s (+36%)

MTP alone is the biggest win. Hybrid adds ~2 tok/s on top.
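
For anyone comparing these percentages with the table in the first post: the figures here are relative to the 28.3 baseline, while the first post's +25% for MTP is relative to the hybrid step (30.8). A small sketch of the arithmetic, using only numbers from this thread; `gain_pct` is a hypothetical helper name.

```python
def gain_pct(reference: float, value: float) -> float:
    """Relative gain of value over reference, in percent."""
    return (value / reference - 1.0) * 100.0


baseline, hybrid, mtp_only, hybrid_mtp = 28.3, 30.8, 36.5, 38.4

print(f"MTP only vs baseline:    +{gain_pct(baseline, mtp_only):.1f}%")
print(f"Hybrid+MTP vs baseline:  +{gain_pct(baseline, hybrid_mtp):.1f}%")
print(f"Hybrid+MTP vs hybrid:    +{gain_pct(hybrid, hybrid_mtp):.1f}%")
print(f"Hybrid+MTP vs MTP only:  +{gain_pct(mtp_only, hybrid_mtp):.1f}%")
```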

AoE · April 5, 2026, 12:48pm

Thanks for checking :)
MTP doesn’t work with the PyTorch backend, so I was hoping that was because something was wrong with the model. Now I’ll have to try with Ray to see whether I get the same kind of gains.

Albond · April 5, 2026, 1:02pm

What error do you get with MTP? And which vLLM version / attention backend are you on? On 0.19 with FlashInfer it works out of the box, but there were several MTP-related bugs in earlier versions for Qwen3.5 (#36843, #36917).

AoE · April 5, 2026, 1:54pm

This is only when using a cluster with eugr’s docker. The connection to the 2nd node never happens. The script gives up after 10 minutes.
No problem with Ray using the unmodified model, and up to 56t/s in agt from the logs when using OWUI, with 85%-100% acceptance rate.

AoE · April 5, 2026, 2:01pm

The latest is v0.19 btw

norman.2 · April 5, 2026, 2:19pm

The cached TF5 version is 0.18 and the main build is currently broken according to eugr. So I did not dare to go to 0.19 :D

AoE · April 5, 2026, 2:26pm

Maybe you caught it at a bad time? This is what I’m running right now:

[utils.py:299]        █     █     █▄   ▄█
[utils.py:299]  ▄▄ ▄█ █     █     █ ▀▄▀ █  version 0.19.1rc1.dev15+g50cd5674b.d20260403
[utils.py:299]   █▄█▀ █     █     █     █  model   Intel/Qwen3.5-122B-A10B-int4-AutoRound
[utils.py:299]    ▀▀  ▀▀▀▀▀ ▀▀▀▀▀ ▀     ▀
