56 tok/s on dual Spark … tempting, but that requires a second Spark! Though for me personally, buying a second Spark just for ~70% scaling is hard to justify — especially in my country where power isn’t always guaranteed and I’m running off a power station from time to time :) Would need 4x Spark to unlock 250B+ models, and that’s a different budget entirely and not enough electricity right now. So I’ll keep squeezing every last drop out of a single Spark — maybe there’s still something left to find. Would love to hear if the community has any new ideas worth trying.
Has anyone tested EAGLE-3 instead of MTP-1 (arXiv:2503.01840)? Are there any quality issues with it?
@flash3 did some work with EAGLE: vllm-marlin-sm12x/RESULTS.md in flash7777/vllm-marlin-sm12x on GitHub.
@Albond Thank you for this thread. I noticed something that may be interesting to you:
--speculative-config '{"method":"qwen3_next_mtp","num_speculative_tokens":2}'
This is documented in the README.md of Intel/Qwen3.5-35B-A3B-int4-AutoRound, but it is not mentioned in the README.md of Intel/Qwen3.5-122B-A10B-int4-AutoRound. Have you tried it with any success?
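For anyone copy-pasting: here's a minimal sketch of how the flag fits on a full command line, assuming the Intel repo above and an otherwise standard launch. Note that straight ASCII quotes are required; the curly quotes that forum software substitutes will break JSON parsing.

```shell
# Sanity-check the JSON before passing it to vLLM (the model name is the
# Intel repo from this thread; your usual launch flags go alongside it).
SPEC_CONFIG='{"method":"qwen3_next_mtp","num_speculative_tokens":2}'
echo "$SPEC_CONFIG" | python3 -m json.tool > /dev/null && echo "config parses"

# vllm serve Intel/Qwen3.5-122B-A10B-int4-AutoRound \
#   --speculative-config "$SPEC_CONFIG"
```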
Thank you for this param. Up to +20% for free! I will update the project and the main post soon with some new updates and include it in the scripts.
But I want to check which is best: EAGLE-3 vs MTP-2. I think EAGLE-3 should be better. Let's see.
DGX has more success at lower NST (e.g. NST=1), unlike RTX (NST=3): DGX is memory-bound.
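Since the best NST differs per machine, it may be worth sweeping it. A hedged sketch using run-recipe.sh from this thread; the loop only prints the configs, and the actual run line is left commented out:

```shell
# Sweep num_speculative_tokens: memory-bound boxes (DGX) tend to prefer
# low NST, compute-bound ones (RTX) higher. Uncomment the run line to benchmark.
for nst in 1 2 3; do
  cfg=$(printf '{"method":"qwen3_next_mtp","num_speculative_tokens":%d}' "$nst")
  echo "NST=$nst -> $cfg"
  # ./run-recipe.sh qwen3.5-122b-int4-autoround --solo -- --speculative-config "$cfg"
done
```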
Nice initiative, I really like the idea, going to try it out.
FYI for the initial patch call, I had to change the flags to --fp8-repo and --output from what is shown in the main README.
Here are my results (single DGX Spark, non-headless), using eugr's vLLM build and just adding the speculative-config to the command line:
./run-recipe.sh qwen3.5-122b-int4-autoround --solo -e HF_TOKEN=$HF_TOKEN -- --speculative-config '{"method":"mtp","num_speculative_tokens": 1}'
── Run 2/2 ──────────────────────────────────────
[Q&A] 256 tokens in 7.13s = 35.9 tok/s (prompt: 23)
[Code] 502 tokens in 13.41s = 37.4 tok/s (prompt: 30)
[JSON] 1024 tokens in 27.85s = 36.7 tok/s (prompt: 48)
[Math] 64 tokens in 1.91s = 33.5 tok/s (prompt: 29)
[LongCode] 2048 tokens in 54.14s = 37.8 tok/s (prompt: 37)
./run-recipe.sh qwen3.5-122b-int4-autoround --solo -e HF_TOKEN=$HF_TOKEN -- --speculative-config '{"method":"qwen3_next_mtp","num_speculative_tokens": 2}'
── Run 2/2 ──────────────────────────────────────
[Q&A] 256 tokens in 6.52s = 39.2 tok/s (prompt: 23)
[Code] 512 tokens in 12.80s = 40.0 tok/s (prompt: 30)
[JSON] 1024 tokens in 25.83s = 39.6 tok/s (prompt: 48)
[Math] 64 tokens in 1.72s = 37.2 tok/s (prompt: 29)
[LongCode] 2048 tokens in 49.87s = 41.0 tok/s (prompt: 37)
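Comparing the two LongCode rows directly (both generate 2048 tokens), the switch from MTP-1 to qwen3_next_mtp with NST=2 works out to roughly:

```shell
# Relative gain from 37.8 tok/s (mtp, NST=1) to 41.0 tok/s (qwen3_next_mtp, NST=2)
awk -v a=37.8 -v b=41.0 'BEGIN { printf "%.1f%% faster\n", (b - a) / a * 100 }'
# prints: 8.5% faster
```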
Had similar … but stuck with EAGLE-3 to compare EAGLE-3 and MTP-2. EAGLE-3 has trouble with DeltaNet. Right now MTP-2 gives the best performance improvement.
EAGLE-3 is slower than MTP-2 on the Qwen3.5 hybrid (DeltaNet) architecture. Will update soon. Thank you for such features!
Big picture: if you want higher density, I suggest running a Qwen3.5-27B quant with MTP instead of Gemini. I prefer the FP8 version with RYS-XL, which has a few duplicated layers with all features present, for higher intelligence at ~31B total parameters. With MTP that model gives 11-12 tok/s and frankly seems smarter than Gemini 31B.
Sorry for the tangent, back to the 122B hybrid…
I've been seeing consistent issues with compact models; for me, an LLM below 100B comes with a noticeable increase in failure rates. That said, I also see progress in intelligence density. If I had the choice, I'd run models at 500B.
What do you think about the Sehyo/Qwen3.5-122B-A10B-NVFP4 model? It also supports MTP and is well quantized.
So do we think MTP would also improve Qwen397B? Has anyone looked into that?
Totally. I was able to run Qwen3.5-27B with MTP3 on an R9700, but it only helped when running in TP=2 mode, lifting FP8 quant performance to about 30 tok/s with roughly a 70% acceptance rate. That figure came from counting tokens against a wall clock, though; llama-bench style calculations were lower, in the 12-13 tok/s range, as discussed earlier here. Anecdotally, based on vibes from interacting with the model in OpenWebUI, the speed did not seem as good as what I was getting with Qwen397 running on dual Sparks. Right now I'm running 122B on dual Sparks and getting 45 tok/s, which feels pretty speedy; prompt caching is good and prompt processing is excellent. It really feels better than whatever Anthropic gives us, almost as good as Codex in fast mode. To be clear, when I say it feels good, I'm talking about response speed, not response quality.
So I think my next project will be to see if MTP helps Qwen397B. If we could bring it into the 40 tok/s range, that would be excellent.
I also gained about 8% performance by using the --no-ray option for 122B, so I had to use eugr's recipe instead of Sparkrun.
From the GitHub README, on NVFP4 quantization (16.6 tok/s, 42% slower):
RedHatAI/Qwen3.5-122B-A10B-NVFP4 sounds great on paper. In practice: SM121 doesn’t have working FP4 CUTLASS kernels in vLLM yet, so it falls back to Marlin SM80 which handles FP4 poorly. Result: 16.6 tok/s, less than half our baseline. Waiting for vLLM PRs #38957 and #31607.
I am trying to check all options for Qwen3.5 MoE, and right now MTP is the best: lossless, free tokens. But I found some other lossless ways to get more tokens, for example an INT8 LM head via Triton. Theoretically, we can get about 100 tok/s for Qwen3.5 MoE 122B-A10B.
Could you please give a link, or add instructions to the README.md, on how to compile vLLM 0.19.x for SM121?
You don't need to compile vLLM from source manually; there's a community project that handles everything: eugr/spark-vllm-docker on GitHub (Docker configuration for running vLLM on dual DGX Sparks). It builds a Docker image with vLLM + FlashInfer compiled for SM121. By default it downloads prebuilt wheels from GitHub releases, so the build takes ~2-3 minutes (no source compilation needed).
If you want to build from a specific vLLM version:
./build-and-copy.sh --vllm-ref v0.19.1
Or build from source (takes 20-40 min first time):
./build-and-copy.sh --rebuild-vllm --rebuild-flashinfer
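After the build, a quick smoke test can confirm the image imports vLLM. A sketch only: the image tag `spark-vllm:latest` is an assumption, so substitute whatever tag build-and-copy.sh actually produced.

```shell
# Image tag is an assumption; check `docker images` for the real one.
docker run --rm spark-vllm:latest python3 -c "import vllm; print(vllm.__version__)"
```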
Regarding NVFP4 on SM121 — I only tested RedHatAI/Qwen3.5-122B-A10B-NVFP4, which gave 16.6 tok/s (42% slower than INT4 baseline). Haven’t tried Sehyo/Qwen3.5-122B-A10B-NVFP4 specifically, but the bottleneck is the same — SM121 doesn’t have native FP4 CUTLASS kernels in vLLM yet, so any NVFP4 model falls back to Marlin SM80 which handles FP4 poorly. Waiting on vLLM PRs #38957 and #31607. Once those land, NVFP4 might become competitive — but until then, INT4 AutoRound is the way to go on Spark.
So I'm testing on dual Spark with MTP-2 and MTP-1, and I'm getting regressions.
Config │ t/s │ Acceptance
no MTP │ 44.5 │ N/A
MTP-1 │ ~27 │ 82.8%
MTP-2 │ ~28 │ 68.6%
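For context, a standard back-of-the-envelope model: with per-token acceptance rate a and draft length k, speculative decoding emits (1 - a^(k+1)) / (1 - a) tokens per verification step on average. Plugging in the acceptance rates from the table:

```shell
# Expected tokens per verification step; this model ignores draft overhead
# and any cudagraph-mode penalty, which is exactly where a regression
# despite high acceptance would have to come from.
awk 'BEGIN {
  a = 0.828; k = 1; printf "MTP-1: %.2f tokens/step\n", (1 - a^(k+1)) / (1 - a)
  a = 0.686; k = 2; printf "MTP-2: %.2f tokens/step\n", (1 - a^(k+1)) / (1 - a)
}'
# prints: MTP-1: 1.83 tokens/step
#         MTP-2: 2.16 tokens/step
```

Both configs should beat the baseline by this estimate, so the observed slowdown points to per-step overhead rather than poor acceptance.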
Claude is saying:
The PIECEWISE cudagraph mode (forced by FlashInfer + spec-decode) could be the culprit.
As @AoE mentioned, MTP doesn't work on dual Spark. I don't have a second Spark to investigate the reason right now.