50%+ Improvement on spark?!

sesmanovic · March 16, 2026, 4:49pm

Hi, I’m currently trying to fix a few issues so I can run the benchmark properly. At the moment, it crashes when MTP > 2. But yes, of course, I can push my work to your repo. I just want to make sure I’m doing it the right way. The goal is to contribute usefully, not to push something pointless.

DannyTup · March 16, 2026, 7:37pm

FWIW in case others didn’t see it, the original reddit post was edited with this:

EDIT: BASICALLY IGNORE THE RESULTS BELOW, because I couldn’t reproduce them with respect to speed while controlling for the variables of thinking enabled and MTP. While controlling for them I saw maybe a 2.5 to 6 percent increase, which is probably within MOE. My apologies on this one, folks. I’m sorry.

An increase is still an increase, but if I’m understanding the edit properly, it’s not the jump that was originally described.

sesmanovic · March 16, 2026, 8:59pm

After the feedback here, I reran the tests with llama-benchy instead of relying on my initial manual measurements and vLLM throughput logs.

My original post reported 19.6 tok/s with MTP=3, but after digging deeper, that number was misleading for real user-visible throughput. The vLLM logs were clearly not the right metric to use here for speculative decoding on this setup.

Using llama-benchy, the best stable result I can get on a single DGX Spark with Qwen3.5-122B-A10B-NVFP4 is actually with MTP disabled:

llama-benchy results

MTP=0

model	test	t/s	peak t/s	ttfr (ms)	est_ppt (ms)	e2e_ttft (ms)
sehyo/Qwen3.5-122B-A10B-NVFP4	pp128	264.74 ± 184.85		11084.43 ± 15212.30	11083.72 ± 15212.30	11084.50 ± 15212.31
sehyo/Qwen3.5-122B-A10B-NVFP4	tg256	14.31 ± 0.07	15.00 ± 0.00

MTP=1

model	test	t/s	peak t/s	ttfr (ms)	est_ppt (ms)	e2e_ttft (ms)
sehyo/Qwen3.5-122B-A10B-NVFP4	pp128	245.77 ± 122.02		847.31 ± 647.26	846.20 ± 647.26	847.37 ± 647.27
sehyo/Qwen3.5-122B-A10B-NVFP4	tg256	11.86 ± 0.10	13.00 ± 0.00

I also tested MTP=2, and llama-benchy crashes on the third pass on my setup, so that configuration is not stable here.

So the corrected conclusion is:

- MTP=0 is currently the best stable configuration on my DGX Spark
- MTP=1 is slower than no MTP
- MTP=2 is unstable under llama-benchy
- The previous ~19.6 tok/s number from my original post should not be treated as real end-user throughput

In short, the best stable throughput I can currently reproduce is 14.31 tok/s with MTP=0.
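
For anyone who wants to sanity-check the comparison, here is a minimal sketch of the arithmetic behind the tg256 numbers above. It assumes the ± values in the llama-benchy tables can be read as a run-to-run spread; the names `results` and `relative_change` are only illustrative and are not part of llama-benchy or vLLM.

```python
# tg256 figures copied from the llama-benchy tables above (tokens/s).
# Assumption: the value after "±" is treated as the run-to-run spread.
results = {
    "MTP=0": (14.31, 0.07),
    "MTP=1": (11.86, 0.10),
}

def relative_change(baseline: float, candidate: float) -> float:
    """Return the change from baseline to candidate as a percentage."""
    return (candidate - baseline) / baseline * 100.0

base_mean, base_spread = results["MTP=0"]
mtp1_mean, mtp1_spread = results["MTP=1"]

change = relative_change(base_mean, mtp1_mean)
gap = abs(mtp1_mean - base_mean)
noise = base_spread + mtp1_spread  # crude combined spread, not a formal test

print(f"MTP=1 vs MTP=0: {change:+.1f}%")
print(f"gap {gap:.2f} tok/s vs combined spread {noise:.2f} tok/s")
```

With these inputs the script reports roughly a 17% drop for MTP=1, and the 2.45 tok/s gap is far larger than the combined spread, so the slowdown does not look like measurement noise. The pp128 rows are a different story: their spread is a large fraction of the mean, so I would not read much into those.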

Thanks to the people here who pushed me to validate this with llama-benchy instead of relying on the raw vLLM logs.

Dickson · March 16, 2026, 9:27pm

Good thing he corrected himself. Is it just me, or are the AI-psychosis-fueled “breakthrough” posts becoming more frequent?

twaggs88 · March 18, 2026, 2:17am

I got a decent boost with this PR + CUTLASS 4.4.2.

raphael.amorim · March 24, 2026, 5:11am

It seems we’re still far from AGI. Hold on to your brains fellas, we’re going to need all of them for quite some time still.
