Hi, I’m currently trying to fix a few issues so I can run the benchmark properly. At the moment, it crashes when MTP > 2. But yes, of course, I can push my work to your repo. I just want to make sure I’m doing it the right way. The goal is to contribute usefully, not to push something pointless.
FWIW in case others didn’t see it, the original reddit post was edited with this:
EDIT: BASICALLY IGNORE THESE RESULTS below, because I couldn’t reproduce them with respect to speed while controlling for the variables of thinking enabled and MTP. While controlling for them I saw maybe a 2.5 to 6 percent increase, which is probably within MOE. My apologies on this one, folks. I’m sorry.
An increase is still an increase, but if I’m understanding the edit properly, it’s not the jump that was originally described.
After the feedback here, I reran the tests with llama-benchy instead of relying on my initial manual measurements and vLLM throughput logs.
My original post reported 19.6 tok/s with MTP=3, but after digging deeper, that number was misleading for real user-visible throughput. The vLLM logs were clearly not the right metric to use here for speculative decoding on this setup.
Using llama-benchy, the best stable result I can get on a single DGX Spark with Qwen3.5-122B-A10B-NVFP4 is actually with MTP disabled:
llama-benchy results
MTP=0
| model | test | t/s | peak t/s | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |
|---|---|---|---|---|---|---|
| sehyo/Qwen3.5-122B-A10B-NVFP4 | pp128 | 264.74 ± 184.85 | | 11084.43 ± 15212.30 | 11083.72 ± 15212.30 | 11084.50 ± 15212.31 |
| sehyo/Qwen3.5-122B-A10B-NVFP4 | tg256 | 14.31 ± 0.07 | 15.00 ± 0.00 | | | |
MTP=1
| model | test | t/s | peak t/s | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |
|---|---|---|---|---|---|---|
| sehyo/Qwen3.5-122B-A10B-NVFP4 | pp128 | 245.77 ± 122.02 | | 847.31 ± 647.26 | 846.20 ± 647.26 | 847.37 ± 647.27 |
| sehyo/Qwen3.5-122B-A10B-NVFP4 | tg256 | 11.86 ± 0.10 | 13.00 ± 0.00 | | | |
I also tested MTP=2, and llama-benchy crashes on the third pass on my setup, so that configuration is not stable here.
So the corrected conclusion is:
- `MTP=0` is currently the best stable configuration on my DGX Spark
- `MTP=1` is slower than no MTP
- `MTP=2` is unstable under `llama-benchy`
- the previous `~19.6 tok/s` number from my original post should not be treated as real end-user throughput
In short, the best stable throughput I can currently reproduce is 14.31 tok/s with MTP=0.
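For anyone who wants to sanity-check the comparison, here is a minimal Python sketch that works only from the tg256 numbers in the tables above. The helper names are made up for illustration, and it assumes the `±` values are standard deviations across runs, which I have not verified against how llama-benchy computes them.

```python
import math


def relative_change(baseline: float, candidate: float) -> float:
    """Percent change of candidate relative to baseline."""
    return (candidate - baseline) / baseline * 100.0


def gap_in_spreads(mean_a: float, spread_a: float,
                   mean_b: float, spread_b: float) -> float:
    """Gap between two means, in units of their combined spread.

    Assumption: each spread is a standard deviation, combined as a
    root-sum-square. This is a rough noise check, not a formal test.
    """
    return abs(mean_a - mean_b) / math.sqrt(spread_a ** 2 + spread_b ** 2)


# tg256 (mean tok/s, reported +/- value) copied from the tables above
tg_mtp0 = (14.31, 0.07)
tg_mtp1 = (11.86, 0.10)

change = relative_change(tg_mtp0[0], tg_mtp1[0])
gap = gap_in_spreads(*tg_mtp0, *tg_mtp1)

print(f"MTP=1 vs MTP=0 generation speed: {change:+.1f}%")
print(f"Gap relative to combined spread: {gap:.1f}x")
```

With these inputs the change comes out at roughly -17%, and the gap is many times the reported spread, so the tg256 slowdown with MTP=1 does not look like run-to-run noise. The pp128 rows are a different story: their `±` values are almost as large as the means, so I would not read much into the prefill numbers.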
Thanks to the people here who pushed me to validate this with llama-benchy instead of relying on the raw vLLM logs.
Good thing he corrected himself. Is it just me, or are the AI-psychosis-fueled “breakthrough” posts becoming more frequent?
I got a decent boost with this PR + CUTLASS 4.4.2.
It seems we’re still far from AGI. Hold on to your brains fellas, we’re going to need all of them for quite some time still.