About the llama-benchy numbers: the speedup is real, the two benchmarks just measure different things.
Think of it this way: without MTP, the model does 1 decode step = 1 token. With MTP, the model does 1 decode step but produces ~2 tokens (1 regular + 1 speculative, and the speculative one is accepted ~95% of the time), so about 1.95 tokens per step on average.
llama-benchy measures decode steps per second — how fast the model runs forward passes. That’s ~20 steps/sec, and each step is actually a tiny bit slower now because of the MTP head overhead. So llama-benchy sees no improvement or even a slight slowdown.
bench_qwen35.sh and real chat measure what you actually get — tokens out divided by wall-clock time. 20 steps/sec × ~1.95 accepted tokens per step = ~39 tok/s. That’s the number you feel when using the model.
Both are correct:
- ~20 tok/s (llama-benchy) = how fast the engine runs (really decode steps per second)
- ~38-40 tok/s (bench_qwen35.sh, real chat) = how fast you get your answer (effective throughput)
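If it helps, the arithmetic behind those two numbers can be sketched in a few lines of Python. This is only a back-of-the-envelope model, not something llama-benchy or bench_qwen35.sh actually runs: the function name `effective_tok_per_sec` is made up for illustration, and it assumes exactly one speculative token per step with a fixed acceptance rate.

```python
def effective_tok_per_sec(steps_per_sec: float, acceptance_rate: float) -> float:
    """Rough effective throughput with one speculative (MTP) token per step.

    Assumption: every decode step yields 1 regular token plus 1 draft token
    that is kept with probability `acceptance_rate`.
    """
    if not 0.0 <= acceptance_rate <= 1.0:
        raise ValueError("acceptance_rate must be between 0 and 1")
    tokens_per_step = 1.0 + acceptance_rate  # 1 regular + expected accepted drafts
    return steps_per_sec * tokens_per_step


if __name__ == "__main__":
    steps = 20.0  # decode steps per second, the figure llama-benchy shows
    print(f"no MTP:   {effective_tok_per_sec(steps, 0.0):.1f} tok/s")
    print(f"with MTP: {effective_tok_per_sec(steps, 0.95):.1f} tok/s")
```

With 20 steps/sec and 95% acceptance this gives the ~39 tok/s from above; with an acceptance rate of 0 it falls back to the plain 20 tok/s.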
I see the same thing in my daily use — same prompt that used to take 26 seconds now finishes in 17.