Qwen3.5-122B-A10B on single Spark: up to 51 tok/s (v2.1 — patches + quick-start + benchmark)

About the llama-benchy numbers — the difference is real, just measured differently.

Think of it this way: without MTP, the model does 1 decode step = 1 token. With MTP, the model still does 1 decode step, but that step yields ~1.95 tokens on average (1 regular + 1 speculative, and the speculative one is accepted ~95% of the time).

llama-benchy measures decode steps per second — how fast the model runs forward passes. That’s ~20 steps/sec, and each step is actually a tiny bit slower now because of the MTP head overhead. So llama-benchy sees no improvement or even a slight slowdown.

bench_qwen35.sh and real chat measure what you actually get — tokens out divided by wall-clock time. 20 steps/sec × ~1.95 accepted tokens per step = ~39 tok/s. That’s the number you feel when using the model.
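If you want to plug in your own numbers, here is a rough sketch of that arithmetic in Python. The function names are made up for illustration (they are not part of llama-benchy or bench_qwen35.sh), and it assumes each speculative token is accepted independently at a fixed rate, which is a simplification of what the engine really does.

```python
def expected_tokens_per_step(acceptance_rate: float, draft_tokens: int = 1) -> float:
    """Average tokens emitted per decode step.

    Assumption: every step yields 1 regular token, and draft token k only
    counts if drafts 1..k were all accepted (each with the same probability).
    """
    return 1.0 + sum(acceptance_rate ** k for k in range(1, draft_tokens + 1))


def effective_tok_per_sec(steps_per_sec: float,
                          acceptance_rate: float,
                          draft_tokens: int = 1) -> float:
    """Effective throughput = decode steps per second x tokens per step."""
    return steps_per_sec * expected_tokens_per_step(acceptance_rate, draft_tokens)


if __name__ == "__main__":
    steps_per_sec = 20.0   # what llama-benchy effectively measures
    acceptance = 0.95      # share of speculative tokens that get accepted

    # 20 steps/sec x 1.95 tokens/step, roughly 39 tok/s
    print(f"{effective_tok_per_sec(steps_per_sec, acceptance):.1f} tok/s")
```

With acceptance set to 0 this falls back to plain 1 token per step, which is the no-MTP case.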

Both are correct:

  • ~20 tok/s in llama-benchy = how fast the engine runs (really decode steps per second)
  • ~38-40 tok/s = how fast you get your answer (effective throughput)

I see the same thing in my daily use — same prompt that used to take 26 seconds now finishes in 17.