Qwen3.5-122B-A10B on single Spark: up to 51 tok/s (v2.1 — patches + quick-start + benchmark)

Spot on, @stefan132. The quality/perplexity hit on INT4 is a valid concern, especially for complex reasoning or coding where every bit of precision counts.

I’m actually finalizing an update to my vllm.sh manager script that adds a ‘Native FP8 + MTP’ option as a first-class citizen. It skips the INT4 merge entirely and patches the MTP weights directly onto the official Qwen FP8 repo.
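For anyone who wants to try that path by hand before the script lands, here’s a minimal sketch of what the FP8 + MTP install step boils down to. The FP8 repo name is the official one; the MTP source repo is a placeholder (the real one the script uses isn’t shown here), and the dry-run wrapper is just so you can preview the commands without kicking off a ~35 GB download.

```shell
#!/usr/bin/env bash
# Sketch of the 'Native FP8 + MTP' install path -- NOT the actual vllm.sh
# implementation. MTP_REPO is a hypothetical placeholder.
set -euo pipefail

FP8_REPO="Qwen/Qwen3.5-35B-A3B-FP8"
MODEL_DIR="${MODEL_DIR:-$HOME/models/qwen3.5-35b-a3b-fp8}"
DRY_RUN="${DRY_RUN:-1}"   # set DRY_RUN=0 to actually download (~35 GB)

# Print the command in dry-run mode, execute it otherwise.
run() { if [ "$DRY_RUN" = "1" ]; then echo "+ $*"; else "$@"; fi; }

# 1. Pull the official FP8 weights -- no INT4 merge involved.
run huggingface-cli download "$FP8_REPO" --local-dir "$MODEL_DIR"

# 2. Overlay the MTP (multi-token prediction) weights on top of the same dir.
MTP_REPO="${MTP_REPO:-some-org/placeholder-mtp-weights}"
run huggingface-cli download "$MTP_REPO" --local-dir "$MODEL_DIR"
```

Run it as-is to see the two commands it would issue, then rerun with `DRY_RUN=0` once you’ve pointed `MTP_REPO` at a real source.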

Here’s a sneak peek of the new interactive menu I’m testing on my GX10. Option 3 is exactly what you’re looking for:

```text
=== vLLM Manager for ASUS GX10 ===
  1. First-time setup (clone repo + build Docker + download model)
  2. Select model and start server
  3. Stop server
  4. View logs
  5. Run benchmark
  6. Rebuild Docker image (--no-cache)

Select (1-6): 1

=== Select model to install ===
  1. Qwen3.5-122B-A10B Hybrid  (~51 tok/s)  ✓ already installed
  2. Qwen3.5-35B-A3B Hybrid    (~112 tok/s | best speed, INT4+FP8 merged)
  3. Qwen3.5-35B-A3B FP8+MTP   (better quality, no INT4, ~35GB download)
  4. Custom model              (enter INT4 AutoRound + FP8 repo)

Select (1-4): 3

═══ Install FP8 native + MTP: Qwen3.5-35B-A3B ═══
[✓] FP8 source : Qwen/Qwen3.5-35B-A3B-FP8
[✓] Downloading FP8 model to local dir...
```
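If anyone wants to adapt the menu for their own box, the skeleton is just a here-doc plus a `case` dispatcher. This is a generic sketch, not the actual vllm.sh; the action bodies are echo stubs where the real script would clone, build, and launch.

```shell
#!/usr/bin/env bash
# Generic sketch of an interactive manager menu -- action bodies are stubs.

show_menu() {
  cat <<'EOF'
=== vLLM Manager for ASUS GX10 ===
  1. First-time setup (clone repo + build Docker + download model)
  2. Select model and start server
  3. Stop server
  4. View logs
  5. Run benchmark
  6. Rebuild Docker image (--no-cache)
EOF
}

# Map a menu selection to an action; real script would do the work here.
dispatch() {
  case "$1" in
    1) echo "running first-time setup" ;;
    2) echo "starting server" ;;
    3) echo "stopping server" ;;
    4) echo "tailing logs" ;;
    5) echo "running benchmark" ;;
    6) echo "rebuilding image with --no-cache" ;;
    *) echo "invalid selection: $1" >&2; return 1 ;;
  esac
}

# Only prompt when attached to a terminal, so the functions stay scriptable.
if [ -t 0 ]; then
  show_menu
  read -rp "Select (1-6): " choice
  dispatch "$choice"
fi
```

Keeping `dispatch` as a plain function also makes the script easy to drive non-interactively (e.g. `dispatch 5` from cron for a nightly benchmark).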

I’ll have this version pushed to the repo by the time @joshua.dale.warner starts the new thread. That way everyone can pick their own ‘reality’: max throughput (Hybrid) or max quality (Native FP8), with a healthy MTP boost.