Spot on, @stefan132. The quality/perplexity hit on INT4 is a valid concern, especially for complex reasoning or coding where every bit of precision counts.
I’m actually finalizing an update to my `vllm.sh` manager script right now to include a ‘Native FP8 + MTP’ option as a first-class citizen. It bypasses the INT4 merging entirely and patches the MTP weights directly onto the official Qwen FP8 repo.
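Conceptually the patching step is just an overlay: pull the official FP8 checkpoint to a local directory, then drop the MTP shards in alongside it (and merge their entries into the weight index). Here’s a rough sketch of that overlay logic; the function name, directory layout, and `mtp_*.safetensors` file pattern are my assumptions for illustration, not necessarily what the script actually does:

```shell
#!/usr/bin/env bash
# Hypothetical sketch: overlay MTP weights onto a locally downloaded FP8 model.
# File patterns and names are assumptions, not the real script's layout.
set -euo pipefail

patch_mtp() {
  local fp8_dir="$1" mtp_dir="$2"
  # Copy the MTP head shards next to the FP8 checkpoint shards.
  cp "$mtp_dir"/mtp_*.safetensors "$fp8_dir"/
  # A real implementation would also merge the new tensors' weight-map
  # entries into model.safetensors.index.json so vLLM can locate them.
  echo "patched: $(ls "$fp8_dir"/mtp_*.safetensors | wc -l) MTP shard(s)"
}
```

The nice part of doing it as an overlay is that the FP8 base weights stay byte-identical to the official repo, so only the small MTP head needs to be re-downloaded if the upstream checkpoint updates.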
Here’s a sneak peek of the new interactive menu I’m testing on my GX10. Option 3 is exactly what you’re looking for:
```text
=== vLLM Manager for ASUS GX10 ===
1. First-time setup (clone repo + build Docker + download model)
2. Select model and start server
3. Stop server
4. View logs
5. Run benchmark
6. Rebuild Docker image (--no-cache)
Select (1-6): 1
=== Select model to install ===
1. Qwen3.5-122B-A10B Hybrid (~51 tok/s) ✓ already installed
2. Qwen3.5-35B-A3B Hybrid (~112 tok/s | best speed, INT4+FP8 merged)
3. Qwen3.5-35B-A3B FP8+MTP (better quality, no INT4, ~35GB download)
4. Custom model (enter INT4 AutoRound + FP8 repo)
Select (1-4): 3
═══ Install FP8 native + MTP: Qwen3.5-35B-A3B ═══
[✓] FP8 source : Qwen/Qwen3.5-35B-A3B-FP8
[✓] Downloading FP8 model to local dir...
```
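For anyone wanting to build something similar, the menu above is just a read-and-dispatch loop. A minimal sketch of the dispatch half, with placeholder actions standing in for the script’s real handlers:

```shell
#!/usr/bin/env bash
# Minimal sketch of the menu dispatch; the echoed actions are placeholders
# for the real handler functions in vllm.sh.
set -euo pipefail

dispatch() {
  case "$1" in
    1) echo "first-time setup" ;;
    2) echo "start server" ;;
    3) echo "stop server" ;;
    4) echo "view logs" ;;
    5) echo "run benchmark" ;;
    6) echo "rebuild image" ;;
    *) echo "invalid choice" >&2; return 1 ;;
  esac
}

# Interactive use would wrap it in a loop, e.g.:
# while read -rp "Select (1-6): " c; do dispatch "$c"; done
```

Keeping dispatch separate from the `read` loop makes each option easy to test (and to call non-interactively from CI).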
I’ll have this version pushed to the repo by the time @joshua.dale.warner starts the new thread. This way, everyone can choose their own ‘reality’—either max throughput (Hybrid) or max quality (Native FP8) with a healthy MTP boost.