Spot on, @stefan132. The quality/perplexity hit on INT4 is a valid concern, especially for complex reasoning or coding where every bit of precision counts.
I’m actually finalizing an update to my `vllm.sh` manager script right now to include a ‘Native FP8 + MTP’ option as a first-class citizen. It bypasses the INT4 merging entirely and patches the MTP weights directly onto the official Qwen FP8 repo.
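Conceptually the patching step is just an overlay: pull the official FP8 checkpoint to a local directory, then drop the MTP shards in alongside it (and merge their entries into the weight index). Here’s a rough sketch of that overlay logic; the function name, directory layout, and `mtp_*.safetensors` file pattern are my assumptions for illustration, not necessarily what the script actually does:

```shell
#!/usr/bin/env bash
# Hypothetical sketch: overlay MTP weights onto a locally downloaded FP8 model.
# File patterns and names are assumptions, not the real script's layout.
set -euo pipefail

patch_mtp() {
  local fp8_dir="$1" mtp_dir="$2"
  # Copy the MTP head shards next to the FP8 checkpoint shards.
  cp "$mtp_dir"/mtp_*.safetensors "$fp8_dir"/
  # A real implementation would also merge the new tensors' weight-map
  # entries into model.safetensors.index.json so vLLM can locate them.
  echo "patched: $(ls "$fp8_dir"/mtp_*.safetensors | wc -l) MTP shard(s)"
}
```

The nice part of doing it as an overlay is that the FP8 base weights stay byte-identical to the official repo, so only the small MTP head needs to be re-downloaded if the upstream checkpoint updates.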
Here’s a sneak peek of the new interactive menu I’m testing on my GX10. Option 3 is exactly what you’re looking for:
```text
=== vLLM Manager for ASUS GX10 ===
1. First-time setup (clone repo + build Docker + download model)
2. Select model and start server
3. Stop server
4. View logs
5. Run benchmark
6. Rebuild Docker image (--no-cache)
Select (1-6): 1
=== Select model to install ===
1. Qwen3.5-122B-A10B Hybrid (~51 tok/s) ✓ already installed
2. Qwen3.5-35B-A3B Hybrid (~112 tok/s | best speed, INT4+FP8 merged)
3. Qwen3.5-35B-A3B FP8+MTP (better quality, no INT4, ~35GB download)
4. Custom model (enter INT4 AutoRound + FP8 repo)
Select (1-4): 3
═══ Install FP8 native + MTP: Qwen3.5-35B-A3B ═══
[✓] FP8 source : Qwen/Qwen3.5-35B-A3B-FP8
[✓] Downloading FP8 model to local dir...
```
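For anyone wanting to build something similar, the menu above is just a read-and-dispatch loop. A minimal sketch of the dispatch half, with placeholder actions standing in for the script’s real handlers:

```shell
#!/usr/bin/env bash
# Minimal sketch of the menu dispatch; the echoed actions are placeholders
# for the real handler functions in vllm.sh.
set -euo pipefail

dispatch() {
  case "$1" in
    1) echo "first-time setup" ;;
    2) echo "start server" ;;
    3) echo "stop server" ;;
    4) echo "view logs" ;;
    5) echo "run benchmark" ;;
    6) echo "rebuild image" ;;
    *) echo "invalid choice" >&2; return 1 ;;
  esac
}

# Interactive use would wrap it in a loop, e.g.:
# while read -rp "Select (1-6): " c; do dispatch "$c"; done
```

Keeping dispatch separate from the `read` loop makes each option easy to test (and to call non-interactively from CI).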
I’ll have this version pushed to the repo by the time @joshua.dale.warner starts the new thread. This way, everyone can choose their own ‘reality’—either max throughput (Hybrid) or max quality (Native FP8) with a healthy MTP boost.