Qwen3.5-122B-A10B on single Spark: up to 51 tok/s (v2.1 — patches + quick-start + benchmark)

Will you be looking into https://atlasinference.io/#models ? Since it's built for Spark and RTX specifically and is a lot smaller, things like call overhead could be addressed more easily / might not become an issue :)