I feel honored and ashamed at the same time :) I vibe-coded it to monitor my throughput and acceptance percentages, to evaluate what’s working best for me. I can 100% make it available on GitHub. Give me a few mins and I’ll do it :)
BTW, thank you!! A couple of your posts helped me with the tool-calling fix on this 122B model :) Sadly I couldn’t find the perfect setup that works 100% for both Claude Code and OpenCode, but I’m 90% there with your modifications and suggestions!
Thanks for sharing!
A note for @azampatti re: “OWUI takes a life to respond”
On my config, OWUI is perfectly usable: 55 tok/s sustained across two parallel chats, even with thinking enabled by default. The `fp8 KV` + `util 0.90` combo may be what removes the bottleneck. Also, disabling OWUI’s background Title/Tags/Follow-up generation (Settings → Interface) cuts three hidden LLM calls per message.
Happy ASUS Ascent GX10 owner here (2 weeks in). :)
Sharing my setup, based on @Albond’s v2 pipeline, in case it’s useful for others.
Image: custom build of vLLM `0.19.1.dev0+g2a69949bd` (built Apr 16 from the rmstxrx/vllm-hybrid-quant base, 18.3 GB)
Launch:
```shell
docker run -d --name vllm-qwen35 \
  --gpus all --net=host --ipc=host --privileged \
  -v /home/gx10/models:/models \
  vllm-qwen35-v2 \
  serve /models/qwen35-122b-hybrid-int4fp8 \
  --served-model-name qwen --port 8000 --host 0.0.0.0 \
  --max-model-len 262144 \
  --gpu-memory-utilization 0.90 \
  --load-format fastsafetensors \
  --attention-backend FLASHINFER \
  --reasoning-parser qwen3 \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_xml \
  --enable-prefix-caching \
  --enable-chunked-prefill \
  --max-num-batched-tokens 32768 \
  --kv-cache-dtype fp8 \
  --generation-config vllm \
  --speculative-config '{"method":"mtp","num_speculative_tokens":2}'
```
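Once the container is up, a quick way to sanity-check the endpoint is a minimal OpenAI-compatible chat request. A sketch (the helper names here are mine; only the URL, port, and `qwen` model name come from the flags above):

```python
import json
import urllib.request

def build_chat_request(prompt: str) -> dict:
    # "qwen" matches --served-model-name in the launch command
    return {
        "model": "qwen",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 64,
    }

def send_chat(payload: dict) -> bytes:
    # Port 8000 matches --port in the launch command
    req = urllib.request.Request(
        "http://localhost:8000/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.read()

payload = build_chat_request("Say hello in five words.")
# send_chat(payload)  # uncomment with the server running
```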
Key difference from others here: `--kv-cache-dtype fp8` + `--gpu-memory-utilization 0.90`
→ KV cache of 620,928 tokens, 7.84x concurrency at full 262K context. Tested over
550+ requests, no observable quality loss.
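For intuition on why fp8 KV helps: it halves the per-token KV footprint versus fp16, so the same memory budget caches roughly twice the tokens. A back-of-the-envelope sketch (the layer/head counts and the 60 GiB budget are illustrative placeholders, not the real 122B config):

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int,
                       head_dim: int, dtype_bytes: int) -> int:
    # K and V each store num_kv_heads * head_dim values per layer
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes

budget = 60 * 1024**3  # hypothetical bytes left for KV after weights

fp16_tokens = budget // kv_bytes_per_token(48, 8, 128, dtype_bytes=2)
fp8_tokens = budget // kv_bytes_per_token(48, 8, 128, dtype_bytes=1)

print(fp16_tokens, fp8_tokens)  # fp8 fits ~2x the tokens in the same budget
```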
Thermal note (important on the ASUS Ascent): I had to cap the GPU clock with
`sudo nvidia-smi -lgc 0,2200` because under sustained load the CPU thermal zones were
crossing 92°C (on GB10 the SoC shares its thermal budget between GPU and CPU).
After the cap: max 82°C under load, no throttling, and minimal throughput
impact (inference is memory-bandwidth bound, not clock-bound).
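The `-lgc` lock does not survive a reboot; one option is a one-shot systemd unit that reapplies it at boot. A sketch (the unit name and the 2200 MHz ceiling are my choices — adjust for your own thermals):

```
# /etc/systemd/system/gpu-clock-cap.service  (hypothetical unit name)
[Unit]
Description=Cap GB10 GPU clocks to keep shared SoC thermals in check
After=multi-user.target

[Service]
Type=oneshot
ExecStart=/usr/bin/nvidia-smi -lgc 0,2200

[Install]
WantedBy=multi-user.target
```

Enable it with `sudo systemctl enable --now gpu-clock-cap.service`.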
Results:
| Scenario | Throughput | MTP acceptance |
|---|---|---|
| Single chat | 45-62 tok/s | 86-100% |
| 2 parallel in OWUI (thinking on) | 49-55 tok/s sustained | 68-75% |
| 2 parallel in OWUI (thinking off) | 34-41 tok/s | 53-62% |
| Structured JSON (vault mining, 6 parallel) | 51 tok/s combined | 89-92% |
| Prefill peak on long context | 12,633 tok/s | – |
Model load time 62s (fastsafetensors). Uptime 6+ days, 550+ successful requests,
zero errors.
Thanks @Albond for the v2 pipeline - built everything on top of it.