I feel honored and ashamed at the same time :) I vibe-coded it to monitor my throughput and acceptance percentages, to evaluate what’s working best for me. I can 100% make it available on GitHub. Give me a few mins and I’ll do it :)
BTW, thank you!! A couple of your posts helped me with the tool-calling fix on this 122B model :) Sadly I couldn’t find the perfect setup that works 100% for both Claude Code and OpenCode, but I’m 90% there with your modifications and suggestions!
Thanks for sharing!
A note for @azampatti re: “OWUI takes a life to respond”
On my config, OWUI is perfectly usable: 55 tok/s sustained across two parallel chats, even with thinking enabled by default. The `fp8 KV` + `util 0.90` combo may be what removes the bottleneck. Also, disabling OWUI’s background Title/Tags/Follow-up generation (Settings → Interface) cuts three hidden LLM calls per message.
Happy ASUS Ascent GX10 owner here (2 weeks in). :)
Sharing my setup, based on @Albond’s v2 pipeline, in case it’s useful for others.
Image: custom build of vLLM `0.19.1.dev0+g2a69949bd` (built Apr 16 from the rmstxrx/vllm-hybrid-quant base, 18.3 GB)
Launch:
```shell
docker run -d --name vllm-qwen35 \
  --gpus all --net=host --ipc=host --privileged \
  -v /home/gx10/models:/models \
  vllm-qwen35-v2 \
  serve /models/qwen35-122b-hybrid-int4fp8 \
  --served-model-name qwen --port 8000 --host 0.0.0.0 \
  --max-model-len 262144 \
  --gpu-memory-utilization 0.90 \
  --load-format fastsafetensors \
  --attention-backend FLASHINFER \
  --reasoning-parser qwen3 \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_xml \
  --enable-prefix-caching \
  --enable-chunked-prefill \
  --max-num-batched-tokens 32768 \
  --kv-cache-dtype fp8 \
  --generation-config vllm \
  --speculative-config '{"method":"mtp","num_speculative_tokens":2}'
```
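Once the container is up, a quick way to sanity-check the endpoint is a minimal OpenAI-compatible chat request. A sketch (the helper names here are mine; only the URL, port, and `qwen` model name come from the flags above):

```python
import json
import urllib.request

def build_chat_request(prompt: str) -> dict:
    # "qwen" matches --served-model-name in the launch command
    return {
        "model": "qwen",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 64,
    }

def send_chat(payload: dict) -> bytes:
    # Port 8000 matches --port in the launch command
    req = urllib.request.Request(
        "http://localhost:8000/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.read()

payload = build_chat_request("Say hello in five words.")
# send_chat(payload)  # uncomment with the server running
```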
Key difference from others here: `--kv-cache-dtype fp8` + `--gpu-memory-utilization 0.90`
→ KV cache of 620,928 tokens, 7.84x concurrency at full 262K context. Tested over
550+ requests, no observable quality loss.
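For intuition on why fp8 KV helps: it halves the per-token KV footprint versus fp16, so the same memory budget caches roughly twice the tokens. A back-of-the-envelope sketch (the layer/head counts and the 60 GiB budget are illustrative placeholders, not the real 122B config):

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int,
                       head_dim: int, dtype_bytes: int) -> int:
    # K and V each store num_kv_heads * head_dim values per layer
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes

budget = 60 * 1024**3  # hypothetical bytes left for KV after weights

fp16_tokens = budget // kv_bytes_per_token(48, 8, 128, dtype_bytes=2)
fp8_tokens = budget // kv_bytes_per_token(48, 8, 128, dtype_bytes=1)

print(fp16_tokens, fp8_tokens)  # fp8 fits ~2x the tokens in the same budget
```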
Thermal note (important on the ASUS Ascent): I had to cap the GPU clock with
`sudo nvidia-smi -lgc 0,2200` because under sustained load the CPU thermal zones were
crossing 92°C (on GB10 the SoC shares its thermal budget between GPU and CPU).
After the cap: max 82°C under load, no throttling, and minimal throughput
impact (inference is memory-bandwidth bound, not clock-bound).
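The `-lgc` lock does not survive a reboot; one option is a one-shot systemd unit that reapplies it at boot. A sketch (the unit name and the 2200 MHz ceiling are my choices — adjust for your own thermals):

```
# /etc/systemd/system/gpu-clock-cap.service  (hypothetical unit name)
[Unit]
Description=Cap GB10 GPU clocks to keep shared SoC thermals in check
After=multi-user.target

[Service]
Type=oneshot
ExecStart=/usr/bin/nvidia-smi -lgc 0,2200

[Install]
WantedBy=multi-user.target
```

Enable it with `sudo systemctl enable --now gpu-clock-cap.service`.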
Results:
| Scenario | Throughput | MTP acceptance |
|---|---|---|
| Single chat | 45-62 tok/s | 86-100% |
| 2 parallel in OWUI (thinking on) | 49-55 tok/s sustained | 68-75% |
| 2 parallel in OWUI (thinking off) | 34-41 tok/s | 53-62% |
| Structured JSON (vault mining, 6 parallel) | 51 tok/s combined | 89-92% |
| Prefill peak on long context | 12,633 tok/s | – |
Model load time 62s (fastsafetensors). Uptime 6+ days, 550+ successful requests,
zero errors.
Thanks @Albond for the v2 pipeline - built everything on top of it.