Qwen/Qwen3.5-122B-A10B - Alibaba/Qwen thought about us... :-D

They are tuned for the model with TP2, run on the cluster to optimize the distribution of experts. The script ships in the official NVIDIA vLLM images, and sparkrun tune already includes it, although I created my own script: swap-laboratories/moe-configs at main · vedcsolution/swap-laboratories · GitHub. The repo is a bit neglected, but it has the MoE files. Currently, if I launch from sparkrun it crashes at large context sizes; it only works well from ./launchscript. @eugr, I don't know if it's the ConnectX-7.

Thanks! I know something is wrong with my setup; the question is what exactly. As far as I recall, it all started after a series of firmware updates, right around the time NVIDIA rolled out kernel v6.17. And I don't believe it's a problem with just my 2-Spark cluster; this slowdown happens even on a single Spark :(

1 Like

Out of curiosity, are you “tuning” the model in the opencode config(s) or just using the defaults?

sparkrun tuning is pretty volatile/experimental, but I’m hoping to get it to a good place soon. Can you open a GitHub issue or DM me about the problems with launching from sparkrun? I'm trying to tackle all of that so it becomes a reliable solution.

1 Like

I’ll test with my builds, thanks!

1 Like

Spec decoding is often tuned for certain workloads and smaller multi-turn conversations. I have a separate branch of llama-benchy that tries to emulate multi-turn conversation, but so far it has shown no significant difference.
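For a sense of why spec decoding is so workload-sensitive, the standard speculative-decoding cost model (per-token draft acceptance rate, draft length, and draft-to-target cost ratio; all numbers below are illustrative assumptions, not Spark measurements) predicts a net slowdown once acceptance drops low enough:

```python
def expected_speedup(a: float, gamma: int, c: float) -> float:
    """Expected speedup under the standard speculative-decoding cost model.

    a:     per-token acceptance rate of drafted tokens (assumed, not measured)
    gamma: number of tokens drafted per verification step
    c:     cost of one draft step relative to one target forward pass
    """
    # Expected tokens produced (accepted drafts plus the bonus token) per step.
    tokens_per_step = (1 - a ** (gamma + 1)) / (1 - a)
    # Relative cost: gamma draft steps plus one target verification pass.
    cost_per_step = gamma * c + 1
    return tokens_per_step / cost_per_step

# High acceptance with a cheap draft head: drafting pays off.
print(round(expected_speedup(0.8, 2, 0.1), 2))   # ~2.03x
# Low acceptance (e.g. a workload the MTP head was not tuned for) with a
# relatively expensive draft step: spec decoding becomes a net loss.
print(expected_speedup(0.3, 2, 0.5) < 1.0)
```

This matches the symptom in this thread: the same flags that speed one workload up can halve throughput on another if the acceptance rate collapses.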

As for different clients, MTP generation is very uneven; I feel it averages out to what llama-benchy measures in the end. I’ll come back to it once I get through my current backlog, as I really want good MTP support in benchmarking.

1 Like

Do you mind posting your llama-benchy launch params too?

uv run llama-benchy \
--base-url http://localhost:8000/v1 \
--model Qwen/Qwen3.5-122B-A10B-FP8 \
--served-model-name Qwen3.5-122B-A10B-FP8 \
--pp 512 2048 8192 \
--tg 32 128 \
--depth 0 \
--runs 3 \
--latency-mode generation \
--concurrency 1 \
--save-result "results.md"
1 Like

Awesome, I’d love to see MTP support.

I have been trying to accelerate inference with speculative decoding for a few hours now but no luck. I have tried various launch parameter combinations, rebuilt the vllm-node image, tried applying MTP to different Qwen3.5 model sizes, and running the models with and without tensor parallel across one and two nodes.

So far, speculative decoding consistently reduces the token generation rate by ~50%. I have confirmed that this slowdown is not measurement noise in llama-benchy (by collecting timestamps and counting generated tokens with a man-in-the-middle proxy server); it is real and measurable.
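The timestamp-and-count check can be reduced to a small helper. This is a hedged sketch, not the actual proxy: in practice the (time, cumulative tokens) events would come from timestamping each SSE chunk of a streamed /v1/chat/completions response, and the synthetic event list below is made up for illustration.

```python
def generation_rate(events):
    """Tokens/sec over the streamed generation, excluding time-to-first-token.

    events: list of (wall_clock_seconds, cumulative_completion_tokens),
            one entry per streamed chunk, in arrival order.
    """
    if len(events) < 2:
        raise ValueError("need at least two chunks to measure a rate")
    (t0, n0), (t1, n1) = events[0], events[-1]
    return (n1 - n0) / (t1 - t0)

# Synthetic demo: 128 tokens arriving one per chunk at a steady 25 ms/token
# (made-up numbers; real events come from timestamping proxied SSE chunks).
events = [(0.025 * i, i) for i in range(129)]
print(round(generation_rate(events), 1))  # 40.0 tok/s
```

Measuring from the first chunk onward keeps TTFT out of the rate, which is what makes this comparable to the benchmark's tg numbers.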

My English is bad; I use a translator. sparkrun crashes at high context sizes; the head node shuts down. The only setup I’ve managed to keep stable is spark-vllm-docker. It must be something in the networking. I haven’t tried the tuning in sparkrun; I see you have it marked experimental. If the tuned MoE configs could be shared on Spark Arena, we could use other people’s.

Recompiled, but I'm not seeing any improvements. I am fairly convinced most of my problems are ConnectX-driven, but what I don't get is why I am the only one with such bad llama-benchy results.

I guess I will need to wait for NVIDIA's next big update and keep my fingers crossed that it fixes most (all?) of my problems.

Well, if you try it again, feel free to dump the logs into a GitHub issue and I’ll take a look. Theoretically it should all behave much the same given the same model, container image, etc. Luckily LLMs are pretty good at translating these days, so feel free to comment in your native language in the GitHub issue if that’s more comfortable. I do want to reach greater stability so we can all focus on more important things than orchestration!

1 Like

This is really strange. I’ll look into it again once I finish with deploying the new build pipeline.

2 Likes

Hmm… Just tested with the fresh build:

uvx llama-benchy --base-url http://spark:8888/v1 \
 --model Qwen/Qwen3.5-122B-A10B-FP8 \
 --depth 0 4096 8192 16384 32768 \
  --enable-prefix-caching
| model | test | t/s | peak t/s | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |
|---|---|---|---|---|---|---|
| Qwen/Qwen3.5-122B-A10B-FP8 | pp2048 | 2273.05 ± 759.96 | | 1044.39 ± 409.76 | 1035.68 ± 409.76 | 1044.50 ± 409.77 |
| Qwen/Qwen3.5-122B-A10B-FP8 | tg32 | 31.03 ± 0.55 | 31.62 ± 0.87 | | | |
| Qwen/Qwen3.5-122B-A10B-FP8 | ctx_pp @ d4096 | 2931.30 ± 185.01 | | 1412.37 ± 92.90 | 1403.65 ± 92.90 | 1412.48 ± 92.87 |
| Qwen/Qwen3.5-122B-A10B-FP8 | ctx_tg @ d4096 | 30.78 ± 0.10 | 31.00 ± 0.00 | | | |
| Qwen/Qwen3.5-122B-A10B-FP8 | pp2048 @ d4096 | 1471.69 ± 44.71 | | 1401.62 ± 43.18 | 1392.91 ± 43.18 | 1401.72 ± 43.19 |
| Qwen/Qwen3.5-122B-A10B-FP8 | tg32 @ d4096 | 30.58 ± 0.15 | 31.00 ± 0.00 | | | |
| Qwen/Qwen3.5-122B-A10B-FP8 | ctx_pp @ d8192 | 3127.19 ± 467.15 | | 2695.77 ± 448.70 | 2687.06 ± 448.70 | 2695.87 ± 448.74 |
| Qwen/Qwen3.5-122B-A10B-FP8 | ctx_tg @ d8192 | 30.55 ± 0.24 | 31.00 ± 0.00 | | | |
| Qwen/Qwen3.5-122B-A10B-FP8 | pp2048 @ d8192 | 1460.30 ± 5.18 | | 1411.18 ± 4.97 | 1402.47 ± 4.97 | 1411.27 ± 4.94 |
| Qwen/Qwen3.5-122B-A10B-FP8 | tg32 @ d8192 | 30.36 ± 0.18 | 31.00 ± 0.00 | | | |
| Qwen/Qwen3.5-122B-A10B-FP8 | ctx_pp @ d16384 | 3317.54 ± 4.61 | | 4947.53 ± 6.86 | 4938.82 ± 6.86 | 4947.67 ± 6.86 |
| Qwen/Qwen3.5-122B-A10B-FP8 | ctx_tg @ d16384 | 30.50 ± 0.30 | 31.00 ± 0.00 | | | |
| Qwen/Qwen3.5-122B-A10B-FP8 | pp2048 @ d16384 | 1166.53 ± 241.38 | | 1855.62 ± 441.90 | 1846.90 ± 441.90 | 1855.71 ± 441.90 |
| Qwen/Qwen3.5-122B-A10B-FP8 | tg32 @ d16384 | 28.83 ± 2.05 | 29.67 ± 1.89 | | | |
| Qwen/Qwen3.5-122B-A10B-FP8 | ctx_pp @ d32768 | 3212.37 ± 6.57 | | 10209.63 ± 20.83 | 10200.91 ± 20.83 | 10209.71 ± 20.84 |
| Qwen/Qwen3.5-122B-A10B-FP8 | ctx_tg @ d32768 | 29.37 ± 0.15 | 30.00 ± 0.00 | | | |
| Qwen/Qwen3.5-122B-A10B-FP8 | pp2048 @ d32768 | 1304.36 ± 22.37 | | 1579.31 ± 27.23 | 1570.59 ± 27.23 | 1579.42 ± 27.26 |
| Qwen/Qwen3.5-122B-A10B-FP8 | tg32 @ d32768 | 29.70 ± 0.25 | 30.33 ± 0.47 | | | |

llama-benchy (0.3.4)
date: 2026-03-04 16:28:44 | latency mode: api

Launch parameters:

./launch-cluster.sh -t vllm-node-nightly-20260304 --non-privileged \
        exec vllm serve Qwen/Qwen3.5-122B-A10B-FP8 \
        --max-model-len 262144 \
        --gpu-memory-utilization 0.7 \
        --port 8888 --host 0.0.0.0 \
        --load-format fastsafetensors \
        --enable-prefix-caching \
        --enable-auto-tool-choice \
        --tool-call-parser qwen3_coder \
        --reasoning-parser qwen3 \
        -tp 2 --distributed-executor-backend ray \
        --max-num-batched-tokens 8192

You can get the same vLLM build by pulling the latest from the repo and running:

./build-and-copy.sh -t vllm-node-nightly-20260304

You can omit -t <image-name> in both launch and build commands to use default vllm-node.

Do containers built with spark-vllm-node no longer need a patch to run the Qwen 3.5 model?

Thank you! I will start with a completely fresh build and test to see if it makes any difference, but I am a bit skeptical at this point.

One piece of good news, though. After some digging, I was able to downgrade the SoC firmware. The firmware update bumped the SoC from 0x02009009 to 0x02009418. The new version causes NCCL all_gather to drop from ~21 GB/s to ~17 GB/s busbw, and causes severe TCP instability with wild retransmit storms across the ConnectX-7 link.

Steps for those suffering from this same problem. Keep in mind it takes about 10 minutes for the downgrade to finish!

# Check current version (should show 0x02009418 if affected)
fwupdmgr get-devices | grep -A5 "Integrated Baseband"

# Downgrade to known-good version
fwupdmgr downgrade

# Reboot and verify
fwupdmgr get-devices | grep -A5 "Integrated Baseband"
# Should show: Current version: 0x02009009

Results after downgrading to FW 0x02009009:

# nccl-tests version 2.17.9 nccl-headers=22803 nccl-library=22803
# Collective test starting: all_gather_perf
# nThread 1 nGpus 1 minBytes 17179869184 maxBytes 17179869184 step: 2(factor) warmup iters: 1 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
#  Rank  0 Group  0 Pid  24305 on     spark1 device  0 [000f:01:00] NVIDIA GB10
#  Rank  1 Group  0 Pid   6535 on     spark2 device  0 [000f:01:00] NVIDIA GB10
#
#                                                              out-of-place                       in-place          
#       size         count      type   redop    root     time   algbw   busbw  #wrong     time   algbw   busbw  #wrong 
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)             (us)  (GB/s)  (GB/s)         
 17179869184    2147483648     float    none      -1   433857   39.60   19.80       0   395553   43.43   21.72       0
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 20.7577 
#
# Collective test concluded: all_gather_perf
#

Results pre-downgrade with latest FW 0x02009418:

# nccl-tests version 2.17.9 nccl-headers=22803 nccl-library=22803
# Collective test starting: all_gather_perf
# nThread 1 nGpus 1 minBytes 17179869184 maxBytes 17179869184 step: 2(factor) warmup iters: 1 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
#  Rank  0 Group  0 Pid   3981 on     spark1 device  0 [000f:01:00] NVIDIA GB10
#  Rank  1 Group  0 Pid   3715 on     spark2 device  0 [000f:01:00] NVIDIA GB10
#
#                                                              out-of-place                       in-place
#       size         count      type   redop    root     time   algbw   busbw  #wrong     time   algbw   busbw  #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)             (us)  (GB/s)  (GB/s)
 17179869184    2147483648     float    none      -1   531329   32.33   16.17       0   534517   32.14   16.07       0
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 16.1187
#
# Collective test concluded: all_gather_perf
#
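As a sanity check on these columns: for all_gather, nccl-tests computes algbw as message size over elapsed time and scales it by (n-1)/n to get busbw (n = 2 ranks here), so the reported numbers can be reproduced directly:

```python
def allgather_bandwidth(size_bytes: int, time_us: float, nranks: int):
    """algbw/busbw as nccl-tests reports them for all_gather (GB = 1e9 bytes)."""
    algbw = size_bytes / (time_us * 1e-6) / 1e9
    # For all_gather the bus bandwidth factor is (n - 1) / n.
    busbw = algbw * (nranks - 1) / nranks
    return round(algbw, 2), round(busbw, 2)

# Out-of-place row after the downgrade: 16 GiB in 433857 us across 2 ranks.
print(allgather_bandwidth(17179869184, 433857, 2))   # (39.6, 19.8)
# Same row on the 0x02009418 firmware: 531329 us.
print(allgather_bandwidth(17179869184, 531329, 2))   # (32.33, 16.17)
```

Put differently, the firmware regression is entirely a ~22% increase in elapsed time for the same transfer.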

You are still getting lower bandwidth than we were getting before the firmware update. Are you using both roceXXX twins in the NCCL test?

Also, what cable do you use?

The patch is only needed for Autoround quant. FP8 version runs without a patch.

I’m pretty sure I used to get around ~30 tok/s with Qwen3.5-122B-A10B-FP8, and ~26 tok/s with Qwen3.5-397B-A17B-int4-AutoRound.

But recently, for various reasons, I factory-restored both of my DGX Spark systems and reinstalled everything. After that, my NCCL test results changed: the Avg bus bandwidth now shows only ~15–16 GB/s, and my model throughput also dropped:

  • Qwen3.5-122B-A10B-FP8: ~17 tok/s
  • Qwen3.5-397B-A17B-int4-AutoRound: ~14 tok/s

Could this drop be caused by the kernel (or kernel-related drivers/modules)?

Also, are there people who still get normal token throughput even when the NCCL Avg bus bandwidth is reduced like this? And does anyone know how to downgrade the kernel to 6.14?

1 Like

Oops, I forgot to include that piece of the puzzle, I guess. Here:

# Set network interface environment variables (use your active interface)
export UCX_NET_DEVICES=enp1s0f1np1
export NCCL_SOCKET_IFNAME=enp1s0f1np1
export OMPI_MCA_btl_tcp_if_include=enp1s0f1np1
export NCCL_IB_HCA=rocep1s0f1,roceP2p1s0f1
export NCCL_IB_DISABLE=0

mpirun -np 2 -H 192.168.177.11:1,192.168.177.12:1 \
  --mca plm_rsh_agent "ssh -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no" \
  -x LD_LIBRARY_PATH=$LD_LIBRARY_PATH \
  $HOME/nccl-tests/build/all_gather_perf -b 16G -e 16G -f 2