Qwen/Qwen3.5-122B-A10B - Alibaba/Qwen thought about us... :-D

try to power off the system and fully unplug the power cable/adapter, wait about 5 mins, and run again.

It affects “faster” models more than more dense ones. Also, the problem seems to be in the firmware, not the kernel version. NVIDIA is aware, they found an issue and working on the fix which should be released shortly.

Personally, for most people I would recommend to wait rather than trying to downgrade the firmware and kernel versions.

OK, that looks normal. What cable are you using for interconnect?

I remember back in November when we were just trying to get it working in the cluster, there were some people with cables that didn’t work properly.

However, judging by your benchmarking scores from a few hours ago, it looks like 30 tok/s is being output normally even at reduced bandwidth. Doesn’t that suggest my issue might be?

Oh, actually look at @cho answer above.
There seems to be a separate power delivery issue where GPU doesn’t want to go above certain power level.

The fix seems to be to power everything off, disconnect the power brick from the wall and Spark, wait a few minutes, reconnect back and it will start working normally again. It’s important to disconnect the power brick from the wall, not just disconnect power cable from Spark, so it resets the power brick’s internal USB-C controller.

holy, what kind of magic is this? The latency has even decreased more than usual. Huge thanks to ‘cho’ and ‘eugr’, amazing.

As a developer myself, I work with code every day, but I’m genuinely amazed at how one even discovers solutions like this.

I’m unable to pinpoint exactly as I’ve both updated the eugr image, rebuilt vllm and flash-infer and tf5 and replaced the chat template with unsloth’s fixed version over the last few days, but today I decided to turn thinking back on for 122b-A10B int4 autoround and it no longer stops out of the blue for no apparent reason in Claude Code! It also definitely seems more intelligent with thinking turned on, it’s been struggling with a project of mine over the last few days but making good progress today.

I have 2 different cables. 1 (currently connected) directly purchased from Nvidia, and1 (spare) from www.naddod.com. Results are the same with both.

I just noticed your reply to Cho’s post about power source reset and I am about to test that out, which makes perfect sense to me now as I have noticed both of my sparks are not getting as hot as they used to do
 Will report soon my results, but quite sure it will be much better then what I get now.

Oh WOW!!! Holly guacamole. So the power reset fixes tons of issues. HUGE thanks to @cho and then @eugr for suggesting this USB/power reset!

  1. SCP speeds between my two sparks via the ConnectX interfaces is back to 700+MB/s
  2. Overall system responsiveness doubled
  3. Look at these latest llama-bency results below:
| model                      |            test |             t/s |     peak t/s |      ttfr (ms) |   est_ppt (ms) |   e2e_ttft (ms) |
|:---------------------------|----------------:|----------------:|-------------:|---------------:|---------------:|----------------:|
| Qwen/Qwen3.5-122B-A10B-FP8 |          pp2048 | 3199.61 ± 47.09 |              |  641.82 ± 9.57 |  640.64 ± 9.57 |   641.88 ± 9.55 |
| Qwen/Qwen3.5-122B-A10B-FP8 |            tg32 |    32.53 ± 0.05 | 33.58 ± 0.05 |                |                |                 |
| Qwen/Qwen3.5-122B-A10B-FP8 |  ctx_pp @ d4096 |  3205.98 ± 5.79 |              | 1279.12 ± 2.31 | 1277.93 ± 2.31 |  1279.20 ± 2.32 |
| Qwen/Qwen3.5-122B-A10B-FP8 |  ctx_tg @ d4096 |    32.51 ± 0.10 | 33.56 ± 0.11 |                |                |                 |
| Qwen/Qwen3.5-122B-A10B-FP8 |  pp2048 @ d4096 |  1580.85 ± 4.04 |              | 1296.70 ± 3.30 | 1295.52 ± 3.30 |  1296.77 ± 3.28 |
| Qwen/Qwen3.5-122B-A10B-FP8 |    tg32 @ d4096 |    32.35 ± 0.07 | 33.40 ± 0.07 |                |                |                 |
| Qwen/Qwen3.5-122B-A10B-FP8 |  ctx_pp @ d8192 |  3687.03 ± 2.19 |              | 2223.21 ± 1.44 | 2222.03 ± 1.44 |  2223.30 ± 1.45 |
| Qwen/Qwen3.5-122B-A10B-FP8 |  ctx_tg @ d8192 |    32.20 ± 0.03 | 33.24 ± 0.03 |                |                |                 |
| Qwen/Qwen3.5-122B-A10B-FP8 |  pp2048 @ d8192 |  1533.29 ± 2.15 |              | 1336.88 ± 1.87 | 1335.69 ± 1.87 |  1336.96 ± 1.89 |
| Qwen/Qwen3.5-122B-A10B-FP8 |    tg32 @ d8192 |    32.05 ± 0.06 | 33.08 ± 0.06 |                |                |                 |
| Qwen/Qwen3.5-122B-A10B-FP8 | ctx_pp @ d16384 |  3549.36 ± 4.08 |              | 4617.42 ± 5.43 | 4616.23 ± 5.43 |  4617.49 ± 5.42 |
| Qwen/Qwen3.5-122B-A10B-FP8 | ctx_tg @ d16384 |    31.71 ± 0.09 | 32.73 ± 0.09 |                |                |                 |
| Qwen/Qwen3.5-122B-A10B-FP8 | pp2048 @ d16384 |  1472.10 ± 6.10 |              | 1392.42 ± 5.76 | 1391.23 ± 5.76 |  1392.54 ± 5.74 |
| Qwen/Qwen3.5-122B-A10B-FP8 |   tg32 @ d16384 |    31.81 ± 0.04 | 32.84 ± 0.04 |                |                |                 |
| Qwen/Qwen3.5-122B-A10B-FP8 | ctx_pp @ d32768 |  3411.20 ± 3.03 |              | 9607.58 ± 8.39 | 9606.39 ± 8.39 |  9607.67 ± 8.42 |
| Qwen/Qwen3.5-122B-A10B-FP8 | ctx_tg @ d32768 |    31.12 ± 0.04 | 32.12 ± 0.04 |                |                |                 |
| Qwen/Qwen3.5-122B-A10B-FP8 | pp2048 @ d32768 |  1377.48 ± 1.45 |              | 1487.97 ± 1.56 | 1486.78 ± 1.56 |  1488.03 ± 1.57 |
| Qwen/Qwen3.5-122B-A10B-FP8 |   tg32 @ d32768 |    31.03 ± 0.12 | 31.75 ± 0.53 |                |                |                 |

llama-benchy (0.3.4)
date: 2026-03-05 15:43:16 | latency mode: api
  1. NCCL speed increased (a bit)
# nccl-tests version 2.17.9 nccl-headers=22803 nccl-library=22803
# Collective test starting: all_gather_perf
# nThread 1 nGpus 1 minBytes 17179869184 maxBytes 17179869184 step: 2(factor) warmup iters: 1 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
#  Rank  0 Group  0 Pid  10912 on     spark1 device  0 [000f:01:00] NVIDIA GB10
#  Rank  1 Group  0 Pid   6529 on     spark2 device  0 [000f:01:00] NVIDIA GB10
#
#                                                              out-of-place                       in-place
#       size         count      type   redop    root     time   algbw   busbw  #wrong     time   algbw   busbw  #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)             (us)  (GB/s)  (GB/s)
 17179869184    2147483648     float    none      -1   370301   46.39   23.20       0   355331   48.35   24.17       0
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 23.6858
#
# Collective test concluded: all_gather_perf
#

And here is where it gets really interesting


  1. Look at the llama-bech gpt-oss-120b-mxfp4 test results pre and post the USB/power reset

Pre:

(base) kosta@spark1:~/working/llama.cpp$ build/bin/llama-bench \
>   -m /home/kosta/.cache/huggingface/hub/models--ggml-org--gpt-oss-120b-GGUF/snapshots/d932fcea62f83e088d8f076a2cd2d7eb02dfa682/gpt-oss-120b-mxfp4-00001-of-00003.gguf \
>   -fa 1 \
>   -d 0,4096,8192,16384,32768 \
>   -p 2048 \
>   -n 32 \
>   -ub 2048 \
>   -mmp 0 \
>   -o md 2>&1 | tee /tmp/mxfp4_bench_results.log
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GB10, compute capability 12.1, VMM: yes
| model                          |       size |     params | backend    | ngl | n_ubatch | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -------: | -: | --------------: | -------------------: |
| gpt-oss 120B MXFP4 MoE         |  59.02 GiB |   116.83 B | CUDA       |  99 |     2048 |  1 |          pp2048 |        831.35 ± 1.54 |
| gpt-oss 120B MXFP4 MoE         |  59.02 GiB |   116.83 B | CUDA       |  99 |     2048 |  1 |            tg32 |         31.49 ± 0.12 |
| gpt-oss 120B MXFP4 MoE         |  59.02 GiB |   116.83 B | CUDA       |  99 |     2048 |  1 |  pp2048 @ d4096 |        783.61 ± 0.88 |
| gpt-oss 120B MXFP4 MoE         |  59.02 GiB |   116.83 B | CUDA       |  99 |     2048 |  1 |    tg32 @ d4096 |         28.20 ± 0.12 |
| gpt-oss 120B MXFP4 MoE         |  59.02 GiB |   116.83 B | CUDA       |  99 |     2048 |  1 |  pp2048 @ d8192 |        739.70 ± 2.44 |
| gpt-oss 120B MXFP4 MoE         |  59.02 GiB |   116.83 B | CUDA       |  99 |     2048 |  1 |    tg32 @ d8192 |         28.44 ± 0.23 |
| gpt-oss 120B MXFP4 MoE         |  59.02 GiB |   116.83 B | CUDA       |  99 |     2048 |  1 | pp2048 @ d16384 |        659.57 ± 1.78 |
| gpt-oss 120B MXFP4 MoE         |  59.02 GiB |   116.83 B | CUDA       |  99 |     2048 |  1 |   tg32 @ d16384 |         27.43 ± 0.12 |
| gpt-oss 120B MXFP4 MoE         |  59.02 GiB |   116.83 B | CUDA       |  99 |     2048 |  1 | pp2048 @ d32768 |        547.69 ± 1.52 |
| gpt-oss 120B MXFP4 MoE         |  59.02 GiB |   116.83 B | CUDA       |  99 |     2048 |  1 |   tg32 @ d32768 |         24.44 ± 0.08 |

build: 24d2ee052 (8201)

Post;

(base) kosta@spark1:~/working/llama.cpp$ build/bin/llama-bench   -m /home/kosta/.cache/huggingface/hub/models--ggml-org--gpt-oss-120b-GGUF/snapshots/d932fcea62f83e088d8f076a2cd2d7eb02dfa682/gpt-oss-120b-mxfp4-00001-of-00003.gguf   -fa 1   -d 0,4096,8192,16384,32768   -p 2048   -n 32   -ub 2048   -mmp 0   -o md 2>&1 | tee /tmp/mxfp4_bench_results.log
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GB10, compute capability 12.1, VMM: yes
| model                          |       size |     params | backend    | ngl | n_ubatch | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -------: | -: | --------------: | -------------------: |
| gpt-oss 120B MXFP4 MoE         |  59.02 GiB |   116.83 B | CUDA       |  99 |     2048 |  1 |          pp2048 |      2456.16 ± 10.28 |
| gpt-oss 120B MXFP4 MoE         |  59.02 GiB |   116.83 B | CUDA       |  99 |     2048 |  1 |            tg32 |         58.24 ± 0.43 |
| gpt-oss 120B MXFP4 MoE         |  59.02 GiB |   116.83 B | CUDA       |  99 |     2048 |  1 |  pp2048 @ d4096 |       2333.90 ± 6.05 |
| gpt-oss 120B MXFP4 MoE         |  59.02 GiB |   116.83 B | CUDA       |  99 |     2048 |  1 |    tg32 @ d4096 |         55.71 ± 0.75 |
| gpt-oss 120B MXFP4 MoE         |  59.02 GiB |   116.83 B | CUDA       |  99 |     2048 |  1 |  pp2048 @ d8192 |       2242.95 ± 7.48 |
| gpt-oss 120B MXFP4 MoE         |  59.02 GiB |   116.83 B | CUDA       |  99 |     2048 |  1 |    tg32 @ d8192 |         52.14 ± 1.05 |
| gpt-oss 120B MXFP4 MoE         |  59.02 GiB |   116.83 B | CUDA       |  99 |     2048 |  1 | pp2048 @ d16384 |       1943.94 ± 8.16 |
| gpt-oss 120B MXFP4 MoE         |  59.02 GiB |   116.83 B | CUDA       |  99 |     2048 |  1 |   tg32 @ d16384 |         49.17 ± 0.61 |
| gpt-oss 120B MXFP4 MoE         |  59.02 GiB |   116.83 B | CUDA       |  99 |     2048 |  1 | pp2048 @ d32768 |       1547.01 ± 6.22 |
| gpt-oss 120B MXFP4 MoE         |  59.02 GiB |   116.83 B | CUDA       |  99 |     2048 |  1 |   tg32 @ d32768 |         42.25 ± 0.47 |

build: 24d2ee052 (8201)

I am using the docker installation of https://github.com/eugr/spark-vllm-docke with the qwen 3.5 model on my solo spark DGX. Model works fine but I do not manage to get the openclaw tool calling working with the local model.

Has someone managed to do this?

Config in Openclaw:

“providers”: {

  "localqwen": {

“baseUrl”: “http://192.168.1.155:8888/v1”,

“auth”: “api-key”,

“api”: “openai-responses”,

“models”: [

      {

“id”: “Intel/Qwen3.5-122B-A10B-int4-AutoRound”,

“name”: “Qwen3.5 122B”,

“api”: “openai-responses”,

“reasoning”: true,

“input”: [

“text”

        \],

“cost”: {

“input”: 0,

“output”: 0,

“cacheRead”: 0,

“cacheWrite”: 0

        },

“contextWindow”: 120000,

“maxTokens”: 8192

      }

    \]

  }

}

Docker Start Config:

./launch-cluster.sh -t vllm-node-tf5 \

–apply-mod mods/fix-qwen3.5-autoround \

-e VLLM_MARLIN_USE_ATOMIC_ADD=1 \

–solo exec vllm serve Intel/Qwen3.5-122B-A10B-int4-AutoRound \

–max-model-len 262144 \

–gpu-memory-utilization 0.90 \

–enforce-eager \

–port 8888 \

–host 0.0.0.0 \

–load-format fastsafetensors \

–enable-auto-tool-choice \

–tool-call-parser qwen3_coder \

–reasoning-parser qwen3 \

–trust-remote-code

I don’t use OpenClaw, but tool calling works with other tools (and Qwen 3.5).

Have a look over here:

where the cowork of eugr’s container and helpers with OpenClaw has been discussed. There is an example to use with Qwen3 Coder Next that you could compare and adapt.

You can talk to the model via openclaw gateway in text? Does it respond to a picture?
I use “api”: “openai-completions”, “reasoning”: false,
I do not run your model. I run the quen 3 coder variant.
I would need to see the log of vllm and openclaw to look into.
Good luck!

Could be related to Responses endpoint in vLLM - can you switch to "api": "openai-completions" and try again?

thanks i have it working now with api: openai-completions, but what i needed to change as well is to use the proper parser on the vllm side: --enable-auto-tool-choice --tool-call-parser qwen3_xml.

That’s interesting!

I’ve run into an odd problem which appears to be linked to running the Playwright MCP.

I’ve been having good success with the Qwen/Qwen3.5-122B-A10B-FP8 model, cruising along at around 30 tps with Claude Code. However, I started looking at some UI code and added the Playwright MCP so that the model could verify the style changes being made.

Suddenly, the generation rate plummeted to 1 or 2 tps and never recovered. Is this perhaps related to the fact that the MCP captures images? It doesn’t seem to be doing that all the time, so I don’t know what the problem is. I might try disabling the snapshot capability and see if that makes a difference.

Here’s my startup. Is there anything dramatically wrong here?

cd spark-vllm-docker
./launch-cluster.sh \
  -e HF_TOKEN \
  -t vllm-node-q35 \
  exec vllm serve Qwen/Qwen3.5-122B-A10B-FP8 \
    --tensor-parallel-size 2 \
    --trust-remote-code \
    --enable-auto-tool-choice \
    --enable-prefix-caching \
    --tool-call-parser qwen3_coder \
    --reasoning-parser qwen3 \
    --gpu-memory-utilization 0.7 \
    --max-model-len auto \
    --no-enable-log-requests \
    --max-num-seqs 2 \
    --max-num-batched-tokens 8192 \
    --load-format fastsafetensors \
    --attention-backend flashinfer \
    --mm-encoder-tp-mode data \
    --mm-processor-cache-gb 0 \
    --distributed-executor-backend ray \
    --served-model-name Qwen3.5-122B-A10B \
    --host 0.0.0.0 \
    --port 8000

Nothing wrong with this one, but is there any reason to limit concurrency here, given that that’s where Spark shines?

As for you main question - did generation speed also suffer after restarting vLLM, or did it get back to normal? What was GPU load showing at the time - any “stuck” requests?

No reason to limit it except that for a single user, I thought it may help.

Some further observations. I’m beginning to suspect that it’s Claude Code that’s causing a problem. I’m routing everything through LiteLLM and that’s doing a translation from the Anthropic API to the OpenAI API and maybe that’s the problem. My guess is that this is triggering some memory leak in vLLM that degrades performance.

I’ve switched back to OpenCode, and so far, no problems.

Maybe I spoke too soon. The token rate has just dropped significantly during a session from about 27-30 tps down to 1-2 tps. The drop is sudden and not gradual. I might take LiteLLM out of the loop to see if that helps.

FYI, Claude Code works just fine with vLLM directly, no LiteLLM needed.