try to power off the system and fully unplug the power cable/adapter, wait about 5 mins, and run again.
It affects âfasterâ models more than more dense ones. Also, the problem seems to be in the firmware, not the kernel version. NVIDIA is aware, they found an issue and working on the fix which should be released shortly.
Personally, for most people I would recommend to wait rather than trying to downgrade the firmware and kernel versions.
OK, that looks normal. What cable are you using for interconnect?
I remember back in November when we were just trying to get it working in the cluster, there were some people with cables that didnât work properly.
However, judging by your benchmarking scores from a few hours ago, it looks like 30 tok/s is being output normally even at reduced bandwidth. Doesnât that suggest my issue might be?
Oh, actually look at @cho answer above.
There seems to be a separate power delivery issue where GPU doesnât want to go above certain power level.
The fix seems to be to power everything off, disconnect the power brick from the wall and Spark, wait a few minutes, reconnect back and it will start working normally again. Itâs important to disconnect the power brick from the wall, not just disconnect power cable from Spark, so it resets the power brickâs internal USB-C controller.
holy, what kind of magic is this? The latency has even decreased more than usual. Huge thanks to âchoâ and âeugrâ, amazing.
As a developer myself, I work with code every day, but Iâm genuinely amazed at how one even discovers solutions like this.
Iâm unable to pinpoint exactly as Iâve both updated the eugr image, rebuilt vllm and flash-infer and tf5 and replaced the chat template with unslothâs fixed version over the last few days, but today I decided to turn thinking back on for 122b-A10B int4 autoround and it no longer stops out of the blue for no apparent reason in Claude Code! It also definitely seems more intelligent with thinking turned on, itâs been struggling with a project of mine over the last few days but making good progress today.
I have 2 different cables. 1 (currently connected) directly purchased from Nvidia, and1 (spare) from www.naddod.com. Results are the same with both.
I just noticed your reply to Choâs post about power source reset and I am about to test that out, which makes perfect sense to me now as I have noticed both of my sparks are not getting as hot as they used to do⊠Will report soon my results, but quite sure it will be much better then what I get now.
Oh WOW!!! Holly guacamole. So the power reset fixes tons of issues. HUGE thanks to @cho and then @eugr for suggesting this USB/power reset!
- SCP speeds between my two sparks via the ConnectX interfaces is back to 700+MB/s
- Overall system responsiveness doubled
- Look at these latest llama-bency results below:
| model | test | t/s | peak t/s | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |
|:---------------------------|----------------:|----------------:|-------------:|---------------:|---------------:|----------------:|
| Qwen/Qwen3.5-122B-A10B-FP8 | pp2048 | 3199.61 ± 47.09 | | 641.82 ± 9.57 | 640.64 ± 9.57 | 641.88 ± 9.55 |
| Qwen/Qwen3.5-122B-A10B-FP8 | tg32 | 32.53 ± 0.05 | 33.58 ± 0.05 | | | |
| Qwen/Qwen3.5-122B-A10B-FP8 | ctx_pp @ d4096 | 3205.98 ± 5.79 | | 1279.12 ± 2.31 | 1277.93 ± 2.31 | 1279.20 ± 2.32 |
| Qwen/Qwen3.5-122B-A10B-FP8 | ctx_tg @ d4096 | 32.51 ± 0.10 | 33.56 ± 0.11 | | | |
| Qwen/Qwen3.5-122B-A10B-FP8 | pp2048 @ d4096 | 1580.85 ± 4.04 | | 1296.70 ± 3.30 | 1295.52 ± 3.30 | 1296.77 ± 3.28 |
| Qwen/Qwen3.5-122B-A10B-FP8 | tg32 @ d4096 | 32.35 ± 0.07 | 33.40 ± 0.07 | | | |
| Qwen/Qwen3.5-122B-A10B-FP8 | ctx_pp @ d8192 | 3687.03 ± 2.19 | | 2223.21 ± 1.44 | 2222.03 ± 1.44 | 2223.30 ± 1.45 |
| Qwen/Qwen3.5-122B-A10B-FP8 | ctx_tg @ d8192 | 32.20 ± 0.03 | 33.24 ± 0.03 | | | |
| Qwen/Qwen3.5-122B-A10B-FP8 | pp2048 @ d8192 | 1533.29 ± 2.15 | | 1336.88 ± 1.87 | 1335.69 ± 1.87 | 1336.96 ± 1.89 |
| Qwen/Qwen3.5-122B-A10B-FP8 | tg32 @ d8192 | 32.05 ± 0.06 | 33.08 ± 0.06 | | | |
| Qwen/Qwen3.5-122B-A10B-FP8 | ctx_pp @ d16384 | 3549.36 ± 4.08 | | 4617.42 ± 5.43 | 4616.23 ± 5.43 | 4617.49 ± 5.42 |
| Qwen/Qwen3.5-122B-A10B-FP8 | ctx_tg @ d16384 | 31.71 ± 0.09 | 32.73 ± 0.09 | | | |
| Qwen/Qwen3.5-122B-A10B-FP8 | pp2048 @ d16384 | 1472.10 ± 6.10 | | 1392.42 ± 5.76 | 1391.23 ± 5.76 | 1392.54 ± 5.74 |
| Qwen/Qwen3.5-122B-A10B-FP8 | tg32 @ d16384 | 31.81 ± 0.04 | 32.84 ± 0.04 | | | |
| Qwen/Qwen3.5-122B-A10B-FP8 | ctx_pp @ d32768 | 3411.20 ± 3.03 | | 9607.58 ± 8.39 | 9606.39 ± 8.39 | 9607.67 ± 8.42 |
| Qwen/Qwen3.5-122B-A10B-FP8 | ctx_tg @ d32768 | 31.12 ± 0.04 | 32.12 ± 0.04 | | | |
| Qwen/Qwen3.5-122B-A10B-FP8 | pp2048 @ d32768 | 1377.48 ± 1.45 | | 1487.97 ± 1.56 | 1486.78 ± 1.56 | 1488.03 ± 1.57 |
| Qwen/Qwen3.5-122B-A10B-FP8 | tg32 @ d32768 | 31.03 ± 0.12 | 31.75 ± 0.53 | | | |
llama-benchy (0.3.4)
date: 2026-03-05 15:43:16 | latency mode: api
- NCCL speed increased (a bit)
# nccl-tests version 2.17.9 nccl-headers=22803 nccl-library=22803
# Collective test starting: all_gather_perf
# nThread 1 nGpus 1 minBytes 17179869184 maxBytes 17179869184 step: 2(factor) warmup iters: 1 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
# Rank 0 Group 0 Pid 10912 on spark1 device 0 [000f:01:00] NVIDIA GB10
# Rank 1 Group 0 Pid 6529 on spark2 device 0 [000f:01:00] NVIDIA GB10
#
# out-of-place in-place
# size count type redop root time algbw busbw #wrong time algbw busbw #wrong
# (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s)
17179869184 2147483648 float none -1 370301 46.39 23.20 0 355331 48.35 24.17 0
# Out of bounds values : 0 OK
# Avg bus bandwidth : 23.6858
#
# Collective test concluded: all_gather_perf
#
And here is where it gets really interestingâŠ
- Look at the llama-bech gpt-oss-120b-mxfp4 test results pre and post the USB/power reset
Pre:
(base) kosta@spark1:~/working/llama.cpp$ build/bin/llama-bench \
> -m /home/kosta/.cache/huggingface/hub/models--ggml-org--gpt-oss-120b-GGUF/snapshots/d932fcea62f83e088d8f076a2cd2d7eb02dfa682/gpt-oss-120b-mxfp4-00001-of-00003.gguf \
> -fa 1 \
> -d 0,4096,8192,16384,32768 \
> -p 2048 \
> -n 32 \
> -ub 2048 \
> -mmp 0 \
> -o md 2>&1 | tee /tmp/mxfp4_bench_results.log
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GB10, compute capability 12.1, VMM: yes
| model | size | params | backend | ngl | n_ubatch | fa | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -------: | -: | --------------: | -------------------: |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | 99 | 2048 | 1 | pp2048 | 831.35 ± 1.54 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | 99 | 2048 | 1 | tg32 | 31.49 ± 0.12 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | 99 | 2048 | 1 | pp2048 @ d4096 | 783.61 ± 0.88 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | 99 | 2048 | 1 | tg32 @ d4096 | 28.20 ± 0.12 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | 99 | 2048 | 1 | pp2048 @ d8192 | 739.70 ± 2.44 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | 99 | 2048 | 1 | tg32 @ d8192 | 28.44 ± 0.23 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | 99 | 2048 | 1 | pp2048 @ d16384 | 659.57 ± 1.78 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | 99 | 2048 | 1 | tg32 @ d16384 | 27.43 ± 0.12 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | 99 | 2048 | 1 | pp2048 @ d32768 | 547.69 ± 1.52 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | 99 | 2048 | 1 | tg32 @ d32768 | 24.44 ± 0.08 |
build: 24d2ee052 (8201)
Post;
(base) kosta@spark1:~/working/llama.cpp$ build/bin/llama-bench -m /home/kosta/.cache/huggingface/hub/models--ggml-org--gpt-oss-120b-GGUF/snapshots/d932fcea62f83e088d8f076a2cd2d7eb02dfa682/gpt-oss-120b-mxfp4-00001-of-00003.gguf -fa 1 -d 0,4096,8192,16384,32768 -p 2048 -n 32 -ub 2048 -mmp 0 -o md 2>&1 | tee /tmp/mxfp4_bench_results.log
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GB10, compute capability 12.1, VMM: yes
| model | size | params | backend | ngl | n_ubatch | fa | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -------: | -: | --------------: | -------------------: |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | 99 | 2048 | 1 | pp2048 | 2456.16 ± 10.28 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | 99 | 2048 | 1 | tg32 | 58.24 ± 0.43 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | 99 | 2048 | 1 | pp2048 @ d4096 | 2333.90 ± 6.05 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | 99 | 2048 | 1 | tg32 @ d4096 | 55.71 ± 0.75 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | 99 | 2048 | 1 | pp2048 @ d8192 | 2242.95 ± 7.48 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | 99 | 2048 | 1 | tg32 @ d8192 | 52.14 ± 1.05 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | 99 | 2048 | 1 | pp2048 @ d16384 | 1943.94 ± 8.16 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | 99 | 2048 | 1 | tg32 @ d16384 | 49.17 ± 0.61 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | 99 | 2048 | 1 | pp2048 @ d32768 | 1547.01 ± 6.22 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | 99 | 2048 | 1 | tg32 @ d32768 | 42.25 ± 0.47 |
build: 24d2ee052 (8201)
I am using the docker installation of https://github.com/eugr/spark-vllm-docke with the qwen 3.5 model on my solo spark DGX. Model works fine but I do not manage to get the openclaw tool calling working with the local model.
Has someone managed to do this?
Config in Openclaw:
âprovidersâ: {
"localqwen": {
âbaseUrlâ: âhttp://192.168.1.155:8888/v1â,
âauthâ: âapi-keyâ,
âapiâ: âopenai-responsesâ,
âmodelsâ: [
{
âidâ: âIntel/Qwen3.5-122B-A10B-int4-AutoRoundâ,
ânameâ: âQwen3.5 122Bâ,
âapiâ: âopenai-responsesâ,
âreasoningâ: true,
âinputâ: [
âtextâ
\],
âcostâ: {
âinputâ: 0,
âoutputâ: 0,
âcacheReadâ: 0,
âcacheWriteâ: 0
},
âcontextWindowâ: 120000,
âmaxTokensâ: 8192
}
\]
}
}
Docker Start Config:
./launch-cluster.sh -t vllm-node-tf5 \
âapply-mod mods/fix-qwen3.5-autoround \
-e VLLM_MARLIN_USE_ATOMIC_ADD=1 \
âsolo exec vllm serve Intel/Qwen3.5-122B-A10B-int4-AutoRound \
âmax-model-len 262144 \
âgpu-memory-utilization 0.90 \
âenforce-eager \
âport 8888 \
âhost 0.0.0.0 \
âload-format fastsafetensors \
âenable-auto-tool-choice \
âtool-call-parser qwen3_coder \
âreasoning-parser qwen3 \
âtrust-remote-code
I donât use OpenClaw, but tool calling works with other tools (and Qwen 3.5).
Have a look over here:
where the cowork of eugrâs container and helpers with OpenClaw has been discussed. There is an example to use with Qwen3 Coder Next that you could compare and adapt.
You can talk to the model via openclaw gateway in text? Does it respond to a picture?
I use âapiâ: âopenai-completionsâ, âreasoningâ: false,
I do not run your model. I run the quen 3 coder variant.
I would need to see the log of vllm and openclaw to look into.
Good luck!
Could be related to Responses endpoint in vLLM - can you switch to "api": "openai-completions" and try again?
thanks i have it working now with api: openai-completions, but what i needed to change as well is to use the proper parser on the vllm side: --enable-auto-tool-choice --tool-call-parser qwen3_xml.
Thatâs interesting!
Iâve run into an odd problem which appears to be linked to running the Playwright MCP.
Iâve been having good success with the Qwen/Qwen3.5-122B-A10B-FP8 model, cruising along at around 30 tps with Claude Code. However, I started looking at some UI code and added the Playwright MCP so that the model could verify the style changes being made.
Suddenly, the generation rate plummeted to 1 or 2 tps and never recovered. Is this perhaps related to the fact that the MCP captures images? It doesnât seem to be doing that all the time, so I donât know what the problem is. I might try disabling the snapshot capability and see if that makes a difference.
Hereâs my startup. Is there anything dramatically wrong here?
cd spark-vllm-docker
./launch-cluster.sh \
-e HF_TOKEN \
-t vllm-node-q35 \
exec vllm serve Qwen/Qwen3.5-122B-A10B-FP8 \
--tensor-parallel-size 2 \
--trust-remote-code \
--enable-auto-tool-choice \
--enable-prefix-caching \
--tool-call-parser qwen3_coder \
--reasoning-parser qwen3 \
--gpu-memory-utilization 0.7 \
--max-model-len auto \
--no-enable-log-requests \
--max-num-seqs 2 \
--max-num-batched-tokens 8192 \
--load-format fastsafetensors \
--attention-backend flashinfer \
--mm-encoder-tp-mode data \
--mm-processor-cache-gb 0 \
--distributed-executor-backend ray \
--served-model-name Qwen3.5-122B-A10B \
--host 0.0.0.0 \
--port 8000
Nothing wrong with this one, but is there any reason to limit concurrency here, given that thatâs where Spark shines?
As for you main question - did generation speed also suffer after restarting vLLM, or did it get back to normal? What was GPU load showing at the time - any âstuckâ requests?
No reason to limit it except that for a single user, I thought it may help.
Some further observations. Iâm beginning to suspect that itâs Claude Code thatâs causing a problem. Iâm routing everything through LiteLLM and thatâs doing a translation from the Anthropic API to the OpenAI API and maybe thatâs the problem. My guess is that this is triggering some memory leak in vLLM that degrades performance.
Iâve switched back to OpenCode, and so far, no problems.
Maybe I spoke too soon. The token rate has just dropped significantly during a session from about 27-30 tps down to 1-2 tps. The drop is sudden and not gradual. I might take LiteLLM out of the loop to see if that helps.
FYI, Claude Code works just fine with vLLM directly, no LiteLLM needed.