Personally i use OWUI with different workspace with 20 mcp per workspace so i want to try !
I’M AT HOME IN FEW MINUTES
Personally i use OWUI with different workspace with 20 mcp per workspace so i want to try !
I’M AT HOME IN FEW MINUTES
@cosinus you’re in
Not yet :) Maybe later today. Do you have two sparks? It won’t fit on a single one.
All thinking Qwen models tend to think too much :)
@eugr sadly for now i only have one :(
Then you need a 4-bit quant. You can try this one: QuantTrio/Qwen3.5-122B-A10B-AWQ · Hugging Face
There are some NVFP4 quants, but I’d wait for one from a reputable quant maker.
Or wait for INT4-autoround quants, looks like they may work better.
It looks like this model is pretty tricky to quantize, given that all quants (other than FP8) are larger in size than usual.
I was able to install and test Qwen3.5-35B-A3B, I uploaded the instructions here: GitHub - adadrag/qwen3.5-dgx-spark: Complete guide to running Qwen3.5-35B-A3B on NVIDIA DGX Spark (GB10) with vLLM - installation, benchmarks, vision features, and troubleshooting
Just ask your claude code to install it using my link
Tried to run Sehyo/Qwen3.5-122B-A10B-NVFP4 · Hugging Face with gpu memory allocation of 0.7 on dual setup. Latest/fresh spark-vllm-docker build, got past loading safetensors and then a hard lockup. Can’t SSH to the box anymore so will need to wait until I get home to try again :-(
Anybody else had luck with this or other quants of this model?
Running a full benchmark now, but here is FP8 performance in the cluster:
| model | test | t/s | peak t/s | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |
|---|---|---|---|---|---|---|
| Qwen/Qwen3.5-122B-A10B-FP8 | pp2048 | 2568.53 ± 771.70 | 910.06 ± 343.54 | 900.94 ± 343.54 | 910.27 ± 343.54 | |
| Qwen/Qwen3.5-122B-A10B-FP8 | tg32 | 29.60 ± 2.29 | 31.82 ± 0.60 |
llama-benchy (0.3.1)
date: 2026-02-25 11:55:36 | latency mode: api
We’ll be making recipes for all these, but for this model you need to increase max batched tokens, otherwise it will complain on start. Here is a working launch command (modify for your needs).
./launch-cluster.sh -t vllm-node-20260225 \
exec vllm serve Qwen/Qwen3.5-122B-A10B-FP8 \
--max-model-len 262144 \
--gpu-memory-utilization 0.7 \
--port 8888 --host 0.0.0.0 \
--load-format fastsafetensors \
--enable-prefix-caching \
--enable-auto-tool-choice --tool-call-parser qwen3_coder \
-tp 2 --distributed-executor-backend ray \
--max-num-batched-tokens 8192
Yeah - I got the error on minimum batched tokens related to mamba and fixed that. Something else further in the process locked it up. Also see QuantTrio has an AWQ with a config for speculative decoding
My head Spark suddenly shut down when I was almost at the last stage of my benchmarking (going through 200K context). Trying again.
you’ve got to be kidding me. Now the second one crashed at about the same point.
Same here it seems. One server disconnect at pp8192/tg128
Correction it happend on Intel/Qwen3-Coder-Next-int4-AutoRound, did a spin up of this and Qwen/Qwen3.5-122B-A10B-FP8 so mixed them up. Fresh build a few hours ago so maybe its upstream
OOM error?
No OOM, no thermal runaway, I’ve been monitoring temps, they got to 85C, but not more… I even ran in --non-privileged and 0.7 memory utilization. No swap used.
Looks like watchdog strikes again. It didn’t get a ping within 10s and shut down the machine.
Well, at least it crashed during the last llama-benchy iteration, so I have a full benchmark:
| model | test | t/s | peak t/s | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |
|---|---|---|---|---|---|---|
| Qwen/Qwen3.5-122B-A10B-FP8 | pp2048 | 2367.59 ± 704.72 | 966.98 ± 319.38 | 958.59 ± 319.38 | 967.15 ± 319.39 | |
| Qwen/Qwen3.5-122B-A10B-FP8 | tg32 | 30.10 ± 2.05 | 32.05 ± 0.74 | |||
| Qwen/Qwen3.5-122B-A10B-FP8 | ctx_pp @ d4096 | 3006.82 ± 177.43 | 1376.05 ± 84.38 | 1367.66 ± 84.38 | 1376.21 ± 84.34 | |
| Qwen/Qwen3.5-122B-A10B-FP8 | ctx_tg @ d4096 | 31.28 ± 0.12 | 32.29 ± 0.13 | |||
| Qwen/Qwen3.5-122B-A10B-FP8 | pp2048 @ d4096 | 1517.78 ± 19.97 | 1357.96 ± 17.82 | 1349.57 ± 17.82 | 1358.38 ± 17.57 | |
| Qwen/Qwen3.5-122B-A10B-FP8 | tg32 @ d4096 | 31.92 ± 0.65 | 32.96 ± 0.67 | |||
| Qwen/Qwen3.5-122B-A10B-FP8 | ctx_pp @ d8192 | 3577.72 ± 15.48 | 2298.62 ± 10.15 | 2290.24 ± 10.15 | 2298.78 ± 10.10 | |
| Qwen/Qwen3.5-122B-A10B-FP8 | ctx_tg @ d8192 | 31.44 ± 0.55 | 32.46 ± 0.57 | |||
| Qwen/Qwen3.5-122B-A10B-FP8 | pp2048 @ d8192 | 1507.92 ± 4.20 | 1366.55 ± 3.78 | 1358.17 ± 3.78 | 1366.71 ± 3.78 | |
| Qwen/Qwen3.5-122B-A10B-FP8 | tg32 @ d8192 | 30.94 ± 0.05 | 31.33 ± 0.47 | |||
| Qwen/Qwen3.5-122B-A10B-FP8 | ctx_pp @ d16384 | 3441.90 ± 16.78 | 4768.95 ± 23.25 | 4760.57 ± 23.25 | 4769.09 ± 23.26 | |
| Qwen/Qwen3.5-122B-A10B-FP8 | ctx_tg @ d16384 | 31.14 ± 0.81 | 31.78 ± 1.10 | |||
| Qwen/Qwen3.5-122B-A10B-FP8 | pp2048 @ d16384 | 1424.31 ± 4.03 | 1446.28 ± 4.06 | 1437.90 ± 4.06 | 1446.50 ± 4.06 | |
| Qwen/Qwen3.5-122B-A10B-FP8 | tg32 @ d16384 | 30.59 ± 0.09 | 31.00 ± 0.00 | |||
| Qwen/Qwen3.5-122B-A10B-FP8 | ctx_pp @ d65535 | 2877.89 ± 16.82 | 22781.44 ± 132.84 | 22773.05 ± 132.84 | 22781.57 ± 132.92 | |
| Qwen/Qwen3.5-122B-A10B-FP8 | ctx_tg @ d65535 | 29.00 ± 0.09 | 29.67 ± 0.47 | |||
| Qwen/Qwen3.5-122B-A10B-FP8 | pp2048 @ d65535 | 1223.28 ± 22.50 | 1683.15 ± 31.21 | 1674.77 ± 31.21 | 1683.27 ± 31.18 | |
| Qwen/Qwen3.5-122B-A10B-FP8 | tg32 @ d65535 | 29.13 ± 0.84 | 29.67 ± 0.94 | |||
| Qwen/Qwen3.5-122B-A10B-FP8 | ctx_pp @ d100000 | 2443.81 ± 78.48 | 40971.69 ± 1344.62 | 40963.31 ± 1344.62 | 40971.91 ± 1344.73 | |
| Qwen/Qwen3.5-122B-A10B-FP8 | ctx_tg @ d100000 | 28.49 ± 0.41 | 29.33 ± 0.47 | |||
| Qwen/Qwen3.5-122B-A10B-FP8 | pp2048 @ d100000 | 830.49 ± 12.11 | 2474.92 ± 36.20 | 2466.53 ± 36.20 | 2475.06 ± 36.20 | |
| Qwen/Qwen3.5-122B-A10B-FP8 | tg32 @ d100000 | 28.03 ± 0.11 | 29.00 ± 0.00 | |||
| Qwen/Qwen3.5-122B-A10B-FP8 | ctx_pp @ d200000 | 1835.82 ± 11.99 | 108956.87 ± 711.41 | 108948.49 ± 711.41 | 108957.28 ± 711.62 | |
| Qwen/Qwen3.5-122B-A10B-FP8 | ctx_tg @ d200000 | 24.93 ± 0.06 | 26.00 ± 0.00 | |||
| Qwen/Qwen3.5-122B-A10B-FP8 | pp2048 @ d200000 | 609.40 ± 5.58 | 3369.37 ± 30.79 | 3360.98 ± 30.79 | 3370.14 ± 30.12 | |
| Qwen/Qwen3.5-122B-A10B-FP8 | tg32 @ d200000 | 24.41 ± 0.20 | 25.00 ± 0.00 |
llama-benchy (0.3.1)
date: 2026-02-25 12:27:44 | latency mode: api
Seems that the AWQ versions need a patch. The vllm-node version failed with the rope utils for me, standard nightly worked.
# upgrade transformers so that applications could properly execute tool calls
pip install -U "transformers @ git+https://github.com/huggingface/transformers.git@f2ba019"
# locate modeling_rope_utils.py line 651 to fix a simple bug
TF_FILE="$(python -m pip show transformers | awk -F': ' '/^Location:/{print $2}')/transformers/modeling_rope_utils.py" && echo "$TF_FILE"
NEW_LINE=' ignore_keys_at_rope_validation = set(ignore_keys_at_rope_validation) | {"partial_rotary_factor"}' \
perl -i.bak -pe 'if ($. == 651) { $_ = $ENV{NEW_LINE} . "\n" }' "$TF_FILE"
Found that on QuantTrio/Qwen3.5-122B-A10B-AWQ · Hugging Face
and as deps
vllm>=0.16.0rc2.dev447
transformers>=5.3.0.dev0
Will look into that tomorrow.
ah, so need to run this with transformers 5 then.
need to build vllm-node with --tf5 flag
i try with that and report back, i try with –tf5 flag for the build of vllm-node :)
Seems that the AWQ versions need a patch. The vllm-node version failed with the rope utils for me, standard nightly worked.
# upgrade transformers so that applications could properly execute tool calls pip install -U "transformers @ git+https://github.com/huggingface/transformers.git@f2ba019" # locate modeling_rope_utils.py line 651 to fix a simple bug TF_FILE="$(python -m pip show transformers | awk -F': ' '/^Location:/{print $2}')/transformers/modeling_rope_utils.py" && echo "$TF_FILE" NEW_LINE=' ignore_keys_at_rope_validation = set(ignore_keys_at_rope_validation) | {"partial_rotary_factor"}' \ perl -i.bak -pe 'if ($. == 651) { $_ = $ENV{NEW_LINE} . "\n" }' "$TF_FILE"Found that on QuantTrio/Qwen3.5-122B-A10B-AWQ · Hugging Face
and as deps
vllm>=0.16.0rc2.dev447 transformers>=5.3.0.dev0Will look into that tomorrow.
Simple stupid question: what does that mean for eugr/spark-vllm-docker and ./build-and-copy.sh?
./build-and-copy.sh --tf5 ??