Qwen/Qwen3.5-122B-A10B - Alibaba/Qwen thought about us... :-D

Personally i use OWUI with different workspace with 20 mcp per workspace so i want to try !

I’M AT HOME IN FEW MINUTES

@cosinus you’re in

Not yet :) Maybe later today. Do you have two sparks? It won’t fit on a single one.

All thinking Qwen models tend to think too much :)

@eugr sadly for now i only have one :(

Then you need a 4-bit quant. You can try this one: QuantTrio/Qwen3.5-122B-A10B-AWQ · Hugging Face
There are some NVFP4 quants, but I’d wait for one from a reputable quant maker.

Or wait for INT4-autoround quants, looks like they may work better.

It looks like this model is pretty tricky to quantize, given that all quants (other than FP8) are larger in size than usual.

I was able to install and test Qwen3.5-35B-A3B, I uploaded the instructions here: GitHub - adadrag/qwen3.5-dgx-spark: Complete guide to running Qwen3.5-35B-A3B on NVIDIA DGX Spark (GB10) with vLLM - installation, benchmarks, vision features, and troubleshooting

Just ask your claude code to install it using my link

Tried to run Sehyo/Qwen3.5-122B-A10B-NVFP4 · Hugging Face with gpu memory allocation of 0.7 on dual setup. Latest/fresh spark-vllm-docker build, got past loading safetensors and then a hard lockup. Can’t SSH to the box anymore so will need to wait until I get home to try again :-(

Anybody else had luck with this or other quants of this model?

Running a full benchmark now, but here is FP8 performance in the cluster:

model test t/s peak t/s ttfr (ms) est_ppt (ms) e2e_ttft (ms)
Qwen/Qwen3.5-122B-A10B-FP8 pp2048 2568.53 ± 771.70 910.06 ± 343.54 900.94 ± 343.54 910.27 ± 343.54
Qwen/Qwen3.5-122B-A10B-FP8 tg32 29.60 ± 2.29 31.82 ± 0.60

llama-benchy (0.3.1)
date: 2026-02-25 11:55:36 | latency mode: api

We’ll be making recipes for all these, but for this model you need to increase max batched tokens, otherwise it will complain on start. Here is a working launch command (modify for your needs).

 ./launch-cluster.sh -t vllm-node-20260225 \
exec vllm serve Qwen/Qwen3.5-122B-A10B-FP8 \
--max-model-len 262144 \
--gpu-memory-utilization 0.7 \
--port 8888 --host 0.0.0.0 \
--load-format fastsafetensors \
--enable-prefix-caching \
--enable-auto-tool-choice --tool-call-parser qwen3_coder \
-tp 2 --distributed-executor-backend ray \
--max-num-batched-tokens 8192

Yeah - I got the error on minimum batched tokens related to mamba and fixed that. Something else further in the process locked it up. Also see QuantTrio has an AWQ with a config for speculative decoding

My head Spark suddenly shut down when I was almost at the last stage of my benchmarking (going through 200K context). Trying again.

you’ve got to be kidding me. Now the second one crashed at about the same point.

Same here it seems. One server disconnect at pp8192/tg128
Correction it happend on Intel/Qwen3-Coder-Next-int4-AutoRound, did a spin up of this and Qwen/Qwen3.5-122B-A10B-FP8 so mixed them up. Fresh build a few hours ago so maybe its upstream

OOM error?

No OOM, no thermal runaway, I’ve been monitoring temps, they got to 85C, but not more… I even ran in --non-privileged and 0.7 memory utilization. No swap used.
Looks like watchdog strikes again. It didn’t get a ping within 10s and shut down the machine.

Well, at least it crashed during the last llama-benchy iteration, so I have a full benchmark:

model test t/s peak t/s ttfr (ms) est_ppt (ms) e2e_ttft (ms)
Qwen/Qwen3.5-122B-A10B-FP8 pp2048 2367.59 ± 704.72 966.98 ± 319.38 958.59 ± 319.38 967.15 ± 319.39
Qwen/Qwen3.5-122B-A10B-FP8 tg32 30.10 ± 2.05 32.05 ± 0.74
Qwen/Qwen3.5-122B-A10B-FP8 ctx_pp @ d4096 3006.82 ± 177.43 1376.05 ± 84.38 1367.66 ± 84.38 1376.21 ± 84.34
Qwen/Qwen3.5-122B-A10B-FP8 ctx_tg @ d4096 31.28 ± 0.12 32.29 ± 0.13
Qwen/Qwen3.5-122B-A10B-FP8 pp2048 @ d4096 1517.78 ± 19.97 1357.96 ± 17.82 1349.57 ± 17.82 1358.38 ± 17.57
Qwen/Qwen3.5-122B-A10B-FP8 tg32 @ d4096 31.92 ± 0.65 32.96 ± 0.67
Qwen/Qwen3.5-122B-A10B-FP8 ctx_pp @ d8192 3577.72 ± 15.48 2298.62 ± 10.15 2290.24 ± 10.15 2298.78 ± 10.10
Qwen/Qwen3.5-122B-A10B-FP8 ctx_tg @ d8192 31.44 ± 0.55 32.46 ± 0.57
Qwen/Qwen3.5-122B-A10B-FP8 pp2048 @ d8192 1507.92 ± 4.20 1366.55 ± 3.78 1358.17 ± 3.78 1366.71 ± 3.78
Qwen/Qwen3.5-122B-A10B-FP8 tg32 @ d8192 30.94 ± 0.05 31.33 ± 0.47
Qwen/Qwen3.5-122B-A10B-FP8 ctx_pp @ d16384 3441.90 ± 16.78 4768.95 ± 23.25 4760.57 ± 23.25 4769.09 ± 23.26
Qwen/Qwen3.5-122B-A10B-FP8 ctx_tg @ d16384 31.14 ± 0.81 31.78 ± 1.10
Qwen/Qwen3.5-122B-A10B-FP8 pp2048 @ d16384 1424.31 ± 4.03 1446.28 ± 4.06 1437.90 ± 4.06 1446.50 ± 4.06
Qwen/Qwen3.5-122B-A10B-FP8 tg32 @ d16384 30.59 ± 0.09 31.00 ± 0.00
Qwen/Qwen3.5-122B-A10B-FP8 ctx_pp @ d65535 2877.89 ± 16.82 22781.44 ± 132.84 22773.05 ± 132.84 22781.57 ± 132.92
Qwen/Qwen3.5-122B-A10B-FP8 ctx_tg @ d65535 29.00 ± 0.09 29.67 ± 0.47
Qwen/Qwen3.5-122B-A10B-FP8 pp2048 @ d65535 1223.28 ± 22.50 1683.15 ± 31.21 1674.77 ± 31.21 1683.27 ± 31.18
Qwen/Qwen3.5-122B-A10B-FP8 tg32 @ d65535 29.13 ± 0.84 29.67 ± 0.94
Qwen/Qwen3.5-122B-A10B-FP8 ctx_pp @ d100000 2443.81 ± 78.48 40971.69 ± 1344.62 40963.31 ± 1344.62 40971.91 ± 1344.73
Qwen/Qwen3.5-122B-A10B-FP8 ctx_tg @ d100000 28.49 ± 0.41 29.33 ± 0.47
Qwen/Qwen3.5-122B-A10B-FP8 pp2048 @ d100000 830.49 ± 12.11 2474.92 ± 36.20 2466.53 ± 36.20 2475.06 ± 36.20
Qwen/Qwen3.5-122B-A10B-FP8 tg32 @ d100000 28.03 ± 0.11 29.00 ± 0.00
Qwen/Qwen3.5-122B-A10B-FP8 ctx_pp @ d200000 1835.82 ± 11.99 108956.87 ± 711.41 108948.49 ± 711.41 108957.28 ± 711.62
Qwen/Qwen3.5-122B-A10B-FP8 ctx_tg @ d200000 24.93 ± 0.06 26.00 ± 0.00
Qwen/Qwen3.5-122B-A10B-FP8 pp2048 @ d200000 609.40 ± 5.58 3369.37 ± 30.79 3360.98 ± 30.79 3370.14 ± 30.12
Qwen/Qwen3.5-122B-A10B-FP8 tg32 @ d200000 24.41 ± 0.20 25.00 ± 0.00

llama-benchy (0.3.1)
date: 2026-02-25 12:27:44 | latency mode: api

Seems that the AWQ versions need a patch. The vllm-node version failed with the rope utils for me, standard nightly worked.

# upgrade transformers so that applications could properly execute tool calls
pip install -U "transformers @ git+https://github.com/huggingface/transformers.git@f2ba019"
# locate modeling_rope_utils.py line 651 to fix a simple bug
TF_FILE="$(python -m pip show transformers | awk -F': ' '/^Location:/{print $2}')/transformers/modeling_rope_utils.py" && echo "$TF_FILE"
NEW_LINE='            ignore_keys_at_rope_validation = set(ignore_keys_at_rope_validation) | {"partial_rotary_factor"}' \
perl -i.bak -pe 'if ($. == 651) { $_ = $ENV{NEW_LINE} . "\n" }' "$TF_FILE"

Found that on QuantTrio/Qwen3.5-122B-A10B-AWQ · Hugging Face

and as deps

vllm>=0.16.0rc2.dev447
transformers>=5.3.0.dev0

Will look into that tomorrow.

ah, so need to run this with transformers 5 then.
need to build vllm-node with --tf5 flag

i try with that and report back, i try with –tf5 flag for the build of vllm-node :)

Seems that the AWQ versions need a patch. The vllm-node version failed with the rope utils for me, standard nightly worked.

# upgrade transformers so that applications could properly execute tool calls
pip install -U "transformers @ git+https://github.com/huggingface/transformers.git@f2ba019"
# locate modeling_rope_utils.py line 651 to fix a simple bug
TF_FILE="$(python -m pip show transformers | awk -F': ' '/^Location:/{print $2}')/transformers/modeling_rope_utils.py" && echo "$TF_FILE"
NEW_LINE='            ignore_keys_at_rope_validation = set(ignore_keys_at_rope_validation) | {"partial_rotary_factor"}' \
perl -i.bak -pe 'if ($. == 651) { $_ = $ENV{NEW_LINE} . "\n" }' "$TF_FILE"

Found that on QuantTrio/Qwen3.5-122B-A10B-AWQ · Hugging Face

and as deps

vllm>=0.16.0rc2.dev447
transformers>=5.3.0.dev0

Will look into that tomorrow.

Simple stupid question: what does that mean for eugr/spark-vllm-docker and ./build-and-copy.sh?

./build-and-copy.sh --tf5 ??