Qwen/Qwen3.5-122B-A10B - Alibaba/Qwen thought about us... :-D

ptichalouf · February 25, 2026, 7:01pm

Personally I use OWUI with several workspaces and about 20 MCPs per workspace, so I want to try this!

I’ll be home in a few minutes.

raphael.amorim · February 25, 2026, 7:01pm

@cosinus you’re in

eugr · February 25, 2026, 7:05pm

Not yet :) Maybe later today. Do you have two sparks? It won’t fit on a single one.

eugr · February 25, 2026, 7:07pm

All thinking Qwen models tend to think too much :)

ptichalouf · February 25, 2026, 7:08pm

@eugr sadly, for now I only have one :(

eugr · February 25, 2026, 7:12pm

Then you need a 4-bit quant. You can try this one: QuantTrio/Qwen3.5-122B-A10B-AWQ · Hugging Face
There are some NVFP4 quants, but I’d wait for one from a reputable quant maker.

Or wait for INT4-autoround quants, looks like they may work better.

It looks like this model is pretty tricky to quantize, given that all quants (other than FP8) are larger in size than usual.
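As a rough illustration of why a single Spark needs a 4-bit quant, here is a back-of-the-envelope weight-size estimate. It assumes 128 GB of unified memory per Spark and counts weights only (params × bits / 8), ignoring KV cache, activations, and quantization metadata such as scales, so real checkpoints come out larger. The function name `estimate_gb` is made up for this sketch.

```shell
#!/usr/bin/env bash
# Weight-only size estimate in GB: billions of params * bits per weight / 8.
# Assumption: ignores KV cache, activations, and per-group scales/zero-points.
estimate_gb() {
  awk -v p="$1" -v b="$2" 'BEGIN { printf "%.1f\n", p * b / 8 }'
}

echo "FP8:   $(estimate_gb 122 8) GB"
echo "4-bit: $(estimate_gb 122 4) GB"
```

At FP8 the weights alone are already close to the full 128 GB, while a 4-bit quant leaves room for KV cache on one box.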

fixadvicedevices · February 25, 2026, 7:20pm

I was able to install and test Qwen3.5-35B-A3B. I uploaded the instructions here: GitHub - adadrag/qwen3.5-dgx-spark: Complete guide to running Qwen3.5-35B-A3B on NVIDIA DGX Spark (GB10) with vLLM - installation, benchmarks, vision features, and troubleshooting

Just ask Claude Code to install it using my link.

josephbreda · February 25, 2026, 8:04pm

Tried to run Sehyo/Qwen3.5-122B-A10B-NVFP4 · Hugging Face with a GPU memory allocation of 0.7 on a dual setup. With a fresh build of the latest spark-vllm-docker, it got past loading safetensors and then hit a hard lockup. I can’t SSH to the box anymore, so I’ll need to wait until I get home to try again :-(

Anybody else had luck with this or other quants of this model?

eugr · February 25, 2026, 8:07pm

Running a full benchmark now, but here is FP8 performance in the cluster:

model	test	t/s	peak t/s	ttfr (ms)	est_ppt (ms)	e2e_ttft (ms)
Qwen/Qwen3.5-122B-A10B-FP8	pp2048	2568.53 ± 771.70		910.06 ± 343.54	900.94 ± 343.54	910.27 ± 343.54
Qwen/Qwen3.5-122B-A10B-FP8	tg32	29.60 ± 2.29	31.82 ± 0.60

llama-benchy (0.3.1)
date: 2026-02-25 11:55:36 | latency mode: api

We’ll be making recipes for all these, but for this model you need to increase max batched tokens, otherwise it will complain on start. Here is a working launch command (modify for your needs).

./launch-cluster.sh -t vllm-node-20260225 \
exec vllm serve Qwen/Qwen3.5-122B-A10B-FP8 \
--max-model-len 262144 \
--gpu-memory-utilization 0.7 \
--port 8888 --host 0.0.0.0 \
--load-format fastsafetensors \
--enable-prefix-caching \
--enable-auto-tool-choice --tool-call-parser qwen3_coder \
-tp 2 --distributed-executor-backend ray \
--max-num-batched-tokens 8192
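
Once the server is up, a quick smoke test can confirm it is answering. This sketch assumes the server is reachable on localhost at port 8888 as in the command above, and uses vLLM's OpenAI-compatible `/v1/models` endpoint. The helper names `api_url` and `list_models` are made up for illustration.

```shell
#!/usr/bin/env bash
# Build the base URL for the OpenAI-compatible API.
# Assumption: defaults mirror the launch command (--port 8888), queried locally.
api_url() {
  local host="${1:-localhost}" port="${2:-8888}"
  printf 'http://%s:%s/v1\n' "$host" "$port"
}

# List the served models; fails fast if the server is not up yet.
list_models() {
  curl -sf --max-time 5 "$(api_url "$@")/models"
}

echo "Endpoint: $(api_url)"
# Run manually once the cluster is serving:
#   list_models localhost 8888
```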

josephbreda · February 25, 2026, 8:24pm

Yeah, I got the error on minimum batched tokens related to mamba and fixed that. Something else further along in the process locked it up. Also, I see QuantTrio has an AWQ with a config for speculative decoding.

eugr · February 25, 2026, 8:29pm

My head Spark suddenly shut down when I was almost at the last stage of my benchmarking (going through 200K context). Trying again.

eugr · February 25, 2026, 8:42pm

You’ve got to be kidding me. Now the second one crashed at about the same point.

grindstone · February 25, 2026, 8:42pm

Same here, it seems. One server disconnected at pp8192/tg128.
Correction: it happened on Intel/Qwen3-Coder-Next-int4-AutoRound. I spun up both that and Qwen/Qwen3.5-122B-A10B-FP8, so I mixed them up. This was a fresh build from a few hours ago, so maybe it’s upstream.

raphael.amorim · February 25, 2026, 8:49pm

OOM error?

eugr · February 25, 2026, 8:58pm

No OOM, no thermal runaway. I’ve been monitoring temps, and they got to 85 °C but no higher… I even ran with --non-privileged and 0.7 memory utilization. No swap used.
Looks like the watchdog strikes again. It didn’t get a ping within 10s and shut down the machine.
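
To check the watchdog theory after a reboot, one option is to look at the configured timeout and the tail of the previous boot's kernel log. This is a sketch that assumes the kernel exposes the watchdog through sysfs at /sys/class/watchdog/watchdog0 and that journald keeps persistent logs; neither is confirmed for the Spark image in this thread. `watchdog_timeout` is a made-up helper name.

```shell
#!/usr/bin/env bash
# Print the watchdog timeout (seconds) from a sysfs directory.
# Assumption: the device is watchdog0; pass another directory to override.
watchdog_timeout() {
  local dir="${1:-/sys/class/watchdog/watchdog0}"
  if [ -r "$dir/timeout" ]; then
    cat "$dir/timeout"
  else
    echo "no readable timeout under $dir" >&2
    return 1
  fi
}

watchdog_timeout || true
# Last kernel messages from the previous boot (needs persistent journald):
#   journalctl -k -b -1 --no-pager | tail -n 50
```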

eugr · February 25, 2026, 9:02pm

Well, at least it crashed during the last llama-benchy iteration, so I have a full benchmark:

model	test	t/s	peak t/s	ttfr (ms)	est_ppt (ms)	e2e_ttft (ms)
Qwen/Qwen3.5-122B-A10B-FP8	pp2048	2367.59 ± 704.72		966.98 ± 319.38	958.59 ± 319.38	967.15 ± 319.39
Qwen/Qwen3.5-122B-A10B-FP8	tg32	30.10 ± 2.05	32.05 ± 0.74
Qwen/Qwen3.5-122B-A10B-FP8	ctx_pp @ d4096	3006.82 ± 177.43		1376.05 ± 84.38	1367.66 ± 84.38	1376.21 ± 84.34
Qwen/Qwen3.5-122B-A10B-FP8	ctx_tg @ d4096	31.28 ± 0.12	32.29 ± 0.13
Qwen/Qwen3.5-122B-A10B-FP8	pp2048 @ d4096	1517.78 ± 19.97		1357.96 ± 17.82	1349.57 ± 17.82	1358.38 ± 17.57
Qwen/Qwen3.5-122B-A10B-FP8	tg32 @ d4096	31.92 ± 0.65	32.96 ± 0.67
Qwen/Qwen3.5-122B-A10B-FP8	ctx_pp @ d8192	3577.72 ± 15.48		2298.62 ± 10.15	2290.24 ± 10.15	2298.78 ± 10.10
Qwen/Qwen3.5-122B-A10B-FP8	ctx_tg @ d8192	31.44 ± 0.55	32.46 ± 0.57
Qwen/Qwen3.5-122B-A10B-FP8	pp2048 @ d8192	1507.92 ± 4.20		1366.55 ± 3.78	1358.17 ± 3.78	1366.71 ± 3.78
Qwen/Qwen3.5-122B-A10B-FP8	tg32 @ d8192	30.94 ± 0.05	31.33 ± 0.47
Qwen/Qwen3.5-122B-A10B-FP8	ctx_pp @ d16384	3441.90 ± 16.78		4768.95 ± 23.25	4760.57 ± 23.25	4769.09 ± 23.26
Qwen/Qwen3.5-122B-A10B-FP8	ctx_tg @ d16384	31.14 ± 0.81	31.78 ± 1.10
Qwen/Qwen3.5-122B-A10B-FP8	pp2048 @ d16384	1424.31 ± 4.03		1446.28 ± 4.06	1437.90 ± 4.06	1446.50 ± 4.06
Qwen/Qwen3.5-122B-A10B-FP8	tg32 @ d16384	30.59 ± 0.09	31.00 ± 0.00
Qwen/Qwen3.5-122B-A10B-FP8	ctx_pp @ d65535	2877.89 ± 16.82		22781.44 ± 132.84	22773.05 ± 132.84	22781.57 ± 132.92
Qwen/Qwen3.5-122B-A10B-FP8	ctx_tg @ d65535	29.00 ± 0.09	29.67 ± 0.47
Qwen/Qwen3.5-122B-A10B-FP8	pp2048 @ d65535	1223.28 ± 22.50		1683.15 ± 31.21	1674.77 ± 31.21	1683.27 ± 31.18
Qwen/Qwen3.5-122B-A10B-FP8	tg32 @ d65535	29.13 ± 0.84	29.67 ± 0.94
Qwen/Qwen3.5-122B-A10B-FP8	ctx_pp @ d100000	2443.81 ± 78.48		40971.69 ± 1344.62	40963.31 ± 1344.62	40971.91 ± 1344.73
Qwen/Qwen3.5-122B-A10B-FP8	ctx_tg @ d100000	28.49 ± 0.41	29.33 ± 0.47
Qwen/Qwen3.5-122B-A10B-FP8	pp2048 @ d100000	830.49 ± 12.11		2474.92 ± 36.20	2466.53 ± 36.20	2475.06 ± 36.20
Qwen/Qwen3.5-122B-A10B-FP8	tg32 @ d100000	28.03 ± 0.11	29.00 ± 0.00
Qwen/Qwen3.5-122B-A10B-FP8	ctx_pp @ d200000	1835.82 ± 11.99		108956.87 ± 711.41	108948.49 ± 711.41	108957.28 ± 711.62
Qwen/Qwen3.5-122B-A10B-FP8	ctx_tg @ d200000	24.93 ± 0.06	26.00 ± 0.00
Qwen/Qwen3.5-122B-A10B-FP8	pp2048 @ d200000	609.40 ± 5.58		3369.37 ± 30.79	3360.98 ± 30.79	3370.14 ± 30.12
Qwen/Qwen3.5-122B-A10B-FP8	tg32 @ d200000	24.41 ± 0.20	25.00 ± 0.00

llama-benchy (0.3.1)
date: 2026-02-25 12:27:44 | latency mode: api
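
To put the context-depth numbers in perspective, the relative slowdown can be computed directly from the table. The sketch below compares the zero-depth rows with the d200000 rows; `pct_drop` is a name made up here.

```shell
#!/usr/bin/env bash
# Percentage drop from a baseline throughput to a later one.
pct_drop() {
  awk -v base="$1" -v now="$2" 'BEGIN { printf "%.1f\n", (1 - now / base) * 100 }'
}

# Values copied from the table: tg32 and pp2048 at depth 0 vs. d200000.
echo "tg32 drop:   $(pct_drop 30.10 24.41)%"
echo "pp2048 drop: $(pct_drop 2367.59 609.40)%"
```

So token generation loses roughly a fifth of its speed at 200K depth, while prefill of a 2048-token chunk loses roughly three quarters.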

cosinus · February 25, 2026, 9:08pm

Seems that the AWQ versions need a patch. The vllm-node version failed in the rope utils for me; the standard nightly worked.

# upgrade transformers so that applications could properly execute tool calls
pip install -U "transformers @ git+https://github.com/huggingface/transformers.git@f2ba019"
# locate modeling_rope_utils.py line 651 to fix a simple bug
TF_FILE="$(python -m pip show transformers | awk -F': ' '/^Location:/{print $2}')/transformers/modeling_rope_utils.py" && echo "$TF_FILE"
NEW_LINE='            ignore_keys_at_rope_validation = set(ignore_keys_at_rope_validation) | {"partial_rotary_factor"}' \
perl -i.bak -pe 'if ($. == 651) { $_ = $ENV{NEW_LINE} . "\n" }' "$TF_FILE"

Found that on QuantTrio/Qwen3.5-122B-A10B-AWQ · Hugging Face

and as deps

vllm>=0.16.0rc2.dev447
transformers>=5.3.0.dev0

Will look into that tomorrow.
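
Because the patch overwrites a hard-coded line number, it may be worth looking at that line before running the perl one-liner. This sketch assumes TF_FILE is set as in the snippet above; `show_line` is a made-up helper, and whether line 651 is still the right target depends on the transformers commit that is installed.

```shell
#!/usr/bin/env bash
# Print line N of a file so it can be inspected before being replaced.
show_line() {
  sed -n "${2}p" "$1"
}

if [ -n "${TF_FILE:-}" ] && [ -f "$TF_FILE" ]; then
  # Assumption: the original line mentions ignore_keys_at_rope_validation.
  show_line "$TF_FILE" 651
else
  echo "TF_FILE is not set; run the locate step first" >&2
fi
```

If the replacement goes wrong, the `-i.bak` option in the perl command keeps the original file next to the patched one as modeling_rope_utils.py.bak.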

eugr · February 25, 2026, 9:09pm

Ah, so this needs to run with transformers 5 then.
Need to build vllm-node with the --tf5 flag.
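
To confirm which transformers major version ended up in a build, something like the following could be run inside the container. It is a sketch that assumes python3 is on the PATH there; `tf_major` is a made-up helper.

```shell
#!/usr/bin/env bash
# Extract the major version from a version string such as 5.3.0.dev0.
tf_major() {
  printf '%s\n' "${1%%.*}"
}

if ver="$(python3 -c 'import transformers; print(transformers.__version__)' 2>/dev/null)"; then
  echo "transformers $ver (major $(tf_major "$ver"))"
else
  echo "transformers is not importable in this environment" >&2
fi
```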

ptichalouf · February 25, 2026, 9:30pm

I’ll try that and report back. I’ll build vllm-node with the --tf5 flag :)

stefan132 · February 25, 2026, 9:30pm

Simple stupid question: what does that mean for eugr/spark-vllm-docker and ./build-and-copy.sh?

./build-and-copy.sh --tf5 ??
