Qwen/Qwen3.5-122B-A10B - Alibaba/Qwen thought about us... :-D

Yes, 27B is a dense one, so I wouldn’t expect more than ~15-16 t/s on a single Spark for 4-bit quant.
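That estimate lines up with a back-of-the-envelope bandwidth check: decode on a dense model is memory-bandwidth-bound, since every generated token streams the full weights. A minimal sketch, assuming the Spark's roughly 273 GB/s memory bandwidth and ~4.5 effective bits per weight for a 4-bit quant (scales included):

```python
# Rough decode-speed ceiling for a dense model:
# t/s ≈ memory bandwidth / bytes of weights read per token.
params_b = 27          # 27B dense parameters
bits_per_weight = 4.5  # ~4.5 effective bits for a 4-bit quant (assumed)
bandwidth_gbs = 273    # DGX Spark memory bandwidth in GB/s (assumed)

weight_gb = params_b * bits_per_weight / 8
tps = bandwidth_gbs / weight_gb
print(f"~{weight_gb:.1f} GB of weights -> ~{tps:.0f} t/s upper bound")
# prints: ~15.2 GB of weights -> ~18 t/s upper bound
```

Real-world ~15-16 t/s sits just under that theoretical ~18 t/s ceiling, which is about what you'd expect.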

He just updated the weights and renamed the model: cyankiwi/Qwen3.5-27B-AWQ-BF16-INT4 · Hugging Face

Dropping off for today (nearing midnight over here in Germany). Maybe Alibaba will drop the FP8 overnight. :-D

First benchmarks are looking good:

I'm personally trying this one: txn545/Qwen3.5-122B-A10B-NVFP4 · Hugging Face


Gonna wait for that INT4 + AutoRound quant here.


Shape error in the weights: torch.Size([256, 3072]) vs torch.Size([256, 1536]). The NVFP4 checkpoint of txn545 appears to be incorrectly quantized (probably a TP=2 vs TP=1 issue).
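A quick way to catch this kind of breakage before a full load attempt is to diff the checkpoint's tensor shapes against what the model config implies. A minimal sketch of the pattern (the tensor name and shapes below are illustrative, not the actual checkpoint's):

```python
# Compare a checkpoint's stored tensor shapes against expected shapes
# to spot mis-quantized weights (e.g. a TP=2 shard saved as TP=1,
# which shows up as a halved dimension).
def find_shape_mismatches(expected: dict, actual: dict) -> dict:
    """Return {name: (expected_shape, actual_shape)} for each tensor that differs."""
    return {
        name: (expected[name], actual[name])
        for name in expected
        if name in actual and expected[name] != actual[name]
    }

# Hypothetical tensor name; the second shape is halved, consistent with a TP shard.
expected = {"layers.0.mlp.gate_proj.weight_scale": (256, 3072)}
actual = {"layers.0.mlp.gate_proj.weight_scale": (256, 1536)}
print(find_shape_mismatches(expected, actual))
```

In practice you can read the shapes cheaply from the safetensors header without loading any weight data.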

The Unsloth UD-Q4_K_XL runs fine on my DGX Spark, using about 87 GB of RAM and doing about 24 t/s.

It works fine with opencode after I pulled and compiled the latest llama.cpp; I had some tool-call errors before I did that.

From a couple of quick tests it seems smart, but it spends a lot of time thinking and making tool calls compared to GLM 4.7 Flash Q8, which is my go-to at the moment.

HF_MODEL="unsloth/Qwen3.5-122B-A10B-GGUF:UD-Q4_K_XL"


./llama.cpp/build-cuda/bin/llama-server \
  -hf "$HF_MODEL" \
  --ctx-size 0 \
  --temp 1.0 \
  --top-p 0.95 \
  --top-k 20 \
  --min-p 0.01 \
  --batch-size 4096 \
  --ubatch-size 4096 \
  -np 1 \
  --host 0.0.0.0 \
  --port 8080

Note that I jammed this config together based on the Unsloth notes and some of my other configs; it's probably not optimal.

Looks like Q8 would probably be too big for a single Spark, unfortunately.
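Quick arithmetic backs that up, assuming ~8.5 effective bits per weight for Q8 (weights plus scales) and the Spark's ~128 GB of unified memory:

```python
# Rough weight-size estimate for the 122B model at different quant levels.
params_b = 122
q8_gb = params_b * 8.5 / 8   # Q8: ~8.5 effective bits/weight (assumed)
q4_gb = params_b * 4.5 / 8   # UD-Q4_K_XL-ish: ~4.5 effective bits/weight (assumed)
print(f"Q8 ~ {q8_gb:.1f} GB, 4-bit ~ {q4_gb:.1f} GB")
# prints: Q8 ~ 129.6 GB, 4-bit ~ 68.6 GB
```

~130 GB of weights alone exceeds the ~128 GB of unified memory, before the OS and KV cache take their share, while the 4-bit figure is consistent with the ~87 GB observed above (weights plus context).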

Shameless plug: a vLLM-friendly MXFP4 quant here:

What’s the current state of MXFP4 kernels vs NVFP4? I’ve noticed a lot of progress, but it’s hard to find a technical summary of the improvements.

Eugr, could you please point me in the right direction?


New NVFP4 option Sehyo/Qwen3.5-122B-A10B-NVFP4 · Hugging Face


I am waiting for the official FP8 :)

You will find my tests on the Spark arena here:


FP8s are out!

yay!


35B-A3B and 122B-A10B are also available as AWQ from cyankiwi by now.

Tested the 35B-A3B; it seems to work. I tested against the container version of the nightly build on an RTX 5090. I'll give the 122B-A10B a try now; it might take some time to download.

Awesome

The AWQ quant is not performing well for some reason. This is what I got yesterday:

model                              test    t/s              peak t/s      ttfr (ms)      est_ppt (ms)   e2e_ttft (ms)
cyankiwi/Qwen3.5-35B-A3B-AWQ-4bit  pp2048  4930.29 ± 51.08                419.89 ± 4.14  415.57 ± 4.14  419.99 ± 4.13
cyankiwi/Qwen3.5-35B-A3B-AWQ-4bit  tg32    39.76 ± 0.02     41.07 ± 0.05

llama-benchy (0.3.1)
date: 2026-02-24 23:11:38 | latency mode: api

Will try FP8 version soon.

With MTP enabled? If that even matters for this bench.

@eugr @raphael.amorim
Does anyone have a recipe for the FP8 model?

No, haven’t tried MTP yet. I did try it on cyankiwi quant for 27B model, and the performance was about the same as without it.

Soon, but you should be able to run it like this now:

./launch-cluster.sh --solo exec vllm serve Qwen/Qwen3.5-35B-A3B-FP8 \
  --max-model-len 262144 \
  --gpu-memory-utilization 0.7 \
  --port 8888 \
  --host 0.0.0.0 \
  --load-format fastsafetensors \
  --enable-prefix-caching \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder
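Once the server is up, it speaks the standard OpenAI-compatible API. A minimal client sketch (the payload shape is standard; port 8888 matches the command above, and you'd POST it to `/v1/chat/completions`):

```python
# Build a chat-completions request for the vLLM OpenAI-compatible endpoint.
# Sending it requires the server to be running; only the payload is shown here.
import json

payload = {
    "model": "Qwen/Qwen3.5-35B-A3B-FP8",
    "messages": [{"role": "user", "content": "Say hi in one word."}],
    "temperature": 1.0,
    "top_p": 0.95,
}
body = json.dumps(payload)
# e.g.: POST http://localhost:8888/v1/chat/completions
#       with header Content-Type: application/json and this body
print(len(body) > 0)
```

With `--enable-auto-tool-choice` and the `qwen3_coder` parser enabled, you can also pass a `tools` array in the same payload and the server will emit structured tool calls.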

This one performs better:

model                     test    t/s               peak t/s      ttfr (ms)       est_ppt (ms)    e2e_ttft (ms)
Qwen/Qwen3.5-35B-A3B-FP8  pp2048  4089.12 ± 113.21                505.46 ± 14.15  501.48 ± 14.15  505.59 ± 14.20
Qwen/Qwen3.5-35B-A3B-FP8  tg32    50.31 ± 0.90      51.95 ± 0.94

llama-benchy (0.3.1)
date: 2026-02-25 09:43:58 | latency mode: api


FP8 seems to be the best tradeoff for these models on the Spark.