Qwen/Qwen3.5-122B-A10B - Alibaba/Qwen thought about us... :-D

Yes, 27B is a dense one, so I wouldn’t expect more than ~15-16 t/s on a single Spark for 4-bit quant.
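That estimate lines up with a back-of-the-envelope bandwidth check: decode on a dense model is memory-bandwidth-bound, since every generated token streams the full weights. A minimal sketch, assuming the Spark's roughly 273 GB/s memory bandwidth and ~4.5 effective bits per weight for a 4-bit quant (scales included):

```python
# Rough decode-speed ceiling for a dense model:
# t/s ≈ memory bandwidth / bytes of weights read per token.
params_b = 27          # 27B dense parameters
bits_per_weight = 4.5  # ~4.5 effective bits for a 4-bit quant (assumed)
bandwidth_gbs = 273    # DGX Spark memory bandwidth in GB/s (assumed)

weight_gb = params_b * bits_per_weight / 8
tps = bandwidth_gbs / weight_gb
print(f"~{weight_gb:.1f} GB of weights -> ~{tps:.0f} t/s upper bound")
# prints: ~15.2 GB of weights -> ~18 t/s upper bound
```

Real-world ~15-16 t/s sits just under that theoretical ~18 t/s ceiling, which is about what you'd expect.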

He just updated the weights and renamed the model: cyankiwi/Qwen3.5-27B-AWQ-BF16-INT4 · Hugging Face

Dropping off for today (nearing midnight over here in Germany). Maybe Alibaba will drop the FP8 overnight. :-D

First benchmarks are looking good:

I'm personally trying this one: txn545/Qwen3.5-122B-A10B-NVFP4 · Hugging Face


Gonna wait for that INT4 + AutoRound quant here.


Shape error in the weights: torch.Size([256, 3072]) vs torch.Size([256, 1536]). The NVFP4 checkpoint of txn545 appears to be incorrectly quantized (probably a TP=2 vs TP=1 issue).
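A quick way to catch this kind of breakage before a full load attempt is to diff the checkpoint's tensor shapes against what the model config implies. A minimal sketch of the pattern (the tensor name and shapes below are illustrative, not the actual checkpoint's):

```python
# Compare a checkpoint's stored tensor shapes against expected shapes
# to spot mis-quantized weights (e.g. a TP=2 shard saved as TP=1,
# which shows up as a halved dimension).
def find_shape_mismatches(expected: dict, actual: dict) -> dict:
    """Return {name: (expected_shape, actual_shape)} for each tensor that differs."""
    return {
        name: (expected[name], actual[name])
        for name in expected
        if name in actual and expected[name] != actual[name]
    }

# Hypothetical tensor name; the second shape is halved, consistent with a TP shard.
expected = {"layers.0.mlp.gate_proj.weight_scale": (256, 3072)}
actual = {"layers.0.mlp.gate_proj.weight_scale": (256, 1536)}
print(find_shape_mismatches(expected, actual))
```

In practice you can read the shapes cheaply from the safetensors header without loading any weight data.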

The Unsloth UD-Q4_K_XL runs fine on my DGX Spark, using about 87 GB of RAM and doing about 24 t/s.

It works fine with opencode after I pulled and compiled the latest llama.cpp; I had some tool-call errors before I did that.

From a couple of quick tests it seems smart, but it spends a lot of time thinking and making tool calls compared to GLM 4.7 Flash Q8, which is my go-to at the moment.

HF_MODEL="unsloth/Qwen3.5-122B-A10B-GGUF:UD-Q4_K_XL"


./llama.cpp/build-cuda/bin/llama-server \
  -hf "$HF_MODEL" \
  --ctx-size 0 \
  --temp 1.0 \
  --top-p 0.95 \
  --top-k 20 \
  --min-p 0.01 \
  --batch-size 4096 \
  --ubatch-size 4096 \
  -np 1 \
  --host 0.0.0.0 \
  --port 8080

Note that I jammed this config together based on the Unsloth notes and some of my other configs; it's probably not optimal.

Looks like Q8 would probably be too big for a single Spark, unfortunately.
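Quick arithmetic backs that up, assuming ~8.5 effective bits per weight for Q8 (weights plus scales) and the Spark's ~128 GB of unified memory:

```python
# Rough weight-size estimate for the 122B model at different quant levels.
params_b = 122
q8_gb = params_b * 8.5 / 8   # Q8: ~8.5 effective bits/weight (assumed)
q4_gb = params_b * 4.5 / 8   # UD-Q4_K_XL-ish: ~4.5 effective bits/weight (assumed)
print(f"Q8 ~ {q8_gb:.1f} GB, 4-bit ~ {q4_gb:.1f} GB")
# prints: Q8 ~ 129.6 GB, 4-bit ~ 68.6 GB
```

~130 GB of weights alone exceeds the ~128 GB of unified memory, before the OS and KV cache take their share, while the 4-bit figure is consistent with the ~87 GB observed above (weights plus context).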

Shameless plug: a vLLM-friendly MXFP4 quant here:

What’s the current state of MXFP4 kernels vs NVFP4? I’ve noticed a lot of progress, but it’s hard to find a technical summary of the improvements.

Eugr, could you please point me in the right direction?


New NVFP4 option Sehyo/Qwen3.5-122B-A10B-NVFP4 · Hugging Face


I am waiting for the official FP8 :)

You will find my tests on the Spark arena here:


FP8s are out!

yay!


35B-A3B and 122B-A10B are also available as AWQ from cyankiwi by now.

Tested the 35B-A3B; it seems to work. I tested against the container version of the nightly build on an RTX 5090. I'll give the 122B-A10B a try now; it might take some time to download.

Awesome

The AWQ quant is not performing well for some reason. This is what I got yesterday:

model                              test    t/s              peak t/s      ttfr (ms)      est_ppt (ms)   e2e_ttft (ms)
cyankiwi/Qwen3.5-35B-A3B-AWQ-4bit  pp2048  4930.29 ± 51.08                419.89 ± 4.14  415.57 ± 4.14  419.99 ± 4.13
cyankiwi/Qwen3.5-35B-A3B-AWQ-4bit  tg32    39.76 ± 0.02     41.07 ± 0.05

llama-benchy (0.3.1)
date: 2026-02-24 23:11:38 | latency mode: api

Will try FP8 version soon.

With MTP enabled? If that even matters for this bench.

@eugr @raphael.amorim
Does anyone have a recipe for the FP8 model?

No, haven’t tried MTP yet. I did try it on cyankiwi quant for 27B model, and the performance was about the same as without it.

Soon, but you should be able to run it like this now:

./launch-cluster.sh --solo exec vllm serve Qwen/Qwen3.5-35B-A3B-FP8 \
  --max-model-len 262144 \
  --gpu-memory-utilization 0.7 \
  --port 8888 \
  --host 0.0.0.0 \
  --load-format fastsafetensors \
  --enable-prefix-caching \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder
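Once the server is up, it speaks the standard OpenAI-compatible API. A minimal client sketch (the payload shape is standard; port 8888 matches the command above, and you'd POST it to `/v1/chat/completions`):

```python
# Build a chat-completions request for the vLLM OpenAI-compatible endpoint.
# Sending it requires the server to be running; only the payload is shown here.
import json

payload = {
    "model": "Qwen/Qwen3.5-35B-A3B-FP8",
    "messages": [{"role": "user", "content": "Say hi in one word."}],
    "temperature": 1.0,
    "top_p": 0.95,
}
body = json.dumps(payload)
# e.g.: POST http://localhost:8888/v1/chat/completions
#       with header Content-Type: application/json and this body
print(len(body) > 0)
```

With `--enable-auto-tool-choice` and the `qwen3_coder` parser enabled, you can also pass a `tools` array in the same payload and the server will emit structured tool calls.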

This one performs better:

model                     test    t/s               peak t/s      ttfr (ms)       est_ppt (ms)    e2e_ttft (ms)
Qwen/Qwen3.5-35B-A3B-FP8  pp2048  4089.12 ± 113.21                505.46 ± 14.15  501.48 ± 14.15  505.59 ± 14.20
Qwen/Qwen3.5-35B-A3B-FP8  tg32    50.31 ± 0.90      51.95 ± 0.94

llama-benchy (0.3.1)
date: 2026-02-25 09:43:58 | latency mode: api


FP8 seems to be the best tradeoff for these models on the Spark.