Yes, 27B is a dense one, so I wouldn’t expect more than ~15-16 t/s on a single Spark for 4-bit quant.
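For anyone curious where that number comes from, here's a back-of-envelope sanity check (my own numbers, not measured): dense-model decode is roughly memory-bandwidth bound, so every generated token has to stream all the weights. Assuming ~273 GB/s unified-memory bandwidth on the Spark and ~4.5 effective bits/weight for a 4-bit K-quant:

```shell
# Rough decode ceiling for a dense 27B model on a DGX Spark.
# Bandwidth and bits/weight are assumptions, so treat this as ballpark.
awk 'BEGIN {
  bw = 273             # GB/s, approximate unified-memory bandwidth
  gb = 27 * 4.5 / 8    # ~15 GB of weights streamed per generated token
  printf "theoretical max: ~%.0f t/s\n", bw / gb
}'
# prints: theoretical max: ~18 t/s
```

Real-world numbers land below the theoretical ceiling, which is consistent with the ~15-16 t/s estimate.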
He just updated the weights and renamed the model: cyankiwi/Qwen3.5-27B-AWQ-BF16-INT4 · Hugging Face
Dropping off for today (nearing midnight here in Germany). Maybe Alibaba will drop the FP8 overnight. :-D
First benchmarks looking good:
I'm trying this one personally: txn545/Qwen3.5-122B-A10B-NVFP4 · Hugging Face
Gonna wait for that INT4 + Autoround quant here.
Shape error in the weights: torch.Size([256, 3072]) vs torch.Size([256, 1536]). The NVFP4 checkpoint of txn545 appears to be incorrectly quantized (probably a TP=2 vs TP=1 issue).
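If anyone wants to check a checkpoint's tensor shapes without loading it, the safetensors format makes that cheap: each file starts with an 8-byte little-endian header length, followed by a JSON header mapping tensor names to dtype and shape. A quick sketch (the shard filename below is just a placeholder):

```shell
# Print tensor names, dtypes, and shapes from a safetensors shard without
# loading any weights. Replace the filename with an actual shard from the
# downloaded checkpoint.
python3 - model-00001-of-00025.safetensors <<'EOF'
import json, struct, sys

path = sys.argv[1]
with open(path, "rb") as f:
    (hlen,) = struct.unpack("<Q", f.read(8))  # header length, little-endian u64
    header = json.loads(f.read(hlen))

for name, meta in sorted(header.items()):
    if name != "__metadata__":
        print(name, meta["dtype"], meta["shape"])
EOF
```

Comparing the printed shapes against a known-good quant of the same model should make a TP=2 vs TP=1 split (halved dimensions) obvious.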
The unsloth UD-Q4_K_XL runs fine on my DGX Spark, using about 87 GB of RAM and doing about 24 t/s.
It works fine with opencode after I pulled and compiled the latest llama.cpp; I had some tool-call errors before I did that.
From a couple of quick tests it seems smart, but it spends a lot of time thinking and making tool calls compared to GLM 4.7 Flash Q8, which is my go-to at the moment.
```shell
HF_MODEL="unsloth/Qwen3.5-122B-A10B-GGUF:UD-Q4_K_XL"
./llama.cpp/build-cuda/bin/llama-server \
  -hf "$HF_MODEL" \
  --ctx-size 0 \
  --temp 1.0 \
  --top-p 0.95 \
  --top-k 20 \
  --min-p 0.01 \
  --batch-size 4096 \
  --ubatch-size 4096 \
  -np 1 \
  --host 0.0.0.0 \
  --port 8080
```
Note that I jammed this config together based on the unsloth notes and some of my other configs; it's probably not optimal.
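In case it helps anyone getting started: llama-server exposes an OpenAI-compatible API on the port given with `--port`, so a quick smoke test looks like this (assumes the server above is running locally on 8080):

```shell
# Minimal chat-completion request against llama-server's
# OpenAI-compatible endpoint.
curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "Say hello in one word."}],
    "max_tokens": 16
  }'
```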
Looks like Q8 would probably be too big for a single Spark, unfortunately.
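Rough weights-only math backs that up (the bits-per-weight values are my approximations for llama.cpp quants, and real usage adds KV cache and runtime overhead on top):

```shell
# Weights-only size estimate: params (B) * bits-per-weight / 8 = GB.
# Bits/weight are rough averages, so treat these as ballpark figures.
for q in "Q8_0 8.5" "UD-Q4_K_XL 4.9"; do
  set -- $q
  awk -v name="$1" -v bits="$2" 'BEGIN {
    printf "%s: ~%.0f GB of weights\n", name, 122 * bits / 8
  }'
done
# prints:
# Q8_0: ~130 GB of weights
# UD-Q4_K_XL: ~75 GB of weights
```

~130 GB of Q8 weights alone won't fit in the Spark's 128 GB, while ~75 GB for the Q4 plus KV cache and overhead lines up with the ~87 GB observed above.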
Shameless plug: vLLM-friendly MXFP4 quant here:
What’s the current state of MXFP4 kernels vs NVFP4? I’ve noticed a lot of progress, but it’s hard to find a technical summary of the improvements.
Eugr, could you please point me in the right direction?
New NVFP4 option Sehyo/Qwen3.5-122B-A10B-NVFP4 · Hugging Face
FP8s are out!
yay!
35B-A3B and 122B-A10B are also available as AWQ from cyankiwi by now.
Tested the 35B-A3B - seems to work. I tested against the container version of the nightly build on an RTX 5090. Will give the 122B-A10B a try now; might take some time to download.
Awesome
AWQ quant is not performing well for some reason. This is what I got yesterday:
| model | test | t/s | peak t/s | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |
|---|---|---|---|---|---|---|
| cyankiwi/Qwen3.5-35B-A3B-AWQ-4bit | pp2048 | 4930.29 ± 51.08 | | 419.89 ± 4.14 | 415.57 ± 4.14 | 419.99 ± 4.13 |
| cyankiwi/Qwen3.5-35B-A3B-AWQ-4bit | tg32 | 39.76 ± 0.02 | 41.07 ± 0.05 | | | |
llama-benchy (0.3.1)
date: 2026-02-24 23:11:38 | latency mode: api
Will try FP8 version soon.
With MTP enabled? If this even matters for the bench.
@eugr @raphael.amorim
Does anyone have a recipe for the FP8 model?
No, haven’t tried MTP yet. I did try it on cyankiwi quant for 27B model, and the performance was about the same as without it.
Soon, but you should be able to run it like this now:
```shell
./launch-cluster.sh --solo exec vllm serve Qwen/Qwen3.5-35B-A3B-FP8 \
  --max-model-len 262144 \
  --gpu-memory-utilization 0.7 \
  --port 8888 --host 0.0.0.0 \
  --load-format fastsafetensors \
  --enable-prefix-caching \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder
```
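Since that command enables auto tool choice, one quick way to check the tool-call parsing is wired up is to send a request with a dummy tool. The tool name and schema here are made up for illustration (assumes the server above is listening on port 8888):

```shell
# Tool-call smoke test against vLLM's OpenAI-compatible endpoint.
# "get_weather" is a placeholder tool, not a real function.
curl -s http://localhost:8888/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3.5-35B-A3B-FP8",
    "messages": [{"role": "user", "content": "What is the weather in Berlin?"}],
    "tools": [{
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Get current weather for a city",
        "parameters": {
          "type": "object",
          "properties": {"city": {"type": "string"}},
          "required": ["city"]
        }
      }
    }]
  }'
```

If the parser is working, the response should contain a `tool_calls` entry rather than plain text with the raw tool markup.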
This one performs better:
| model | test | t/s | peak t/s | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |
|---|---|---|---|---|---|---|
| Qwen/Qwen3.5-35B-A3B-FP8 | pp2048 | 4089.12 ± 113.21 | | 505.46 ± 14.15 | 501.48 ± 14.15 | 505.59 ± 14.20 |
| Qwen/Qwen3.5-35B-A3B-FP8 | tg32 | 50.31 ± 0.90 | 51.95 ± 0.94 | | | |
llama-benchy (0.3.1)
date: 2026-02-25 09:43:58 | latency mode: api
FP8 seems like the best tradeoff for each model on the Spark.
