And yet another candidate to be tested. Still struggling to get stable performance in terms of tool calling with Qwen3.5 (may be I just missed a fix) and/or Gemma4… and they pushed out already the next. 😅
Qwen3.6 Highlights
This release delivers substantial upgrades, particularly in
Agentic Coding: the model now handles frontend workflows and repository-level reasoning with greater fluency and precision.
Thinking Preservation: we’ve introduced a new option to retain reasoning context from historical messages, streamlining iterative development and reducing overhead.
Sounds very promising. Qwen3_5MoeForConditionalGeneration - same architecture as 3.5 - so vLLM should be ready too.
Hope they release a 3.6 - 122B Version, the 3.5 one has been running great so far.
I’m curious if this suffers from the hypothesized AdamW weight scaling “bug” like the 3.5 version.
Looking forward to testing it out. Might even try front running Intel with an Autoround quant.
It’ll also be interesting to see if the 3.5 DFlash is usable with 3.6 - especially once DDTree becomes available.
They did a vote on X. The majority voted for the dense 27B.
I voted for 122B btw. 😉
So I’m surprised that they start with the 35B.
Here we go for 2x DGX Spark performance (revised):
vllm serve Qwen/Qwen3.6-35B-A3B-FP8 \
--host 0.0.0.0 \
--port 8080 \
--gpu-memory-utilization 0.8 \
--max-model-len 262144 \
--max-num-batched-tokens 8192 \
--max-num-seqs 4 \
--enable-prefix-caching \
--enable-chunked-prefill \
--attention-backend flashinfer \
--load-format instanttensor \
--trust-remote-code \
--dtype auto \
--kv-cache-dtype fp8 \
--enable-auto-tool-choice \
--tool-call-parser qwen3_coder \
--reasoning-parser qwen3 \
--tensor-parallel-size 2 \
--distributed-executor-backend ray
Benchmarks:
100% successful completion at ToolCall-15 .
| model | test | t/s | peak t/s | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |
|:-------------------------|-----------------:|-----------------:|-------------:|-----------------:|-----------------:|-----------------:|
| Qwen/Qwen3.6-35B-A3B-FP8 | pp2048 | 7824.25 ± 162.29 | | 263.59 ± 5.42 | 261.95 ± 5.42 | 263.65 ± 5.42 |
| Qwen/Qwen3.6-35B-A3B-FP8 | tg128 | 77.74 ± 0.44 | 78.33 ± 0.47 | | | |
| Qwen/Qwen3.6-35B-A3B-FP8 | pp2048 @ d4096 | 8496.23 ± 73.66 | | 724.88 ± 6.36 | 723.24 ± 6.36 | 724.95 ± 6.36 |
| Qwen/Qwen3.6-35B-A3B-FP8 | tg128 @ d4096 | 76.44 ± 0.09 | 77.00 ± 0.00 | | | |
| Qwen/Qwen3.6-35B-A3B-FP8 | pp2048 @ d8192 | 8403.24 ± 38.07 | | 1220.28 ± 5.59 | 1218.64 ± 5.59 | 1220.35 ± 5.59 |
| Qwen/Qwen3.6-35B-A3B-FP8 | tg128 @ d8192 | 75.76 ± 0.07 | 76.00 ± 0.00 | | | |
| Qwen/Qwen3.6-35B-A3B-FP8 | pp2048 @ d16384 | 8217.19 ± 12.29 | | 2244.87 ± 3.36 | 2243.23 ± 3.36 | 2244.93 ± 3.37 |
| Qwen/Qwen3.6-35B-A3B-FP8 | tg128 @ d16384 | 74.79 ± 0.08 | 75.33 ± 0.47 | | | |
| Qwen/Qwen3.6-35B-A3B-FP8 | pp2048 @ d32768 | 7433.69 ± 7.82 | | 4685.37 ± 4.98 | 4683.73 ± 4.98 | 4685.42 ± 4.97 |
| Qwen/Qwen3.6-35B-A3B-FP8 | tg128 @ d32768 | 73.40 ± 0.07 | 74.00 ± 0.00 | | | |
| Qwen/Qwen3.6-35B-A3B-FP8 | pp2048 @ d65536 | 6310.26 ± 8.14 | | 10712.00 ± 13.83 | 10710.35 ± 13.83 | 10712.06 ± 13.84 |
| Qwen/Qwen3.6-35B-A3B-FP8 | tg128 @ d65536 | 69.90 ± 0.04 | 71.00 ± 0.00 | | | |
| Qwen/Qwen3.6-35B-A3B-FP8 | pp2048 @ d131072 | 4672.69 ± 15.40 | | 28491.11 ± 93.91 | 28489.47 ± 93.91 | 28491.18 ± 93.92 |
| Qwen/Qwen3.6-35B-A3B-FP8 | tg128 @ d131072 | 64.28 ± 0.41 | 65.33 ± 0.47 | | | |
llama-benchy (0.3.5)
date: 2026-04-16 17:59:04 | latency mode: api
First quick benchmark on my side. I compared R1/R2 against my Qwen3.5 baseline.
The pp128 results look clearly better in most concurrency settings, while tg256 is more mixed at higher concurrency.
Concurrency
pp128 R1
pp128 R2
pp128 Qwen3.5
Δ R2 vs Qwen3.5
tg256 R1
tg256 R2
tg256 Qwen3.5
Δ R2 vs Qwen3.5
c1
768*
1029
1013
+2%
49.3
48.3
45.5
+6%
c4
1592
2321
1662
+40%
122.9
115.8
-–
-–
c8
2729
2635
2342
+13%
168.5
164.6
-–
-–
c16
3702
3646
3262
+12%
221.7
215.2
222.9
-3%
c24
4230
3929
3663
+7%
283.0
242.4
264.8
-8%
* The c1 pp128 R1 value looks like a warm-up/outlier run.
Dude. Did you loosen the handbrake?
model
test
t/s
peak t/s
ttfr (ms)
est_ppt (ms)
e2e_ttft (ms)
Qwen/Qwen3.6-35B-A3B-FP8
pp2048
4037.44 ± 1512.51
628.56 ± 319.65
627.25 ± 319.65
628.65 ± 319.68
Qwen/Qwen3.6-35B-A3B-FP8
tg32
52.73 ± 0.08
54.44 ± 0.08
Qwen/Qwen3.6-35B-A3B-FP8
pp2048 @ d1024
4445.84 ± 710.55
713.11 ± 128.38
711.80 ± 128.38
713.22 ± 128.37
Qwen/Qwen3.6-35B-A3B-FP8
tg32 @ d1024
52.39 ± 0.08
54.09 ± 0.08
Qwen/Qwen3.6-35B-A3B-FP8
pp2048 @ d2048
5346.63 ± 53.67
767.47 ± 7.64
766.17 ± 7.64
767.57 ± 7.65
Qwen/Qwen3.6-35B-A3B-FP8
tg32 @ d2048
52.26 ± 0.13
53.96 ± 0.13
Qwen/Qwen3.6-35B-A3B-FP8
pp2048 @ d4096
5585.74 ± 306.96
1104.89 ± 63.03
1103.59 ± 63.03
1104.99 ± 63.06
Qwen/Qwen3.6-35B-A3B-FP8
tg32 @ d4096
52.31 ± 0.16
54.01 ± 0.17
Qwen/Qwen3.6-35B-A3B-FP8
pp2048 @ d8192
6212.15 ± 31.00
1649.78 ± 8.33
1648.48 ± 8.33
1649.89 ± 8.33
Qwen/Qwen3.6-35B-A3B-FP8
tg32 @ d8192
52.34 ± 0.09
54.04 ± 0.09
Qwen/Qwen3.6-35B-A3B-FP8
pp2048 @ d16384
5333.65 ± 16.42
3457.38 ± 10.60
3456.08 ± 10.60
3457.49 ± 10.61
Qwen/Qwen3.6-35B-A3B-FP8
tg32 @ d16384
52.15 ± 0.05
53.84 ± 0.05
Single Spark. Slightly modified 3.5 recipe by eugr (took the template out and the mods for testing).
# Recipe: Qwen/Qwen3.6-35B-A3B-FP8
# Qwen/Qwen3.6-35B-A3B model in native FP8 format
recipe_version: "1"
name: Qwen35-35B-A3B
description: vLLM serving Qwen3.6-35B-A3B-FP8
# HuggingFace model to download (optional, for --download-model)
model: Qwen/Qwen3.6-35B-A3B-FP8
solo_only: true
# Container image to use
container: vllm-node
# Mods
mods: []
# Default settings (can be overridden via CLI)
defaults:
port: 8000
host: 0.0.0.0
tensor_parallel: 1
gpu_memory_utilization: 0.7
max_model_len: 262144
max_num_batched_tokens: 16384
# Environment variables
env:
VLLM_MARLIN_USE_ATOMIC_ADD: 1
# The vLLM serve command template
command: |
vllm serve Qwen/Qwen3.6-35B-A3B-FP8 \
--host {host} \
--port {port} \
--max-model-len {max_model_len} \
--max-num-batched-tokens {max_num_batched_tokens} \
--gpu-memory-utilization {gpu_memory_utilization} \
--enable-auto-tool-choice \
--tool-call-parser qwen3_coder \
--reasoning-parser qwen3 \
--kv-cache-dtype fp8 \
--load-format fastsafetensors \
--attention-backend flashinfer \
--enable-prefix-caching
vLLM 0.19.1rc1.dev337+g17d87168d.d20260416 (just did a rebuild with latest wheels).
Toolcall-15 looks interesting. Will have a look into that.
Exactly what I am running and building now… you beat me to it. :-)
I do see you have tensor_parallel: 1 but removed it out of the command -tp {tensor_parallel} \
Got only 97%:
Qwen/Qwen3.6-35B-A3B-FP8: 97/100 (29/30) ★★★★★ Excellent with my approach.
-tp makes only sense if you use more than one GPU. For just one GPU it is not needed.
Not sure what happened, but here we go:
| model | test | t/s | peak t/s | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |
|:-------------------------|---------------:|------------------:|-------------:|----------------:|----------------:|----------------:|
| Qwen/Qwen3.6-35B-A3B-FP8 | pp2048 | 5689.70 ± 2687.84 | | 542.93 ± 384.18 | 541.61 ± 384.18 | 543.00 ± 384.18 |
| Qwen/Qwen3.6-35B-A3B-FP8 | tg128 | 75.11 ± 3.16 | 78.00 ± 0.00 | | | |
| Qwen/Qwen3.6-35B-A3B-FP8 | pp2048 @ d4096 | 7387.81 ± 1504.92 | | 875.42 ± 208.01 | 874.10 ± 208.01 | 875.49 ± 208.01 |
| Qwen/Qwen3.6-35B-A3B-FP8 | tg128 @ d4096 | 76.43 ± 0.05 | 77.00 ± 0.00 | | | |
| Qwen/Qwen3.6-35B-A3B-FP8 | pp2048 @ d8192 | 8351.93 ± 24.08 | | 1227.52 ± 3.63 | 1226.19 ± 3.63 | 1227.57 ± 3.64 |
| Qwen/Qwen3.6-35B-A3B-FP8 | tg128 @ d8192 | 76.12 ± 0.16 | 77.00 ± 0.00 | | | |
llama-benchy (0.3.5)
date: 2026-04-16 17:58:00 | latency mode: api
EDIT:
The delta to my previously very unimpressive result is that I took out MTP: --speculative-config '{"method":"mtp","num_speculative_tokens":2}' – I am testing this but didn’t get a performance speed-up with it. Quite the contrary.
OK. Test of VS Code CoPilot failed due to an ongoing GitHub incident:
Update - We found an issue that impacts 70% of Codespaces. We are engaged with the provider and working towards mitigation.
Apr 16, 2026 - 15:49 UTC
Update - We are experiencing degraded performance in Codespaces related to creating a new Codespace or starting an existing Codespace from the VS Code editor. SSH connections to Codespaces are not impacted. We are working toward mitigation and will continue to keep you updated on progress.
Falling back to Claude Code / opencode for testing.
…Why do I need Codespaces?
--speculative-config '{"method":"mtp","num_speculative_tokens":2}' may have been the culprit. Did you get MTP to work?
Funnily enough it is now failing TC14 – 97% at ToolCall-15.
awesome i need to benchmark it !
What a time to have local AI haha
The outage is definitely not restricted to codespaces. Lots of VS Code stuff has been broken on and off all day (I filed an issue earlier ).
I have some evals running on this model and will update GitHub - DanTup/spark-evals: Some benchmark results of small models and quants that fit on DGX Spark · GitHub as they complete (they’re going to take many many hours though).
dbsci
April 16, 2026, 4:54pm
16
Still working on iteration to figure out the best recipe, but recipe is added to Spark Arena repos and usable via sparkrun .
Update Registries
sparkrun update
Run Recipe
sparkrun run qwen3.6-35b-a3b-fp8-vllm
Feel free to suggest improvements to the official recipe via PR: GitHub - spark-arena/recipe-registry: Official Spark Arena Recipe Registry · GitHub .
Work on quants & MTP are ongoing.
The sparkrun recipe won’t be groundbreaking to the regulars here, but it’s there to make it easily accessible.
Wow. It works flawlessly here. The code seems really good. When I enter two prompts, I get this. The only issue is that German spelling isn’t perfect with these model sizes. The same problem occurs with the “Queen 3.5” models as well. The larger models can spell German correctly.
Qwen/Qwen3.6-35B-A3B-FP8 · Hugging Face Love how they are throwing in all of the Claw & Agent benchmarks now as well :-)
Would --tensor-parallel-size 8 help as per the readme? I tried it and crashed.
Poor ticket system… don’t cry… the Jens is cheating with AI Agents…
OpenCode seems to like this model, too.