Qwen/Qwen3.6-35B-A3B (and FP8) has landed

And yet another candidate to be tested. Still struggling to get stable performance in terms of tool calling with Qwen3.5 (may be I just missed a fix) and/or Gemma4… and they pushed out already the next. 😅

Qwen3.6 Highlights

This release delivers substantial upgrades, particularly in

  • Agentic Coding: the model now handles frontend workflows and repository-level reasoning with greater fluency and precision.
  • Thinking Preservation: we’ve introduced a new option to retain reasoning context from historical messages, streamlining iterative development and reducing overhead.

Sounds very promising. Qwen3_5MoeForConditionalGeneration - same architecture as 3.5 - so vLLM should be ready too.

Hope they release a 3.6 - 122B Version, the 3.5 one has been running great so far.

I’m curious if this suffers from the hypothesized AdamW weight scaling “bug” like the 3.5 version.

Looking forward to testing it out. Might even try front running Intel with an Autoround quant.

It’ll also be interesting to see if the 3.5 DFlash is usable with 3.6 - especially once DDTree becomes available.

They did a vote on X. The majority voted for the dense 27B.

I voted for 122B btw. 😉

So I’m surprised that they start with the 35B.

Here we go for 2x DGX Spark performance (revised):

vllm serve Qwen/Qwen3.6-35B-A3B-FP8 \
    --host 0.0.0.0 \
    --port 8080 \
    --gpu-memory-utilization 0.8 \
    --max-model-len 262144 \
    --max-num-batched-tokens 8192 \
    --max-num-seqs 4 \
    --enable-prefix-caching \
    --enable-chunked-prefill \
    --attention-backend flashinfer \
    --load-format instanttensor \
    --trust-remote-code \
    --dtype auto \
    --kv-cache-dtype fp8 \
    --enable-auto-tool-choice \
    --tool-call-parser qwen3_coder \
    --reasoning-parser qwen3 \
    --tensor-parallel-size 2 \
    --distributed-executor-backend ray

Benchmarks:
100% successful completion at ToolCall-15.

| model                    |             test |              t/s |     peak t/s |        ttfr (ms) |     est_ppt (ms) |    e2e_ttft (ms) |
|:-------------------------|-----------------:|-----------------:|-------------:|-----------------:|-----------------:|-----------------:|
| Qwen/Qwen3.6-35B-A3B-FP8 |           pp2048 | 7824.25 ± 162.29 |              |    263.59 ± 5.42 |    261.95 ± 5.42 |    263.65 ± 5.42 |
| Qwen/Qwen3.6-35B-A3B-FP8 |            tg128 |     77.74 ± 0.44 | 78.33 ± 0.47 |                  |                  |                  |
| Qwen/Qwen3.6-35B-A3B-FP8 |   pp2048 @ d4096 |  8496.23 ± 73.66 |              |    724.88 ± 6.36 |    723.24 ± 6.36 |    724.95 ± 6.36 |
| Qwen/Qwen3.6-35B-A3B-FP8 |    tg128 @ d4096 |     76.44 ± 0.09 | 77.00 ± 0.00 |                  |                  |                  |
| Qwen/Qwen3.6-35B-A3B-FP8 |   pp2048 @ d8192 |  8403.24 ± 38.07 |              |   1220.28 ± 5.59 |   1218.64 ± 5.59 |   1220.35 ± 5.59 |
| Qwen/Qwen3.6-35B-A3B-FP8 |    tg128 @ d8192 |     75.76 ± 0.07 | 76.00 ± 0.00 |                  |                  |                  |
| Qwen/Qwen3.6-35B-A3B-FP8 |  pp2048 @ d16384 |  8217.19 ± 12.29 |              |   2244.87 ± 3.36 |   2243.23 ± 3.36 |   2244.93 ± 3.37 |
| Qwen/Qwen3.6-35B-A3B-FP8 |   tg128 @ d16384 |     74.79 ± 0.08 | 75.33 ± 0.47 |                  |                  |                  |
| Qwen/Qwen3.6-35B-A3B-FP8 |  pp2048 @ d32768 |   7433.69 ± 7.82 |              |   4685.37 ± 4.98 |   4683.73 ± 4.98 |   4685.42 ± 4.97 |
| Qwen/Qwen3.6-35B-A3B-FP8 |   tg128 @ d32768 |     73.40 ± 0.07 | 74.00 ± 0.00 |                  |                  |                  |
| Qwen/Qwen3.6-35B-A3B-FP8 |  pp2048 @ d65536 |   6310.26 ± 8.14 |              | 10712.00 ± 13.83 | 10710.35 ± 13.83 | 10712.06 ± 13.84 |
| Qwen/Qwen3.6-35B-A3B-FP8 |   tg128 @ d65536 |     69.90 ± 0.04 | 71.00 ± 0.00 |                  |                  |                  |
| Qwen/Qwen3.6-35B-A3B-FP8 | pp2048 @ d131072 |  4672.69 ± 15.40 |              | 28491.11 ± 93.91 | 28489.47 ± 93.91 | 28491.18 ± 93.92 |
| Qwen/Qwen3.6-35B-A3B-FP8 |  tg128 @ d131072 |     64.28 ± 0.41 | 65.33 ± 0.47 |                  |                  |                  |

llama-benchy (0.3.5)
date: 2026-04-16 17:59:04 | latency mode: api

First quick benchmark on my side. I compared R1/R2 against my Qwen3.5 baseline.
The pp128 results look clearly better in most concurrency settings, while tg256 is more mixed at higher concurrency.

Concurrency pp128 R1 pp128 R2 pp128 Qwen3.5 Δ R2 vs Qwen3.5 tg256 R1 tg256 R2 tg256 Qwen3.5 Δ R2 vs Qwen3.5
c1 768* 1029 1013 +2% 49.3 48.3 45.5 +6%
c4 1592 2321 1662 +40% 122.9 115.8 -– -–
c8 2729 2635 2342 +13% 168.5 164.6 -– -–
c16 3702 3646 3262 +12% 221.7 215.2 222.9 -3%
c24 4230 3929 3663 +7% 283.0 242.4 264.8 -8%

* The c1 pp128 R1 value looks like a warm-up/outlier run.

Dude. Did you loosen the handbrake?

model test t/s peak t/s ttfr (ms) est_ppt (ms) e2e_ttft (ms)
Qwen/Qwen3.6-35B-A3B-FP8 pp2048 4037.44 ± 1512.51 628.56 ± 319.65 627.25 ± 319.65 628.65 ± 319.68
Qwen/Qwen3.6-35B-A3B-FP8 tg32 52.73 ± 0.08 54.44 ± 0.08
Qwen/Qwen3.6-35B-A3B-FP8 pp2048 @ d1024 4445.84 ± 710.55 713.11 ± 128.38 711.80 ± 128.38 713.22 ± 128.37
Qwen/Qwen3.6-35B-A3B-FP8 tg32 @ d1024 52.39 ± 0.08 54.09 ± 0.08
Qwen/Qwen3.6-35B-A3B-FP8 pp2048 @ d2048 5346.63 ± 53.67 767.47 ± 7.64 766.17 ± 7.64 767.57 ± 7.65
Qwen/Qwen3.6-35B-A3B-FP8 tg32 @ d2048 52.26 ± 0.13 53.96 ± 0.13
Qwen/Qwen3.6-35B-A3B-FP8 pp2048 @ d4096 5585.74 ± 306.96 1104.89 ± 63.03 1103.59 ± 63.03 1104.99 ± 63.06
Qwen/Qwen3.6-35B-A3B-FP8 tg32 @ d4096 52.31 ± 0.16 54.01 ± 0.17
Qwen/Qwen3.6-35B-A3B-FP8 pp2048 @ d8192 6212.15 ± 31.00 1649.78 ± 8.33 1648.48 ± 8.33 1649.89 ± 8.33
Qwen/Qwen3.6-35B-A3B-FP8 tg32 @ d8192 52.34 ± 0.09 54.04 ± 0.09
Qwen/Qwen3.6-35B-A3B-FP8 pp2048 @ d16384 5333.65 ± 16.42 3457.38 ± 10.60 3456.08 ± 10.60 3457.49 ± 10.61
Qwen/Qwen3.6-35B-A3B-FP8 tg32 @ d16384 52.15 ± 0.05 53.84 ± 0.05

Single Spark. Slightly modified 3.5 recipe by eugr (took the template out and the mods for testing).

# Recipe: Qwen/Qwen3.6-35B-A3B-FP8
# Qwen/Qwen3.6-35B-A3B model in native FP8 format


recipe_version: "1"
name: Qwen35-35B-A3B
description: vLLM serving Qwen3.6-35B-A3B-FP8

# HuggingFace model to download (optional, for --download-model)
model: Qwen/Qwen3.6-35B-A3B-FP8

solo_only: true

# Container image to use
container: vllm-node

# Mods
mods: []

# Default settings (can be overridden via CLI)
defaults:
  port: 8000
  host: 0.0.0.0
  tensor_parallel: 1
  gpu_memory_utilization: 0.7
  max_model_len: 262144
  max_num_batched_tokens: 16384

# Environment variables
env:
  VLLM_MARLIN_USE_ATOMIC_ADD: 1

# The vLLM serve command template
command: |
  vllm serve Qwen/Qwen3.6-35B-A3B-FP8 \
    --host {host} \
    --port {port} \
    --max-model-len {max_model_len} \
    --max-num-batched-tokens {max_num_batched_tokens} \
    --gpu-memory-utilization {gpu_memory_utilization} \
    --enable-auto-tool-choice \
    --tool-call-parser qwen3_coder \
    --reasoning-parser qwen3 \
    --kv-cache-dtype fp8 \
    --load-format fastsafetensors \
    --attention-backend flashinfer \
    --enable-prefix-caching

vLLM 0.19.1rc1.dev337+g17d87168d.d20260416 (just did a rebuild with latest wheels).

Toolcall-15 looks interesting. Will have a look into that.

Exactly what I am running and building now… you beat me to it. :-)

I do see you have tensor_parallel: 1 but removed it out of the command -tp {tensor_parallel} \

Got only 97%:

Qwen/Qwen3.6-35B-A3B-FP8: 97/100 (29/30) ★★★★★ Excellent with my approach.

-tp makes only sense if you use more than one GPU. For just one GPU it is not needed.

Not sure what happened, but here we go:

| model                    |           test |               t/s |     peak t/s |       ttfr (ms) |    est_ppt (ms) |   e2e_ttft (ms) |
|:-------------------------|---------------:|------------------:|-------------:|----------------:|----------------:|----------------:|
| Qwen/Qwen3.6-35B-A3B-FP8 |         pp2048 | 5689.70 ± 2687.84 |              | 542.93 ± 384.18 | 541.61 ± 384.18 | 543.00 ± 384.18 |
| Qwen/Qwen3.6-35B-A3B-FP8 |          tg128 |      75.11 ± 3.16 | 78.00 ± 0.00 |                 |                 |                 |
| Qwen/Qwen3.6-35B-A3B-FP8 | pp2048 @ d4096 | 7387.81 ± 1504.92 |              | 875.42 ± 208.01 | 874.10 ± 208.01 | 875.49 ± 208.01 |
| Qwen/Qwen3.6-35B-A3B-FP8 |  tg128 @ d4096 |      76.43 ± 0.05 | 77.00 ± 0.00 |                 |                 |                 |
| Qwen/Qwen3.6-35B-A3B-FP8 | pp2048 @ d8192 |   8351.93 ± 24.08 |              |  1227.52 ± 3.63 |  1226.19 ± 3.63 |  1227.57 ± 3.64 |
| Qwen/Qwen3.6-35B-A3B-FP8 |  tg128 @ d8192 |      76.12 ± 0.16 | 77.00 ± 0.00 |                 |                 |                 |

llama-benchy (0.3.5)
date: 2026-04-16 17:58:00 | latency mode: api

EDIT:
The delta to my previously very unimpressive result is that I took out MTP: --speculative-config '{"method":"mtp","num_speculative_tokens":2}' – I am testing this but didn’t get a performance speed-up with it. Quite the contrary.

OK. Test of VS Code CoPilot failed due to an ongoing GitHub incident:

Update - We found an issue that impacts 70% of Codespaces. We are engaged with the provider and working towards mitigation.
Apr 16, 2026 - 15:49 UTC

Update - We are experiencing degraded performance in Codespaces related to creating a new Codespace or starting an existing Codespace from the VS Code editor. SSH connections to Codespaces are not impacted. We are working toward mitigation and will continue to keep you updated on progress.

Falling back to Claude Code / opencode for testing.

…Why do I need Codespaces?

--speculative-config '{"method":"mtp","num_speculative_tokens":2}' may have been the culprit. Did you get MTP to work?

Funnily enough it is now failing TC14 – 97% at ToolCall-15.

awesome i need to benchmark it !
What a time to have local AI haha

The outage is definitely not restricted to codespaces. Lots of VS Code stuff has been broken on and off all day (I filed an issue earlier).

I have some evals running on this model and will update GitHub - DanTup/spark-evals: Some benchmark results of small models and quants that fit on DGX Spark · GitHub as they complete (they’re going to take many many hours though).

Still working on iteration to figure out the best recipe, but recipe is added to Spark Arena repos and usable via sparkrun.

Update Registries
sparkrun update

Run Recipe
sparkrun run qwen3.6-35b-a3b-fp8-vllm

Feel free to suggest improvements to the official recipe via PR: GitHub - spark-arena/recipe-registry: Official Spark Arena Recipe Registry · GitHub.

Work on quants & MTP are ongoing.

The sparkrun recipe won’t be groundbreaking to the regulars here, but it’s there to make it easily accessible.

Wow. It works flawlessly here. The code seems really good. When I enter two prompts, I get this. The only issue is that German spelling isn’t perfect with these model sizes. The same problem occurs with the “Queen 3.5” models as well. The larger models can spell German correctly.

Qwen/Qwen3.6-35B-A3B-FP8 · Hugging Face Love how they are throwing in all of the Claw & Agent benchmarks now as well :-)

Would --tensor-parallel-size 8 help as per the readme? I tried it and crashed.

Poor ticket system… don’t cry… the Jens is cheating with AI Agents…

OpenCode seems to like this model, too.