Got error:
vllm serve: error: argument --speculative-config/-sc: Value {"method":"dflash","model": "z-lab/Qwen3.6-35B-A3B-DFlash", "num_speculative_tokens":4} cannot be converted to <function loads at 0xed81d62993a0>.
What version of vLLM are you using with DFlash?
I'm running @eugr's community docker, updated with the latest from today.
vllm: 0.19.2rc1.dev213+g9558f4390.d20260426
I had to manually replace every " ", ' ' and – (curly quotes and dashes) every time I copy-paste from this forum.
Do that and it should work.
Could not see the rise in t/s… investigating the differences in configs…
Might want to scroll up and reread; DFlash configs and tests have been covered previously. :-)
Testing this config to get out of infinite thinking loops:
--override-generation-config '{"temperature": 0.6, "top_p": 0.95, "top_k": 20, "presence_penalty": 0.0, "repetition_penalty": 1.0, "min_p": 0.2}' \
Bad news is that min_p is not yet supported by speculative decoding.
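If speculative decoding stays on, one workaround (my suggestion, not something confirmed above) is to drop min_p from the override and keep the rest, with straight quotes so the JSON parses:
--override-generation-config '{"temperature": 0.6, "top_p": 0.95, "top_k": 20, "presence_penalty": 0.0, "repetition_penalty": 1.0}' \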
Hello, we use the latest NVIDIA vLLM container (26.04-py3) with Docker on a single Spark with a few parameters; at the moment it works fine at ~51 tokens/s. Next step is trying MTP 2/3:
--gpu-memory-utilization 0.8
--default-chat-template-kwargs '{"enable_thinking": false}'
--enable-auto-tool-choice
--tool-call-parser qwen3_coder
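For readers following along, those flags slot into a full vllm serve command roughly like this (the model path and port here are assumptions, not taken from the post above):
vllm serve Qwen/Qwen3.6-35B-A3B-FP8 \
  --host 0.0.0.0 --port 8000 \
  --gpu-memory-utilization 0.8 \
  --default-chat-template-kwargs '{"enable_thinking": false}' \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder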
Hey all, looking for some help from the dual-Spark crowd.
Running Qwen3.6-35B-A3B-FP8 on a 2× DGX Spark cluster (CX7 200G direct link, tp=2 over Ray) and consistently seeing ~67 tok/s single-stream decode. Saw @serapis quote 77.74 ± 0.44 tok/s on the same model + topology earlier in this thread, so I'm really hoping someone can help me figure out what I'm missing.
Numbers
| Metric | Mine | post #5 reference |
|---|---|---|
| Prefill (pp2048-class) | 7,920 tok/s | 7,824 ± 162 |
| Decode (tg128, dual-Spark) | 66.8 tok/s | 77.74 ± 0.44 |
| Decode (tg128, tp=1 single) | 53.6 tok/s | 75.1 (post #11) |
Methodology: 1.5K-token prompt + 128 decode tokens, ignore_eos=True, 5 runs, very tight error bars (~0.5 tok/s). Decode rate is inferred by subtracting short-decode time (tg8 vs tg128) to exclude prefill cost. Prefill matches the published number almost exactly; only decode is off.
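Concretely, the decode-rate inference is just this subtraction (the timings below are illustrative placeholders, not my measured values):
# decode tok/s = extra decode tokens / extra wall time between a tg128 and a tg8 run
# t_tg128 and t_tg8 are placeholder end-to-end times in seconds for the same 1.5K prompt
t_tg128=2.55
t_tg8=0.75
awk -v a="$t_tg128" -v b="$t_tg8" 'BEGIN { printf "%.1f tok/s\n", (128 - 8) / (a - b) }'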
Setup
- Hardware: 2× DGX Spark, CX7 200G direct link (RoCE, NCCL_IB_HCA=rocep1s0f1,roceP2p1s0f1)
- Container: eugr/spark-vllm-docker with the latest prebuilt wheel prebuilt-vllm-current from 2026-04-28 (vLLM 0.20.1rc1.dev23+gde3da0b97), flashinfer 0.6.9
- vLLM args (production recipe):
vllm serve Qwen/Qwen3.6-35B-A3B-FP8 \
--host 0.0.0.0 --port 8000 \
--max-model-len 262144 \
--max-num-batched-tokens 16384 \
--gpu-memory-utilization 0.7 \
--enable-auto-tool-choice \
--tool-call-parser qwen3_coder \
--reasoning-parser qwen3 \
--kv-cache-dtype fp8 \
--load-format fastsafetensors \
--attention-backend flashinfer \
--enable-prefix-caching \
--chat-template unsloth.jinja \
-tp 2 \
--distributed-executor-backend ray
What I've already tried (no decode change either way)
- Stripped config: dropped tool-parser, reasoning-parser, prefix-caching, --chat-template. Lowered --max-model-len to 32768, --max-num-batched-tokens to 8192, --load-format=instanttensor. Same 67 tok/s decode.
- vLLM bump: was on 0.19.2rc1.dev213+g9558f4390, rebuilt to 0.20.1rc1.dev23+gde3da0b97 (post-v0.20.0 stable). Same number.
- tp=1 isolation: bare config on a single Spark: 53.6 tok/s decode. That's ~22 tok/s below the post #11 single-Spark figure, so the gap isn't interconnect-related; it shows up on the single-node path too.
During steady-state decode I sampled nvidia-smi every 500 ms:
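(If anyone wants to reproduce the sampling, a query along these lines does it; the exact fields here are illustrative, not necessarily the ones I logged:)
nvidia-smi --query-gpu=clocks.sm,power.draw,utilization.gpu --format=csv -lms 500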
- GPU clock: 2405 MHz (locked at max boost, no throttling)
- Power draw: 21–33 W, average ~28 W (vs ~140 W TDP)
So the SMs are sitting idle most of the time waiting on memory loads; it's not a clock or thermal issue. Decode efficiency vs the theoretical ceiling (~91 tok/s from 273 GB/s ÷ 3 GB per A3B token):
- Mine: 73%
- Reference: 85%
That ~12% efficiency gap is real and I can't close it via config.
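For reference, the ceiling and efficiency figures above are just this arithmetic (273 GB/s memory bandwidth and ~3 GB read per A3B token are the assumptions already stated, so this is only a back-of-the-envelope check):
# bandwidth-bound decode ceiling and efficiency estimate
awk 'BEGIN {
  ceiling = 273 / 3.0                                # ~91 tok/s
  printf "ceiling: %.0f tok/s\n", ceiling
  printf "mine:    %.0f%%\n", 100 * 66.8  / ceiling  # ~73%
  printf "ref:     %.0f%%\n", 100 * 77.74 / ceiling  # ~85%
}'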
Question
What am I missing? Specifically:
- Container/image: for those of you hitting 75+ on single-Spark or ~78 on dual, what container did you actually run? The thread pins vLLM args but not the build. Is anyone on NGC vLLM (nvcr.io/nvidia/vllm:26.03.post1-py3)? A specific upstream vLLM commit? A custom flashinfer build?
- MoE kernel selection: anything I should check / pin / override? Any env vars beyond VLLM_MARLIN_USE_ATOMIC_ADD=1?
- Power / firmware: is there a Spark BIOS or driver tweak (power profile, persistence mode, etc.) that meaningfully changes the picture? (The generic checks I can think of are sketched after this list.)
- Anything else: happy to run further bench variations and share results if it helps narrow it down.
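On the power/firmware point, the only generic things I know to rule out are standard nvidia-smi checks (whether they matter at all on Spark is exactly what I'm asking):
# enable persistence mode and dump the driver's power readings/limits
sudo nvidia-smi -pm 1
nvidia-smi -q -d POWER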
Thanks in advance; really appreciate any pointers.
Intel released the fixed version: Intel/Qwen3.6-35B-A3B-int4-mixed-AutoRound · Hugging Face
Looks like a mistake... the FP8 version is 38 GB, while this one (int4) is 43 GB.
Lol, how many times is Intel gonna mess up this quant? It's been pulled again.
Hello,
I can't understand why no one is saying that Qwen3.6 or vLLM isn't stable at all. I've tried all the vLLM versions and almost all the Qwen3.6 35B models (FP8, INT4, NVFP4, with and without distillation), and they all have the same problem: endless repetition during reflections. The model starts creating a large file, then once it's finished, it decides, "Oh, actually, no, I don't like it, I'll do it differently," and this can go on several times. The same thing happens when it corrects a file; it will correct it multiple times. Sometimes it will even reflect, say something, use a tool, reflect, say the same thing again, use the same tool with the same parameters, and so on dozens of times, even indefinitely.
I tried every possible setting, starting with the recommended one. The last one that seemed stable was `{repetition_penalty: 1.1, temperature: 0.4}`, but ultimately, after a while, around 30k of context, it starts repeating again.
This happens with or without `preserve_thinking`, whether in Claude Code, VS Code, or even custom-built assistants.
My second ongoing issue, whether it's vLLM nightly, the latest stable vLLM 0.20, or the eugr vLLM build (which is based on the nightly), is tool calls. The tool calls end up as XML in the thinking_content output, forcing me to patch things everywhere. Qwen has been releasing templates for two years, and no one has been able to get vLLM working out of the box with this fix. I don't understand…
Could someone please create a ZIP file containing all the patches or commands needed to make it work properly? I'm starting to despair :(
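For what it's worth, the settings quoted above map onto the --override-generation-config flag mentioned earlier in the thread like this (same values, straight quotes so the JSON actually parses):
--override-generation-config '{"repetition_penalty": 1.1, "temperature": 0.4}' \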
Have you tried a different chat template? I do see what you describe in OpenWebUI when testing a few things, but I mostly use Claude Code (and OpenCode for minor things, though Claude Code is more stable for me than anything else, albeit slower) with the FP8 version of this model, and it works really well for me.
I use this model for speed and I double-check with 3.5 122b-a10b-Hybrid, which gives me half the speed but a bit better quality (though this 3.6-35B is actually pretty close for my workflow).
Here's my recipe:
# Recipe: Qwen/Qwen3.6-35B-A3B-FP8
# Qwen/Qwen3.6-35B-A3B model in native FP8 format
recipe_version: "1"
name: Qwen35-35B-A3B-Dflash
description: vLLM serving Qwen3.6-35B-A3B-FP8
# HuggingFace model to download (optional, for --download-model)
model: Qwen/Qwen3.6-35B-A3B-FP8
solo_only: true
# Container image to use
container: vllm-node-tf5
# Mods
mods:
  - mods/fix-qwen3.5-enhanced-chat-template
# Default settings (can be overridden via CLI)
defaults:
  port: 8000
  host: 0.0.0.0
  gpu_memory_utilization: 0.75
  max_model_len: 524288
  # max_model_len: 262144
  max_num_batched_tokens: 32768
  max_num_seqs: 8
# Environment variables
env:
  VLLM_MARLIN_USE_ATOMIC_ADD: 1
  VLLM_ALLOW_LONG_MAX_MODEL_LEN: 1
# The vLLM serve command template
command: |
  vllm serve Qwen/Qwen3.6-35B-A3B-FP8 \
    --served-model-name Qwen3.6-35B-A3B-FP8-DFlash \
    --host {host} \
    --port {port} \
    --max-model-len {max_model_len} \
    --max-num-batched-tokens {max_num_batched_tokens} \
    --gpu-memory-utilization {gpu_memory_utilization} \
    --enable-auto-tool-choice \
    --tool-call-parser qwen3_coder \
    --reasoning-parser qwen3 \
    --enable-prefix-caching \
    --enable-chunked-prefill \
    --load-format fastsafetensors \
    --default-chat-template-kwargs '{{"preserve_thinking": true}}' \
    --speculative-config '{{"method": "dflash", "model": "z-lab/Qwen3.6-35B-A3B-DFlash", "num_speculative_tokens": 4}}' \
    --attention-backend flash_attn
Look closely at the chat template sections and maybe try those parameters?
Update: after firmware updates and vLLM 0.20, about a ~6% increase in LLM speed, and tool usage is much faster.
@azampatti thanks for posting this. Would love to give it a try as I have a similar use-case, but running into the following error.
Warning: Mod path not found: ./spark-vllm-docker/mods/fix-qwen3.5-enhanced-chat-template
I grabbed the latest spark-vllm-docker.git, so I think this might be a custom implementation that isn't in there (yet?).
There are a few others built in, but I'm not sure if they're suitable replacements.
ls spark-vllm-docker/mods/ |grep -i qwen
fix-qwen3.5-autoround
fix-qwen3.5-chat-template
fix-qwen35-tp4-marlin
fix-qwen3-coder-next
fix-qwen3-next-autoround
Thanks
Not sure what you're comparing to, but if you're relying solely on 35b for everything then you should expect a bit of a bumpy experience. Maybe try using a larger model or even 27b for design/plan, and let 35b speedrun through the grunt work.
There's also another thread suggesting setting dtype to bfloat16. This might help with large context work.
Check @whpthomas's post here: Bfloat16 Quality = Speed? :) He shares the how-to to get this template up and running.
That's the recipe I am using with really good success so far with OpenCLaw.
How's this for some speed:
tool-eval-bench --base-url http://0.0.0.0:8000 --short --perf
Tool-Call Benchmark
Server: http://0.0.0.0:8000
Querying http://0.0.0.0:8000/v1/models … Intel/Qwen3.6-35B-A3B-int4-AutoRound
Warm-up complete (105 ms)
Engine: vLLM 0.20.1rc1.dev55+g3f1a4bb63.d20260429
llama-benchy Throughput Benchmark
Intel/Qwen3.6-35B-A3B-int4-AutoRound
pp=[2048] tg=[128] depth=[0, 4096, 8192] concurrency=[1, 2, 4] runs=3
latency=generation
Complete 27/27 in 0:01:48
llama-benchy 0.3.7
Estimated latency: 41.5 ms
llama-benchy Results
| Test | c | pp t/s | tg t/s | TTFT (ms) | Total (ms) | Tokens |
|---|---|---|---|---|---|---|
| pp2048 tg128 @ d0 | c1 | 5,982 | 71.9 | 360 | 2,100 | 2048+1… |
| pp2048 tg128 @ d0 | c2 | 5,852 | 123.1 | 616 | 2,637 | 2048+1… |
| pp2048 tg128 @ d0 | c4 | 6,258 | 188.3 | 1,123 | 3,747 | 2048+1… |
| pp2048 tg128 @ d4096 | c1 | 6,592 | 70.2 | 887 | 2,669 | 2048+1… |
| pp2048 tg128 @ d4096 | c2 | 6,651 | 120.0 | 1,662 | 3,738 | 2048+1… |
| pp2048 tg128 @ d4096 | c4 | 6,807 | 184.8 | 3,202 | 5,881 | 2048+1… |
| pp2048 tg128 @ d8192 | c1 | 6,668 | 68.4 | 1,439 | 3,269 | 2048+1… |
| pp2048 tg128 @ d8192 | c2 | 6,611 | 110.6 | 2,745 | 4,959 | 2048+1… |
| pp2048 tg128 @ d8192 | c4 | 6,665 | 164.7 | 5,515 | 8,398 | 2048+1… |
Benchmark Complete
Model: Intel/Qwen3.6-35B-A3B-int4-AutoRound
Score: 87 / 100
Rating: Good
Engine: vLLM 0.20.1rc1.dev55+g3f1a4bb63.d20260429
Quantization: INT4-AutoRound
Max context: 262,144 tokens
12 passed / 2 partial / 1 failed
Points: 26/30
Quality: 87/100
Responsiveness: 70/100 (median turn: 1.7s)
Deployability: 82/100 (α=0.7)
Weakest: A Tool Selection (67%)
Completed in 72.7s (tool-eval-bench v1.4.3.1)
Token Usage: 37,725 tokens total / Efficiency: 0.7 pts/1K tokens
Throughput:
Single: 6,668 pp t/s / 71.9 tg t/s / TTFT 360 ms
c2: 6,651 pp t/s / 123.1 tg t/s
c4: 6,807 pp t/s / 188.3 tg t/s
How this score is calculated:
- Each scenario: pass=2pt, partial=1pt, fail=0pt
- Category %: earned / max per category
- Final score: (total points / max points) × 100
- Deployability: 0.7×quality + 0.3×responsiveness
- Responsiveness: logistic curve (100 at <1s, ~50 at 3s, 0 at >10s)
I would like to use the 27b but unfortunately it is too slow :( I was talking about Qwen3.6 35b; I understand it is not as good as the 27b, but it is supposed to at least do something and not loop endlessly until the server crashes. I've already tried dtype bfloat16; I guess it's a little bit better, but after a long context it's still looping… I'm trying @azampatti's recipe, and I'm building the container from scratch to be certain there isn't a corrupted file, and we will see :)