Qwen/Qwen3.6-35B-A3B (and FP8) has landed

Got error:
vllm serve: error: argument --speculative-config/-sc: Value {“method”:“dflash”,“model”: “z-lab/Qwen3.6-35B-A3B-DFlash”, “num_speculative_tokens”:4} cannot be converted to <function loads at 0xed81d62993a0>.
What version of vLLM are you using with DFlash?

I'm running @eugr's community docker, updated with the latest from today

vllm: 0.19.2rc1.dev213+g9558f4390.d20260426

I had to manually replace every “” , ‘’ and – (quotes and dashes) every time I copy-paste from this forum.

Do that and it should work.
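
For reference, here's the flag with plain ASCII quotes as it should be pasted; the JSON values are the ones from the error above, and the base model is whichever variant you're serving:

vllm serve Qwen/Qwen3.6-35B-A3B \
  --speculative-config '{"method": "dflash", "model": "z-lab/Qwen3.6-35B-A3B-DFlash", "num_speculative_tokens": 4}'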

Thanks for this. DFlash is 14 tok/s faster than MTP in my workload. That's substantial! :) 65 to 80 t/s

Parallel x4 is pretty much the same as before, but the single-session performance boost is welcome :)

Could not see the rise in t/s… investigating the differences in configs…

Might want to scroll up and reread; the DFlash configs and tests were covered earlier. :-)

Testing this config to stop the infinite thinking loops:
--override-generation-config '{"temperature": 0.6, "top_p": 0.95, "top_k": 20, "presence_penalty": 0.0, "repetition_penalty": 1.0, "min_p": 0.2}' \
Bad news is that min_p is not yet supported by speculative decoding.
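
If you want to keep DFlash in the meantime, the obvious workaround is to drop min_p from the override (untested sketch on my part):

--override-generation-config '{"temperature": 0.6, "top_p": 0.95, "top_k": 20, "presence_penalty": 0.0, "repetition_penalty": 1.0}'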

Hello, we use the latest NVIDIA vLLM (26.04-py3) with Docker on a single Spark. With a few parameters it works fine atm at ~51 tok/s; next step is trying MTP 2/3:

--gpu-memory-utilization 0.8

--default-chat-template-kwargs '{"enable_thinking": false}'

--enable-auto-tool-choice

--tool-call-parser qwen3_coder
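
Assembled into one command, that would look roughly like this (the model name is my assumption, since the post doesn't name the exact checkpoint):

vllm serve Qwen/Qwen3.6-35B-A3B-FP8 \
  --gpu-memory-utilization 0.8 \
  --default-chat-template-kwargs '{"enable_thinking": false}' \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder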

Hey all, looking for some help from the dual-Spark crowd.

Running Qwen3.6-35B-A3B-FP8 on a 2× DGX Spark cluster (CX7 200G direct link, tp=2 over Ray) and consistently seeing ~67 tok/s single-stream decode. Saw @serapis quote 77.74 ± 0.44 tok/s on the same model + topology earlier in this thread, so I'm really hoping someone can help me figure out what I'm missing.

Numbers

Metric                         Mine          Post #5 reference
Prefill (pp2048-class)         7,920 tok/s   7,824 ± 162
Decode (tg128, dual-Spark)     66.8 tok/s    77.74 ± 0.44
Decode (tg128, tp=1 single)    53.6 tok/s    75.1 (post #11)

Methodology: 1.5K-token prompt + 128 decode tokens, ignore_eos=True, 5 runs, very tight error bars (~0.5 tok/s). Decode rate is inferred by subtracting short-decode time (tg8 vs tg128) to exclude prefill cost. Prefill matches the published number almost exactly; only decode is off.
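
In other words, decode comes from a two-point difference so fixed per-request costs cancel (notation mine):

decode t/s ≈ (128 - 8) / (T_total(tg128) - T_total(tg8))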

Setup

  • Hardware: 2ร— DGX Spark, CX7 200G direct link (RoCE, NCCL_IB_HCA=rocep1s0f1,roceP2p1s0f1)

  • Container: eugr/spark-vllm-docker with the latest prebuilt wheel (prebuilt-vllm-current from 2026-04-28; vLLM 0.20.1rc1.dev23+gde3da0b97, flashinfer 0.6.9)

  • vLLM args (production recipe):

vllm serve Qwen/Qwen3.6-35B-A3B-FP8 \
  --host 0.0.0.0 --port 8000 \
  --max-model-len 262144 \
  --max-num-batched-tokens 16384 \
  --gpu-memory-utilization 0.7 \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --reasoning-parser qwen3 \
  --kv-cache-dtype fp8 \
  --load-format fastsafetensors \
  --attention-backend flashinfer \
  --enable-prefix-caching \
  --chat-template unsloth.jinja \
  -tp 2 \
  --distributed-executor-backend ray

What I've already tried (no decode change either way)

  1. Stripped config: dropped tool-parser, reasoning-parser, prefix-caching, --chat-template. Lowered --max-model-len to 32768, --max-num-batched-tokens to 8192, --load-format=instanttensor. Same 67 tok/s decode.

  2. vLLM bump: was on 0.19.2rc1.dev213+g9558f4390, rebuilt to 0.20.1rc1.dev23+gde3da0b97 (post-v0.20.0 stable). Same number.

  3. tp=1 isolation: bare config on a single Spark: 53.6 tok/s decode. That's 22 tok/s below the post #11 single-Spark figure, so the gap isn't interconnect-related; it shows up on the single-node path too.

During steady-state decode I sampled nvidia-smi every 500 ms (sampling command below the list):

  • GPU clock: 2405 MHz (locked at max boost, no throttling)

  • Power draw: 21-33 W, average ~28 W (vs ~140 W TDP)
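
The sampling loop was along these lines (stock nvidia-smi query fields; -lms loops every N ms):

nvidia-smi --query-gpu=clocks.sm,power.draw --format=csv -lms 500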

So SMs are sitting idle most of the time waiting on memory loads; it's not a clock or thermal issue. Decode efficiency vs the theoretical ceiling (~91 tok/s from 273 GB/s ÷ 3 GB per A3B token):

  • Mine: 73%

  • Reference: 85%
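
(For the arithmetic: 66.8 / 91 ≈ 73% and 77.74 / 91 ≈ 85%.)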

That ~12-point efficiency gap is real, and I can't close it via config.

Question

What am I missing? Specifically:

  • Container/image: for those of you hitting 75+ on single-Spark or ~78 on dual, what container did you actually run? The thread pins vLLM args but not the build. Is anyone on NGC vLLM (nvcr.io/nvidia/vllm:26.03.post1-py3)? A specific upstream vLLM commit? A custom flashinfer build?

  • MoE kernel selection: anything I should check / pin / override? Any env vars beyond VLLM_MARLIN_USE_ATOMIC_ADD=1?

  • Power / firmware: is there a Spark BIOS or driver tweak (power profile, persistence mode, etc.) that meaningfully changes the picture? (Baseline commands sketched after this list.)

  • Anything else: happy to run further bench variations and share results if it helps narrow it down.
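
For the power/firmware bullet, the stock driver-side checks I mean are along these lines (standard nvidia-smi; whether they change anything on Spark is exactly the question):

sudo nvidia-smi -pm 1    # enable persistence mode
nvidia-smi -q -d POWER   # inspect power limits and current draw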

Thanks in advance; really appreciate any pointers.

Intel released the fixed version: Intel/Qwen3.6-35B-A3B-int4-mixed-AutoRound · Hugging Face

Looks like a mistake... FP8 size is 38 GB; this one (int4) is 43 GB.

Lol how many times is Intel gonna mess up this quant? It's been pulled again

Hello,

I can't understand why no one is saying that Qwen3.6 on vLLM isn't stable at all. I've tried all the vLLM versions and almost all the Qwen3.6 35B models (FP8, INT4, NVFP4, with and without distillation), and they all have the same problem: endless repetition during reflections. The model starts creating a large file, then once it's finished it decides, "Oh, actually, no, I don't like it, I'll do it differently," and this can go on several times. The same thing happens when it corrects a file; it will correct it multiple times. Sometimes it will even reflect, say something, use a tool, reflect, say the same thing again, use the same tool with the same parameters, and so on dozens of times, even indefinitely.

I tried every possible setting, starting with the recommended one. The last one that seemed stable was `{repetition_penalty: 1.1, temperature: 0.4}`, but ultimately, after a while, around 30k context, it starts repeating again.
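
Expressed as a vLLM flag, using the same override mechanism posted earlier in the thread, that setting would be:

--override-generation-config '{"repetition_penalty": 1.1, "temperature": 0.4}'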

This happens with or without `preserve_thinking`, whether in Claude Code, VS Code, or even custom-built assistants.

My second ongoing issue, whether it's vLLM nightly, the latest stable vLLM 0.20, or eugr's vLLM (which is based on the nightly), is tool calls. The tools end up as XML in the thinking_content output, forcing me to patch them everywhere. Qwen has been releasing templates for two years, and no one has been able to get vLLM working out of the box with this fix. I don't understand…
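
For reference, the parsing flags the configs earlier in this thread use to route tool calls and reasoning are the following; they clearly aren't enough on their own in my case:

--enable-auto-tool-choice \
--tool-call-parser qwen3_coder \
--reasoning-parser qwen3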

Could someone please create a ZIP file containing all the patches or commands needed to make it work properly? I'm starting to despair :(

Have you tried a different chat template? I do see what you describe in OpenWebUI when testing a few things, but I mostly use Claude Code (and OpenCode for minor things, though Claude Code is more stable for me than anything else, albeit slower) with the FP8 version of this model, and it works really well for me.

I use this model for speed, and I double-check with 3.5 122b-a10b-Hybrid, which gives me half the speed but slightly better quality (though this 3.6-35B is actually pretty close for my workflow).

Here's my recipe:

#Recipe: Qwen/Qwen3.6-35B-A3B-FP8  
#Qwen/Qwen3.6-35B-A3B model in native FP8 format

recipe_version: "1"
name: Qwen3.6-35B-A3B-DFlash
description: vLLM serving Qwen3.6-35B-A3B-FP8

#HuggingFace model to download (optional, for --download-model)
model: Qwen/Qwen3.6-35B-A3B-FP8
solo_only: true

#Container image to use
container: vllm-node-tf5

#Mod
mods:
  - mods/fix-qwen3.5-enhanced-chat-template

#Default settings (can be overridden via CLI)

defaults:
  port: 8000
  host: 0.0.0.0
  gpu_memory_utilization: 0.75
  max_model_len: 524288
#  max_model_len: 262144
  max_num_batched_tokens: 32768
  max_num_seqs: 8

#Environment variables
env:
  VLLM_MARLIN_USE_ATOMIC_ADD: 1
  VLLM_ALLOW_LONG_MAX_MODEL_LEN: 1

#The vLLM serve command template
command: |
  vllm serve Qwen/Qwen3.6-35B-A3B-FP8 \
  --served-model-name Qwen3.6-35B-A3B-FP8-DFlash \
  --host {host} \
  --port {port} \
  --max-model-len {max_model_len} \
  --max-num-batched-tokens {max_num_batched_tokens} \
  --gpu-memory-utilization {gpu_memory_utilization} \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --reasoning-parser qwen3 \
  --enable-prefix-caching \
  --enable-chunked-prefill \
  --load-format fastsafetensors \
  --default-chat-template-kwargs '{{"preserve_thinking": true}}' \
  --speculative-config '{{"method": "dflash", "model": "z-lab/Qwen3.6-35B-A3B-DFlash", "num_speculative_tokens": 4}}' \
  --attention-backend flash_attn

Look closely at the chat template sections and maybe try those parameters?

Update: after firmware updates and vLLM 0.20, I'm seeing about a ~6% increase in LLM speed, and tool usage is much faster.

@azampatti thanks for posting this. Would love to give it a try as I have a similar use case, but I'm running into the following error.

Warning: Mod path not found: ./spark-vllm-docker/mods/fix-qwen3.5-enhanced-chat-template

I grabbed the latest spark-vllm-docker.git so I think this might be a custom implementation that isn't in there (yet?).

There are a few others built in, but I'm not sure if any are suitable replacements.

ls spark-vllm-docker/mods/ | grep -i qwen

fix-qwen3.5-autoround
fix-qwen3.5-chat-template
fix-qwen35-tp4-marlin
fix-qwen3-coder-next
fix-qwen3-next-autoround
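
In the meantime I might try pointing the recipe at the existing base template mod, though I have no idea if it's a suitable stand-in:

mods:
  - mods/fix-qwen3.5-chat-template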

Thanks

Not sure what you're comparing to, but if you're relying solely on the 35b for everything, then you should expect a bit of a bumpy experience. Maybe try using a larger model, or even the 27b, for design/plan, and let the 35b speedrun through the grunt work.

There's also another thread suggesting setting dtype to bfloat16. This might help with large-context work.
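
If you want to test that here, it's the standard vLLM flag (on the unquantized checkpoint; bf16 weights take roughly twice the memory of FP8):

vllm serve Qwen/Qwen3.6-35B-A3B --dtype bfloat16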

Check @whpthomas's post here: Bfloat16 Quality = Speed? :) He shares the how-to to get this template up and running.

That's the recipe I am using with really good success so far with OpenCLaw.

How's this for some speed:

tool-eval-bench --base-url http://0.0.0.0:8000  --short --perf

Tool-Call Benchmark
  Server: http://0.0.0.0:8000
  Querying http://0.0.0.0:8000/v1/models … ✓ Intel/Qwen3.6-35B-A3B-int4-AutoRound

  ✓ Warm-up complete (105 ms)
  Engine: vLLM 0.20.1rc1.dev55+g3f1a4bb63.d20260429

╭─────────────────── ⚡ llama-benchy Throughput Benchmark ───────────────────╮
│ Intel/Qwen3.6-35B-A3B-int4-AutoRound                                       │
│ pp=[2048]  tg=[128]  depth=[0, 4096, 8192]  concurrency=[1, 2, 4]  runs=3  │
│ latency=generation                                                         │
╰────────────────────────────────────────────────────────────────────────────╯

  ✓ Complete ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 27/27 0:01:48

  llama-benchy 0.3.7
  Estimated latency: 41.5 ms

                              llama-benchy Results
┏━━━━━━━━━━━━━━━━━━━━━━┳━━━━┳━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━
┃                      ┃    ┃         ┃         ┃    TTFT ┃    Total ┃
┃ Test                 ┃ c  ┃  pp t/s ┃  tg t/s ┃    (ms) ┃     (ms) ┃  Tokens
┡━━━━━━━━━━━━━━━━━━━━━━╇━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━
│ pp2048 tg128 @ d0    │ c1 │   5,982 │    71.9 │     360 │    2,100 │ 2048+1…
│ pp2048 tg128 @ d0    │ c2 │   5,852 │   123.1 │     616 │    2,637 │ 2048+1…
│ pp2048 tg128 @ d0    │ c4 │   6,258 │   188.3 │   1,123 │    3,747 │ 2048+1…
│ pp2048 tg128 @ d4096 │ c1 │   6,592 │    70.2 │     887 │    2,669 │ 2048+1…
│ pp2048 tg128 @ d4096 │ c2 │   6,651 │   120.0 │   1,662 │    3,738 │ 2048+1…
│ pp2048 tg128 @ d4096 │ c4 │   6,807 │   184.8 │   3,202 │    5,881 │ 2048+1…
│ pp2048 tg128 @ d8192 │ c1 │   6,668 │    68.4 │   1,439 │    3,269 │ 2048+1…
│ pp2048 tg128 @ d8192 │ c2 │   6,611 │   110.6 │   2,745 │    4,959 │ 2048+1…
│ pp2048 tg128 @ d8192 │ c4 │   6,665 │   164.7 │   5,515 │    8,398 │ 2048+1…
└──────────────────────┴────┴─────────┴─────────┴─────────┴──────────┴────────


╭────────────────────────── Benchmark Complete ──────────────────────────────╮
│                                                                            │
│    Model:  Intel/Qwen3.6-35B-A3B-int4-AutoRound                            │
│    Score:  87 / 100                                                        │
│    Rating: ★★★★ Good                                                       │
│    Engine:       vLLM 0.20.1rc1.dev55+g3f1a4bb63.d20260429                 │
│    Quantization: INT4-AutoRound                                            │
│    Max context:  262,144 tokens                                            │
│                                                                            │
│    ✅ 12 passed   ⚠️  2 partial   ❌ 1 failed                              │
│    Points: 26/30                                                           │
│                                                                            │
│    Quality:        87/100                                                  │
│    Responsiveness: 70/100  (median turn: 1.7s)                             │
│    Deployability:  82/100  (α=0.7)                                         │
│    Weakest: A Tool Selection (67%)                                         │
│                                                                            │
│    Completed in 72.7s  │  tool-eval-bench v1.4.3.1                         │
│                                                                            │
│    Token Usage:                                                            │
│    Total: 37,725 tokens  │  Efficiency: 0.7 pts/1K tokens                  │
│                                                                            │
│    ⚡ Throughput:                                                          │
│    Single:  6,668 pp t/s  │  71.9 tg t/s  │  TTFT 360ms                    │
│    c2:      6,651 pp t/s  │  123.1 tg t/s                                  │
│    c4:      6,807 pp t/s  │  188.3 tg t/s                                  │
│                                                                            │
│    ── How this score is calculated ──                                      │
│    • Each scenario: pass=2pt, partial=1pt, fail=0pt                        │
│    • Category %: earned / max per category                                 │
│    • Final score: (total points / max points) × 100                        │
│    • Deployability: 0.7×quality + 0.3×responsiveness                       │
│    • Responsiveness: logistic curve (100 at <1s, ~50 at 3s, 0 at >10s)     │
│                                                                            │
╰────────────────────────────────────────────────────────────────────────────╯

I would like to use the 27b, but unfortunately it is too slow :( I was talking about Qwen3.6 35b; I understand that it is not as good as the 27b, but it is supposed to at least do something and not loop endlessly until the server crashes. I've already tried dtype bfloat16; I guess it's a little bit better, but after a long context it's still looping… I'm trying azampatti's recipe and building the container from scratch to be certain there isn't a corrupted file, and we will see :)