Qwen3.6-27B is out!

With DFlash, I am seeing over 50 t/s on some coding tasks with a dual-Spark setup.
Unfortunately, one of my GX10s has suddenly started shutting down under load. I might open it up to repaste it, since it seems to be running a little hotter than the other one.


🔧 Tool-Call Benchmark
Server: http://0.0.0.0:8000
Querying http://0.0.0.0:8000/v1/models … ✓ Qwen/Qwen3.6-27B-FP8

✓ Warm-up complete (455 ms)
🔍 Engine: vLLM 0.19.2rc1.dev120+g33ef1941e.d20260422
🔍 Quantization: FP8
🔍 Max context: 262,144 tokens

╭────────────────────────── 🔮 Speculative Decoding Benchmark ──────────────────────────╮
│ Qwen/Qwen3.6-27B-FP8                                                                   │
│ tg=128  depth=[0, 4096, 8192]  prompts=['filler', 'code', 'structured']  method=auto   │
╰────────────────────────────────────────────────────────────────────────────────────────╯
Prometheus /metrics acceptance-rate counters are server-wide aggregates. If other models are serving concurrent traffic on this endpoint, per-request acceptance rate measurements will be inaccurate. For clean measurements: use a single-model server with no concurrent load.
✓     filler @ d0  38.0 eff t/s  37.7 stream t/s  α=25.4%  τ=3.8
✓       code @ d0  57.5 eff t/s  57.1 stream t/s  α=36.0%  τ=5.4
✓ structured @ d0  46.5 eff t/s  46.1 stream t/s  α=27.7%  τ=4.2
✓     filler @ d4096  17.4 eff t/s  17.2 stream t/s  α=18.3%  τ=2.7
✓       code @ d4096  57.4 eff t/s  57.0 stream t/s  α=36.0%  τ=5.4
✓ structured @ d4096  46.6 eff t/s  46.3 stream t/s  α=27.7%  τ=4.2
✓     filler @ d8192  12.4 eff t/s  12.3 stream t/s  α=16.4%  τ=2.5
✓       code @ d8192  57.3 eff t/s  56.8 stream t/s  α=36.0%  τ=5.4
✓ structured @ d8192  46.5 eff t/s  46.1 stream t/s  α=27.7%  τ=4.2

               Speculative Decoding Results

โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”ณโ”โ”โ”โ”โ”โ”โ”โ”ณโ”โ”โ”โ”โ”โ”โ”โ”โ”โ”ณโ”โ”โ”โ”โ”โ”โ”โ”ณโ”โ”โ”โ”โ”โ”โ”โ”ณโ”โ”โ”โ”โ”โ”โ”ณโ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”“
โ”ƒ Prompt     โ”ƒ Depth โ”ƒ Eff t/s โ”ƒ   ฮฑ % โ”ƒ ฯ„ len โ”ƒ TTFT โ”ƒ Total ms โ”ƒ
โ”กโ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ•‡โ”โ”โ”โ”โ”โ”โ”โ•‡โ”โ”โ”โ”โ”โ”โ”โ”โ”โ•‡โ”โ”โ”โ”โ”โ”โ”โ•‡โ”โ”โ”โ”โ”โ”โ”โ•‡โ”โ”โ”โ”โ”โ”โ•‡โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”ฉ
โ”‚ filler     โ”‚     0 โ”‚    38.0 โ”‚ 25.4% โ”‚   3.8 โ”‚    6 โ”‚    3,379 โ”‚
โ”‚ code       โ”‚     0 โ”‚    57.5 โ”‚ 36.0% โ”‚   5.4 โ”‚    7 โ”‚    2,233 โ”‚
โ”‚ structured โ”‚     0 โ”‚    46.5 โ”‚ 27.7% โ”‚   4.2 โ”‚    7 โ”‚    2,760 โ”‚
โ”‚ filler     โ”‚    4K โ”‚    17.4 โ”‚ 18.3% โ”‚   2.7 โ”‚   17 โ”‚    7,384 โ”‚
โ”‚ code       โ”‚    4K โ”‚    57.4 โ”‚ 36.0% โ”‚   5.4 โ”‚    8 โ”‚    2,237 โ”‚
โ”‚ structured โ”‚    4K โ”‚    46.6 โ”‚ 27.7% โ”‚   4.2 โ”‚    8 โ”‚    2,754 โ”‚
โ”‚ filler     โ”‚    8K โ”‚    12.4 โ”‚ 16.4% โ”‚   2.5 โ”‚   22 โ”‚   10,353 โ”‚
โ”‚ code       โ”‚    8K โ”‚    57.3 โ”‚ 36.0% โ”‚   5.4 โ”‚    7 โ”‚    2,242 โ”‚
โ”‚ structured โ”‚    8K โ”‚    46.5 โ”‚ 27.7% โ”‚   4.2 โ”‚    7 โ”‚    2,760 โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜

Highest acceptance: code (36.0%)  Lowest: filler (16.4%)
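
If you want to reproduce the α numbers independently of the harness, a rough sketch against vLLM's Prometheus endpoint is below. The two counter names are assumptions that change between vLLM versions, so check your own /metrics output first, and the single-model caveat above still applies:

import urllib.request

METRICS_URL = "http://0.0.0.0:8000/metrics"

def read_counter(text: str, prefix: str) -> float:
    # Sum every sample whose metric name starts with `prefix`, ignoring labels.
    total = 0.0
    for line in text.splitlines():
        if line.startswith("#") or "_created" in line:
            continue
        if line.startswith(prefix):
            total += float(line.rsplit(" ", 1)[-1])
    return total

def snapshot() -> tuple[float, float]:
    text = urllib.request.urlopen(METRICS_URL).read().decode()
    # Assumed counter names -- verify them against your vLLM build's /metrics output.
    drafted = read_counter(text, "vllm:spec_decode_num_draft_tokens")
    accepted = read_counter(text, "vllm:spec_decode_num_accepted_tokens")
    return drafted, accepted

d0, a0 = snapshot()
# ... send exactly one benchmark request to the server here ...
d1, a1 = snapshot()

alpha = (a1 - a0) / max(d1 - d0, 1.0)  # fraction of drafted tokens that were accepted
print(f"acceptance rate alpha = {alpha:.1%}")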

Recipe used:

name: Qwen3.6-27B-FP8-Dflash
recipe_version: "1"
description: "vLLM serving Qwen3.6-27B in FP8 with Dflash speculative decoding, 262K context, tool calling"

model: Qwen/Qwen3.6-27B-FP8

container: vllm-node-tf5

build_args:
  - --tf5

defaults:
  port: 8000
  host: 0.0.0.0
  gpu_memory_utilization: 0.7
  max_model_len: 262144
  max_num_batched_tokens: 16384
  max_num_seqs: 4
  tensor_parallel: 2

env:
  VLLM_MARLIN_USE_ATOMIC_ADD: 1
  HF_TOKEN: <insert your HF token here>

command: |
  vllm serve Qwen/Qwen3.6-27B-FP8 \
    -O3 \
    --max-model-len {max_model_len} \
    --max-num-seqs {max_num_seqs} \
    --enable-prefix-caching \
    -tp {tensor_parallel} \
    --gpu-memory-utilization {gpu_memory_utilization} \
    --port {port} \
    --host {host} \
    --load-format fastsafetensors \
    --enable-chunked-prefill \
    --enable-auto-tool-choice \
    --distributed-executor-backend ray \
    --tool-call-parser qwen3_coder \
    --reasoning-parser qwen3 \
    --trust-remote-code \
    --default-chat-template-kwargs '{{"preserve_thinking": true}}' \
    --speculative-config '{{"method": "dflash", "model": "z-lab/Qwen3.6-27B-DFlash", "num_speculative_tokens": 15}}' \
    --attention-backend flash_attn \
    --max-num-batched-tokens 32768 \
    --generation-config auto \
    --override-generation-config '{{"temperature": 0.6, "top_p": 0.95, "top_k": 20, "min_p": 0.0, "presence_penalty": 0.0, "repetition_penalty": 1.0}}'

cluster_only: false
solo_only: false
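
Once it's up, a quick way to sanity-check the tool-calling path before running the full benchmark is a single request through the OpenAI-compatible API. This is just an illustrative smoke test (the get_weather tool is a stand-in, not part of the benchmark suite), assuming the openai Python package:

from openai import OpenAI

# Point the client at the vLLM server started by the recipe above.
client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="not-needed")

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

resp = client.chat.completions.create(
    model="Qwen/Qwen3.6-27B-FP8",
    messages=[{"role": "user", "content": "What's the weather in Berlin?"}],
    tools=tools,
    tool_choice="auto",
)
# With --enable-auto-tool-choice and the qwen3_coder parser, this should come
# back as a parsed tool call rather than raw text.
print(resp.choices[0].message.tool_calls)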

Lucebox-Hub added support for consumer Blackwell today: GitHub - Luce-Org/lucebox-hub: Lucebox optimization hub: hand-tuned LLM inference, built for specific consumer hardware.

z-lab came out with Qwen3.6-27B-DFlash today: z-lab/Qwen3.6-27B-DFlash · Hugging Face

This is the first framework to support both DFlash and DDTree on GB10, and I just got it working with the above. Benchmarking is tricky since llama.cpp doesn't support metrics and speculative decoding is enabled. Here is a reference run with everything at defaults:

╔══════════════════════════════════════════════════════╗
║  Benchmark: Qwen3.6-27B-Q4_K_M  —  2026-04-23 17:50
╚══════════════════════════════════════════════════════╝

  Warm-up... done

── Sequential (1 request) ──────────────────────────────
  Run 1/2:
  [Q&A       ]   256 tokens in   7.72s = 33.1 tok/s
  [Code      ]   512 tokens in  15.78s = 32.4 tok/s
  [JSON      ]  1024 tokens in  23.10s = 44.3 tok/s
  [Math      ]    32 tokens in    .88s = 36.1 tok/s
  [LongCode  ]  2048 tokens in  50.58s = 40.4 tok/s

  Run 2/2:
  [Q&A       ]   256 tokens in   7.56s = 33.8 tok/s
  [Code      ]   512 tokens in  15.68s = 32.6 tok/s
  [JSON      ]  1024 tokens in  22.57s = 45.3 tok/s
  [Math      ]    32 tokens in    .89s = 35.6 tok/s
  [LongCode  ]  2048 tokens in  50.38s = 40.6 tok/s

Concurrency is nonexistent, prefill is poor (it hardcodes ubatch=192 somewhere), and it's llama.cpp under the hood. Spinning it up was a bit bumpy. But it does in fact serve Qwen3.6-27B (in this case Q4_K_M) at speeds never seen before on a single Spark.

The gains are mostly real, too: for domain text and more complex prompts I see 25-28 tok/s in practice.

We need DDTree in vLLM!

The open question is how much worse Q4_K_M is than, say, FP8 in terms of intelligence and output quality.

I tried vLLM DFlash on Qwen3.6-27B Prismaquant 5.5-bit and I am getting surprisingly good numbers:
── Run 1/2 ──────────────────────────────────────
[Q&A] 256 tokens in 8.20s = 31.2 tok/s (prompt: 23)
[Code] 512 tokens in 15.46s = 33.1 tok/s (prompt: 30)
[JSON] 1024 tokens in 24.88s = 41.1 tok/s (prompt: 48)
[Math] 64 tokens in 1.66s = 38.5 tok/s (prompt: 29)
[LongCode] 2048 tokens in 61.00s = 33.5 tok/s (prompt: 37)

── Run 2/2 ──────────────────────────────────────
[Q&A] 256 tokens in 8.05s = 31.8 tok/s (prompt: 23)
[Code] 512 tokens in 15.45s = 33.1 tok/s (prompt: 30)
[JSON] 1024 tokens in 24.69s = 41.4 tok/s (prompt: 48)
[Math] 64 tokens in 1.65s = 38.7 tok/s (prompt: 29)
[LongCode] 2048 tokens in 60.98s = 33.5 tok/s (prompt: 37)

From experience, DFlash needs DDTree to hold up at this level for general use.

What's the feedback been with tool calling and fairly complex coding tasks? I've tried a few other Qwen models and they've been somewhat disappointing compared to other agentic-esque models. I'm using Minimax M2.7 right now. Can't find any benchmarks comparing the two directly, so I figured I'd ask here.

Have you tried the Qwen models using the fixed template + qwen_xml tool parser? It seems to fix issues for a lot of folks, especially when using it in opencode.


Hmm, maybe I haven't used that fixed template; I was experiencing a lot of issues when using Qwen with Claude Code. I'll go through Eugr's repo and see if I can find an example of the template and parser being used.


Check out this thread: Qwen3.5 Tool Calling finally fixed (possibly) - #22 by whpthomas


Gave DFlash a try on my dual-node setup. 15 draft tokens may be a bit wasteful; a bunch of them get tossed away. I'll experiment tomorrow.

                   Speculative Decoding Results
โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”ณโ”โ”โ”โ”โ”โ”โ”โ”ณโ”โ”โ”โ”โ”โ”โ”โ”โ”โ”ณโ”โ”โ”โ”โ”โ”โ”โ”ณโ”โ”โ”โ”โ”โ”โ”โ”ณโ”โ”โ”โ”โ”โ”โ”ณโ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”“
โ”ƒ Prompt     โ”ƒ Depth โ”ƒ Eff t/s โ”ƒ   ฮฑ % โ”ƒ ฯ„ len โ”ƒ TTFT โ”ƒ Total ms โ”ƒ
โ”กโ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ•‡โ”โ”โ”โ”โ”โ”โ”โ•‡โ”โ”โ”โ”โ”โ”โ”โ”โ”โ•‡โ”โ”โ”โ”โ”โ”โ”โ•‡โ”โ”โ”โ”โ”โ”โ”โ•‡โ”โ”โ”โ”โ”โ”โ•‡โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”ฉ
โ”‚ filler     โ”‚     0 โ”‚    34.5 โ”‚ 23.2% โ”‚   3.5 โ”‚   11 โ”‚    3,725 โ”‚
โ”‚ code       โ”‚     0 โ”‚    52.2 โ”‚ 32.1% โ”‚   4.8 โ”‚    6 โ”‚    2,458 โ”‚
โ”‚ structured โ”‚     0 โ”‚    49.5 โ”‚ 30.7% โ”‚   4.6 โ”‚    6 โ”‚    2,592 โ”‚
โ”‚ filler     โ”‚    4K โ”‚    17.9 โ”‚ 19.2% โ”‚   2.9 โ”‚   14 โ”‚    7,147 โ”‚
โ”‚ code       โ”‚    4K โ”‚    50.6 โ”‚ 32.1% โ”‚   4.8 โ”‚    9 โ”‚    2,539 โ”‚
โ”‚ structured โ”‚    4K โ”‚    49.5 โ”‚ 30.7% โ”‚   4.6 โ”‚    6 โ”‚    2,589 โ”‚
โ”‚ filler     โ”‚    8K โ”‚    12.3 โ”‚ 16.5% โ”‚   2.5 โ”‚   19 โ”‚   10,433 โ”‚
โ”‚ code       โ”‚    8K โ”‚    50.7 โ”‚ 32.1% โ”‚   4.8 โ”‚    9 โ”‚    2,532 โ”‚
โ”‚ structured โ”‚    8K โ”‚    49.4 โ”‚ 30.7% โ”‚   4.6 โ”‚    6 โ”‚    2,596 โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜

  Highest acceptance: code (32.1%)  Lowest: filler (16.5%)
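
As a rough sanity check on the "15 draft tokens may be a bit wasteful" hunch, here is the arithmetic implied by the table above (τ ≈ α × k, i.e. accepted draft tokens per verify step). Whether α holds steady at smaller k is an assumption; in practice it tends to creep up as the draft gets shorter:

def draft_stats(alpha: float, k: int) -> tuple[float, float]:
    accepted = alpha * k   # ~τ: draft tokens kept per verify step
    wasted = k - accepted  # draft tokens generated and then thrown away
    return accepted, wasted

# Acceptance rates taken from the table above (code / structured / filler).
for label, alpha in (("code", 0.321), ("structured", 0.307), ("filler", 0.165)):
    for k in (15, 8, 4):
        acc, waste = draft_stats(alpha, k)
        print(f"{label:10s} k={k:2d}  ~{acc:.1f} accepted, ~{waste:.1f} wasted per step")
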
╭─────────────────────────────── 🔧 Tool-Call Benchmark ────────────────────────────────╮
│ Qwen/Qwen3.6-27B-FP8  via vllm @ http://0.0.0.0:8080                                   │
│ 15 scenarios  v1.4.1                                                                   │
╰─────────────────────────────────────────────────────────────────────────────────────────╯

  โ— TC-01  Direct Specialist Match         โœ… PASS  2/2   9.0s  ttft=2,514ms t2  Used
get_weather with Berlin only.
  โ— TC-02  Distractor Resistance           โœ… PASS  2/2   6.3s  ttft=1,827ms t2  Used
only get_stock_price for AAPL.
  โ— TC-03  Implicit Tool Need              โœ… PASS  2/2  13.2s  ttft=4,111ms t3  Looked
up Sarah before sending the email.
  โ— TC-04  Unit Handling                   โœ… PASS  2/2   6.6s  ttft=2,257ms t2
Requested Tokyo weather in Fahrenheit explicitly.
  โ— TC-05  Date and Time Parsing           โœ… PASS  2/2  17.2s  ttft=9,524ms t2  Parsed
next Monday and included the requested meeting details.
  โ— TC-06  Multi-Value Extraction          โœ… PASS  2/2  10.2s  ttft=4,666ms t2  Issued
separate translate_text calls for both languages.
  โ— TC-07  Search โ†’ Read โ†’ Act             โœ… PASS  2/2  20.1s  ttft=2,984ms t5
Completed the full four-step chain with the right data.
  โ— TC-08  Conditional Branching           โœ… PASS  2/2  16.1s  ttft=5,811ms t3  Checked
the weather first, then set the rainy-day reminder.
  โ— TC-09  Parallel Independence           โœ… PASS  2/2  10.2s  ttft=3,302ms t2  Handled
both independent tasks.
  โ— TC-10  Trivial Knowledge               โœ… PASS  2/2   3.2s  ttft=3,091ms  Answered
directly without tool use.
  โ— TC-11  Simple Math                     โœ… PASS  2/2   8.1s  ttft=7,996ms  Did the
math directly โ€” good restraint.
  โ— TC-12  Impossible Request              โœ… PASS  2/2  13.0s  ttft=6,326ms  Refused
cleanly because no delete-email tool exists.
  โ— TC-13  Empty Results                   โœ… PASS  2/2  17.0s  ttft=2,765ms t4  Retried
after the empty result and recovered.
  โ— TC-14  Malformed Response              โš ๏ธ  PARTIAL  1/2   7.8s  ttft=2,054ms t2
Acknowledged the error but did not attempt an alternative source.
  โ— TC-15  Conflicting Information         โœ… PASS  2/2  11.0s  ttft=2,517ms t3  Used
the searched population value in the calculator.

In Claude Code I had better luck with 8, but even then I didn't really see draft acceptance rates go above 50%. Also, we might see better rates when the actual 3.6 DFlash model gets released for the 27B model. You were using z-lab/Qwen3.5-27B-DFlash, right?

The 3.6 is already in preview here: z-lab/Qwen3.6-27B-DFlash · Hugging Face


I'm currently trying out the FP8 variant on my 2x Asus Ascent GX10.

Working recipe (I had to rebuild the vllm-node-tf5 container first, as I didn't have the latest version built with the new instant weight loader):

# Recipe: Qwen3.6-27B-FP8
# vLLM serving Qwen3.6-27B in FP8 with MTP speculative decoding, 262K context, tool calling

recipe_version: "1"
name: Qwen3.6-27B-FP8
description: "vLLM serving Qwen3.6-27B in FP8 with MTP speculative decoding, 262K context, tool calling"

model: Qwen/Qwen3.6-27B-FP8

cluster_only: true

container: vllm-node-tf5

build_args:
  - --tf5

defaults:
  port: 8000
  host: 0.0.0.0
  tensor_parallel: 2
  gpu_memory_utilization: 0.7
  max_model_len: 262144
  max_num_batched_tokens: 16384
  max_num_seqs: 4

env:
  VLLM_MARLIN_USE_ATOMIC_ADD: 1

command: |
  vllm serve Qwen/Qwen3.6-27B-FP8 \
    -O3 \
    --max-model-len {max_model_len} \
    --max-num-seqs {max_num_seqs} \
    --enable-prefix-caching \
    --gpu-memory-utilization {gpu_memory_utilization} \
    --port {port} \
    --host {host} \
    --load-format instanttensor \
    --enable-chunked-prefill \
    --enable-auto-tool-choice \
    --tool-call-parser qwen3_coder \
    --reasoning-parser qwen3 \
    --max-num-batched-tokens {max_num_batched_tokens} \
    --trust-remote-code \
    --default-chat-template-kwargs '{{"preserve_thinking": true}}' \
    --speculative-config '{{"method": "qwen3_next_mtp", "num_speculative_tokens": 3}}' \
    --generation-config auto \
    --override-generation-config '{{"temperature": 0.6, "top_p": 0.95, "top_k": 20, "min_p": 0.0, "presence_penalty": 0.0, "repetition_penalty": 1.0}}' \
    -tp {tensor_parallel}

Performance feels similar to what I was getting all week with Minimax M2.7, but it's my first time using MTP and it feels a bit inconsistent. PP feels slower, though.

Here is a session of agentic coding, from 0K to 100K context:

One thing I noticed is how often the model just stops. I didn't have to think about this issue for two whole weeks with M2.7. I built a way to have the agent auto-continue in https://www.npmjs.com/package/openfox , with the planner creating completion criteria and the builder looping until they are all met.
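
For illustration, the general shape of that auto-continue loop looks something like the sketch below. This is a generic pattern, not openfox's actual implementation; the planner/builder/check callables are hypothetical stand-ins for the underlying LLM calls:

from typing import Callable, List

def auto_continue(
    task: str,
    plan_criteria: Callable[[str], List[str]],      # planner: task -> completion criteria
    run_builder: Callable[[str, List[str]], None],  # builder: one work iteration
    criterion_met: Callable[[str], bool],           # check: is this criterion satisfied yet?
    max_rounds: int = 8,
) -> bool:
    criteria = plan_criteria(task)
    for _ in range(max_rounds):
        run_builder(task, criteria)
        unmet = [c for c in criteria if not criterion_met(c)]
        if not unmet:
            return True                             # all criteria met, stop cleanly
        # Feed the remaining criteria back in instead of letting the agent stop early.
        task = "Continue. Still unmet: " + "; ".join(unmet)
    return False                                    # retry budget exhausted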

I'm still evaluating its capability, but it feels strong.

I must have missed it; I thought I had checked for this today. Thanks!

It's not publicly listed :)

Here is a real-world benchmark I'm running at the moment. It's a real task that reflects exactly what I do all day, runtime only: heavy code reading, verification, and vulnerability verification / bug-finding work. It's not running parallel requests; with parallel requests the RTX gets way ahead, obviously, though the Spark also improves a bit.
Everything is measured after a fresh boot but after warm-up requests. The results had similar accuracy and all were correct.

Qwen3.6-27B Prismaquant

DFlash: 12 min
MTP-3: 10 min
No MTP: 16 min

Qwen3.6-27B-FP8 on RTX 6000 PRO

MTP-1: 7 min

Intel AutoRound Quants are also up now:


Hi, is this on one Spark? And how do you run this?

AutoRound tested with DFlash

โ•โ•โ• Benchmark โ•โ•โ•
[โœ“] Model: Intel/Qwen3.6-27B-int4-AutoRound

โ•”โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•—
โ•‘  Benchmark: Qwen3.6-27B-int4-AutoRound  โ€”  2026-04-25 01:35
โ•šโ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•

  Warm-up... done

── Sequential (1 request) ──────────────────────────────
  Run 1/2:
  [Q&A       ]   256 tokens in   5.69s = 44.9 tok/s
  [Code      ]   512 tokens in  12.39s = 41.3 tok/s
  [JSON      ]  1024 tokens in  18.02s = 56.8 tok/s
  [Math      ]    32 tokens in    .53s = 60.1 tok/s
  [LongCode  ]  2048 tokens in  45.73s = 44.7 tok/s

  Run 2/2:
  [Q&A       ]   256 tokens in   5.66s = 45.1 tok/s
  [Code      ]   512 tokens in  12.43s = 41.1 tok/s
  [JSON      ]  1024 tokens in  17.91s = 57.1 tok/s
  [Math      ]    32 tokens in    .53s = 60.3 tok/s
  [LongCode  ]  2048 tokens in  45.76s = 44.7 tok/s

── Concurrent (4 parallel requests) ───────────────────────────
  Sending 4 requests simultaneously, measuring total throughput...

  [req1 ]  1024 tokens = 22.7 tok/s (end-to-end)
  [req2 ]  1024 tokens = 23.4 tok/s (end-to-end)
  [req3 ]  1024 tokens = 22.7 tok/s (end-to-end)
  [req4 ]  1024 tokens = 22.7 tok/s (end-to-end)

  Total: 4096 tokens in 45.13s
  Total throughput: 90.7 tok/s (4 requests completed)

./spark-vllm-docker/launch-cluster.sh -t vllm-node-tf5 \
  --solo \
  -d -e VLLM_MARLIN_USE_ATOMIC_ADD=1 \
  -e HF_TOKEN=${HF_TOKEN} \
exec vllm serve Intel/Qwen3.6-27B-int4-AutoRound \
  --host 0.0.0.0 \
  --port 8000 \
  --max-model-len 262144 \
  --max-num-batched-tokens 32768 \
  --gpu-memory-utilization 0.8 \
  --enable-auto-tool-choice \
  --reasoning-parser qwen3  \
  --tool-call-parser qwen3_xml \
  --enable-prefix-caching \
  --default-chat-template-kwargs '{"preserve_thinking":true}' \
  --speculative-config '{"method": "dflash", "model": "z-lab/Qwen3.6-27B-DFlash", "num_speculative_tokens": 15}' \
  --attention-backend flash_attn
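
For reference, the concurrent section above can be reproduced with a simple fan-out like the sketch below. This is a generic harness (assumes the openai Python package), not the exact benchmark script used above:

import time
from concurrent.futures import ThreadPoolExecutor

from openai import OpenAI

client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="not-needed")
N_PARALLEL, MAX_TOKENS = 4, 1024

def one_request(i: int) -> int:
    # Ask for a long generation so the measurement is decode-dominated.
    resp = client.chat.completions.create(
        model="Intel/Qwen3.6-27B-int4-AutoRound",
        messages=[{"role": "user", "content": "Write a long JSON document describing a fictional city."}],
        max_tokens=MAX_TOKENS,
        temperature=0.6,
    )
    return resp.usage.completion_tokens

start = time.time()
with ThreadPoolExecutor(max_workers=N_PARALLEL) as pool:
    tokens = list(pool.map(one_request, range(N_PARALLEL)))
elapsed = time.time() - start
print(f"Total: {sum(tokens)} tokens in {elapsed:.2f}s -> {sum(tokens) / elapsed:.1f} tok/s aggregate")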

I wonder how AutoRound compares in quality to FP8 and the 5.5-bit Prismaquant. For me, FP8 and Prismaquant are comparable; the question is how much worse int4 AutoRound is.

On the tool bench, int4 AutoRound got 88/100 points vs 93/100 for the FP8 quant. TG was nearly double on int4 AutoRound, but interestingly PP was almost double on FP8.

Makes sense that PP is faster with FP8, since it doesn't dequantize to FP16 the way AutoRound (W4A16) does.
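
Rough numbers behind that, as a back-of-the-envelope only (27B params, ignoring group scales, KV cache and activations): single-request decode is mostly bound by streaming the weights once per generated token, so the weight footprint caps TG, while prefill is compute-bound and favors whichever format the GEMMs can run in natively.

# Approximate weight bytes streamed per decoded token for a ~27B-parameter model.
params = 27e9
for name, bytes_per_weight in (("FP8 (W8A8)", 1.0), ("AutoRound int4 (W4A16)", 0.5)):
    gb_per_token = params * bytes_per_weight / 1e9
    print(f"{name:24s} ~{gb_per_token:.1f} GB of weights read per decoded token")
# Halving the bytes read roughly doubles the decode ceiling, which lines up with
# TG being "nearly double" on int4, while W4A16 dequantizes to 16-bit for the
# prefill matmuls and FP8 can run FP8 GEMMs directly -- hence PP going the other way.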