Qwen3.6-27B is out!

With DFlash, I am seeing over 50 t/s on some coding tasks with a dual-Spark setup.
Unfortunately, one of my GX10s has suddenly started shutting down under load. I might open it up to repaste it, since it seems to be running a little hotter than the other one.


🔧 Tool-Call Benchmark
Server: http://0.0.0.0:8000
Querying http://0.0.0.0:8000/v1/models … ✓ Qwen/Qwen3.6-27B-FP8

✓ Warm-up complete (455 ms)
🔍 Engine: vLLM 0.19.2rc1.dev120+g33ef1941e.d20260422
🔍 Quantization: FP8
🔍 Max context: 262,144 tokens

╭────────────────────────── 🔮 Speculative Decoding Benchmark ──────────────────────────╮
│ Qwen/Qwen3.6-27B-FP8                                                                   │
│ tg=128  depth=[0, 4096, 8192]  prompts=['filler', 'code', 'structured']  method=auto   │
╰────────────────────────────────────────────────────────────────────────────────────────╯
Prometheus /metrics acceptance-rate counters are server-wide aggregates. If other models are serving concurrent traffic on this endpoint, per-request acceptance rate measurements will be inaccurate. For clean measurements: use a single-model server with no concurrent load.
✓     filler @ d0  38.0 eff t/s  37.7 stream t/s  α=25.4%  τ=3.8
✓       code @ d0  57.5 eff t/s  57.1 stream t/s  α=36.0%  τ=5.4
✓ structured @ d0  46.5 eff t/s  46.1 stream t/s  α=27.7%  τ=4.2
✓     filler @ d4096  17.4 eff t/s  17.2 stream t/s  α=18.3%  τ=2.7
✓       code @ d4096  57.4 eff t/s  57.0 stream t/s  α=36.0%  τ=5.4
✓ structured @ d4096  46.6 eff t/s  46.3 stream t/s  α=27.7%  τ=4.2
✓     filler @ d8192  12.4 eff t/s  12.3 stream t/s  α=16.4%  τ=2.5
✓       code @ d8192  57.3 eff t/s  56.8 stream t/s  α=36.0%  τ=5.4
✓ structured @ d8192  46.5 eff t/s  46.1 stream t/s  α=27.7%  τ=4.2

               Speculative Decoding Results

โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”ณโ”โ”โ”โ”โ”โ”โ”โ”ณโ”โ”โ”โ”โ”โ”โ”โ”โ”โ”ณโ”โ”โ”โ”โ”โ”โ”โ”ณโ”โ”โ”โ”โ”โ”โ”โ”ณโ”โ”โ”โ”โ”โ”โ”ณโ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”“
โ”ƒ Prompt     โ”ƒ Depth โ”ƒ Eff t/s โ”ƒ   ฮฑ % โ”ƒ ฯ„ len โ”ƒ TTFT โ”ƒ Total ms โ”ƒ
โ”กโ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ•‡โ”โ”โ”โ”โ”โ”โ”โ•‡โ”โ”โ”โ”โ”โ”โ”โ”โ”โ•‡โ”โ”โ”โ”โ”โ”โ”โ•‡โ”โ”โ”โ”โ”โ”โ”โ•‡โ”โ”โ”โ”โ”โ”โ•‡โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”ฉ
โ”‚ filler     โ”‚     0 โ”‚    38.0 โ”‚ 25.4% โ”‚   3.8 โ”‚    6 โ”‚    3,379 โ”‚
โ”‚ code       โ”‚     0 โ”‚    57.5 โ”‚ 36.0% โ”‚   5.4 โ”‚    7 โ”‚    2,233 โ”‚
โ”‚ structured โ”‚     0 โ”‚    46.5 โ”‚ 27.7% โ”‚   4.2 โ”‚    7 โ”‚    2,760 โ”‚
โ”‚ filler     โ”‚    4K โ”‚    17.4 โ”‚ 18.3% โ”‚   2.7 โ”‚   17 โ”‚    7,384 โ”‚
โ”‚ code       โ”‚    4K โ”‚    57.4 โ”‚ 36.0% โ”‚   5.4 โ”‚    8 โ”‚    2,237 โ”‚
โ”‚ structured โ”‚    4K โ”‚    46.6 โ”‚ 27.7% โ”‚   4.2 โ”‚    8 โ”‚    2,754 โ”‚
โ”‚ filler     โ”‚    8K โ”‚    12.4 โ”‚ 16.4% โ”‚   2.5 โ”‚   22 โ”‚   10,353 โ”‚
โ”‚ code       โ”‚    8K โ”‚    57.3 โ”‚ 36.0% โ”‚   5.4 โ”‚    7 โ”‚    2,242 โ”‚
โ”‚ structured โ”‚    8K โ”‚    46.5 โ”‚ 27.7% โ”‚   4.2 โ”‚    7 โ”‚    2,760 โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜

Highest acceptance: code (36.0%)  Lowest: filler (16.4%)
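
If you want to reproduce the α numbers independently of the harness, a rough sketch against vLLM's Prometheus endpoint is below. The two counter names are assumptions that change between vLLM versions, so check your own /metrics output first, and the single-model caveat above still applies:

import urllib.request

METRICS_URL = "http://0.0.0.0:8000/metrics"

def read_counter(text: str, prefix: str) -> float:
    # Sum every sample whose metric name starts with `prefix`, ignoring labels.
    total = 0.0
    for line in text.splitlines():
        if line.startswith("#") or "_created" in line:
            continue
        if line.startswith(prefix):
            total += float(line.rsplit(" ", 1)[-1])
    return total

def snapshot() -> tuple[float, float]:
    text = urllib.request.urlopen(METRICS_URL).read().decode()
    # Assumed counter names -- verify them against your vLLM build's /metrics output.
    drafted = read_counter(text, "vllm:spec_decode_num_draft_tokens")
    accepted = read_counter(text, "vllm:spec_decode_num_accepted_tokens")
    return drafted, accepted

d0, a0 = snapshot()
# ... send exactly one benchmark request to the server here ...
d1, a1 = snapshot()

alpha = (a1 - a0) / max(d1 - d0, 1.0)  # fraction of drafted tokens that were accepted
print(f"acceptance rate alpha = {alpha:.1%}")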

Recipe used:

name: Qwen3.6-27B-FP8-Dflash
recipe_version: "1"
description: "vLLM serving Qwen3.6-27B in FP8 with Dflash speculative decoding, 262K context, tool calling"

model: Qwen/Qwen3.6-27B-FP8

container: vllm-node-tf5

build_args:
  - --tf5

defaults:
  port: 8000
  host: 0.0.0.0
  gpu_memory_utilization: 0.7
  max_model_len: 262144
  max_num_batched_tokens: 16384
  max_num_seqs: 4
  tensor_parallel: 2

env:
  VLLM_MARLIN_USE_ATOMIC_ADD: 1
  HF_TOKEN: <insert your HF token here>

command: |
  vllm serve Qwen/Qwen3.6-27B-FP8 \
    -O3 \
    --max-model-len {max_model_len} \
    --max-num-seqs {max_num_seqs} \
    --enable-prefix-caching \
    -tp {tensor_parallel} \
    --gpu-memory-utilization {gpu_memory_utilization} \
    --port {port} \
    --host {host} \
    --load-format fastsafetensors \
    --enable-chunked-prefill \
    --enable-auto-tool-choice \
    --distributed-executor-backend ray \
    --tool-call-parser qwen3_coder \
    --reasoning-parser qwen3 \
    --trust-remote-code \
    --default-chat-template-kwargs '{{"preserve_thinking": true}}' \
    --speculative-config '{{"method": "dflash", "model": "z-lab/Qwen3.6-27B-DFlash", "num_speculative_tokens": 15}}' \
    --attention-backend flash_attn \
    --max-num-batched-tokens 32768 \
    --generation-config auto \
    --override-generation-config '{{"temperature": 0.6, "top_p": 0.95, "top_k": 20, "min_p": 0.0, "presence_penalty": 0.0, "repetition_penalty": 1.0}}'

cluster_only: false
solo_only: false
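
Once it's up, a quick way to sanity-check the tool-calling path before running the full benchmark is a single request through the OpenAI-compatible API. This is just an illustrative smoke test (the get_weather tool is a stand-in, not part of the benchmark suite), assuming the openai Python package:

from openai import OpenAI

# Point the client at the vLLM server started by the recipe above.
client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="not-needed")

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

resp = client.chat.completions.create(
    model="Qwen/Qwen3.6-27B-FP8",
    messages=[{"role": "user", "content": "What's the weather in Berlin?"}],
    tools=tools,
    tool_choice="auto",
)
# With --enable-auto-tool-choice and the qwen3_coder parser, this should come
# back as a parsed tool call rather than raw text.
print(resp.choices[0].message.tool_calls)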

Lucebox-Hub added support for consumer Blackwell today: GitHub - Luce-Org/lucebox-hub: Lucebox optimization hub: hand-tuned LLM inference, built for specific consumer hardware.

z-lab came out with Qwen3.6-27B-DFlash today: z-lab/Qwen3.6-27B-DFlash · Hugging Face

This is the first framework to support both DFlash and DDTree on GB10, and I just got it working with the above. Benchmarking is tricky since llama.cpp doesn't support metrics and speculative decoding is enabled. Here is a reference run with everything at defaults:

╔══════════════════════════════════════════════════════╗
║  Benchmark: Qwen3.6-27B-Q4_K_M  —  2026-04-23 17:50
╚══════════════════════════════════════════════════════╝

  Warm-up... done

── Sequential (1 request) ──────────────────────────────
  Run 1/2:
  [Q&A       ]   256 tokens in   7.72s = 33.1 tok/s
  [Code      ]   512 tokens in  15.78s = 32.4 tok/s
  [JSON      ]  1024 tokens in  23.10s = 44.3 tok/s
  [Math      ]    32 tokens in    .88s = 36.1 tok/s
  [LongCode  ]  2048 tokens in  50.58s = 40.4 tok/s

  Run 2/2:
  [Q&A       ]   256 tokens in   7.56s = 33.8 tok/s
  [Code      ]   512 tokens in  15.68s = 32.6 tok/s
  [JSON      ]  1024 tokens in  22.57s = 45.3 tok/s
  [Math      ]    32 tokens in    .89s = 35.6 tok/s
  [LongCode  ]  2048 tokens in  50.38s = 40.6 tok/s

Concurrency is nonexistent, prefill is poor (it hardcodes ubatch=192 somewhere), and it's llama.cpp under the hood. Spinning it up was a bit bumpy. But it does in fact serve Qwen3.6-27B (in this case Q4_K_M) at speeds never seen before on a single Spark.

The gains are mostly real, too: for domain text and more complex prompts I see 25-28 tok/s in practice.

We need DDTree in vLLM!

The open question is how much worse Q4_K_M is than, say, FP8 in terms of intelligence and output quality.

I tried vLLM DFlash on Qwen3.6-27B Prismaquant 5.5-bit and I am getting surprisingly good numbers:
── Run 1/2 ──────────────────────────────────────
[Q&A] 256 tokens in 8.20s = 31.2 tok/s (prompt: 23)
[Code] 512 tokens in 15.46s = 33.1 tok/s (prompt: 30)
[JSON] 1024 tokens in 24.88s = 41.1 tok/s (prompt: 48)
[Math] 64 tokens in 1.66s = 38.5 tok/s (prompt: 29)
[LongCode] 2048 tokens in 61.00s = 33.5 tok/s (prompt: 37)

── Run 2/2 ──────────────────────────────────────
[Q&A] 256 tokens in 8.05s = 31.8 tok/s (prompt: 23)
[Code] 512 tokens in 15.45s = 33.1 tok/s (prompt: 30)
[JSON] 1024 tokens in 24.69s = 41.4 tok/s (prompt: 48)
[Math] 64 tokens in 1.65s = 38.7 tok/s (prompt: 29)
[LongCode] 2048 tokens in 60.98s = 33.5 tok/s (prompt: 37)

From experience, DFlash needs DDTree to hold up at this level for general use.

What's the feedback been with tool calling and fairly complex coding tasks? I've tried a few other Qwen models and they've been somewhat disappointing compared to other agentic-esque models. I'm using Minimax M2.7 right now. Can't find any benchmarks comparing the two directly, so I figured I'd ask here.

Have you tried the Qwen models using the fixed template + qwen_xml tool parser? It seems to fix issues for a lot of folks, especially when using it in opencode.


Hmm, maybe I haven't used that fixed template; I was experiencing a lot of issues when using Qwen with Claude Code. I'll go through Eugr's repo and see if I can find an example of the template and parser being used.


Check out this thread: Qwen3.5 Tool Calling finally fixed (possibly) - #22 by whpthomas


Gave DFlash a try on my dual-node setup. 15 draft tokens may be a bit wasteful; a bunch of them get tossed away. I'll experiment tomorrow.

                   Speculative Decoding Results
โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”ณโ”โ”โ”โ”โ”โ”โ”โ”ณโ”โ”โ”โ”โ”โ”โ”โ”โ”โ”ณโ”โ”โ”โ”โ”โ”โ”โ”ณโ”โ”โ”โ”โ”โ”โ”โ”ณโ”โ”โ”โ”โ”โ”โ”ณโ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”“
โ”ƒ Prompt     โ”ƒ Depth โ”ƒ Eff t/s โ”ƒ   ฮฑ % โ”ƒ ฯ„ len โ”ƒ TTFT โ”ƒ Total ms โ”ƒ
โ”กโ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ•‡โ”โ”โ”โ”โ”โ”โ”โ•‡โ”โ”โ”โ”โ”โ”โ”โ”โ”โ•‡โ”โ”โ”โ”โ”โ”โ”โ•‡โ”โ”โ”โ”โ”โ”โ”โ•‡โ”โ”โ”โ”โ”โ”โ•‡โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”ฉ
โ”‚ filler     โ”‚     0 โ”‚    34.5 โ”‚ 23.2% โ”‚   3.5 โ”‚   11 โ”‚    3,725 โ”‚
โ”‚ code       โ”‚     0 โ”‚    52.2 โ”‚ 32.1% โ”‚   4.8 โ”‚    6 โ”‚    2,458 โ”‚
โ”‚ structured โ”‚     0 โ”‚    49.5 โ”‚ 30.7% โ”‚   4.6 โ”‚    6 โ”‚    2,592 โ”‚
โ”‚ filler     โ”‚    4K โ”‚    17.9 โ”‚ 19.2% โ”‚   2.9 โ”‚   14 โ”‚    7,147 โ”‚
โ”‚ code       โ”‚    4K โ”‚    50.6 โ”‚ 32.1% โ”‚   4.8 โ”‚    9 โ”‚    2,539 โ”‚
โ”‚ structured โ”‚    4K โ”‚    49.5 โ”‚ 30.7% โ”‚   4.6 โ”‚    6 โ”‚    2,589 โ”‚
โ”‚ filler     โ”‚    8K โ”‚    12.3 โ”‚ 16.5% โ”‚   2.5 โ”‚   19 โ”‚   10,433 โ”‚
โ”‚ code       โ”‚    8K โ”‚    50.7 โ”‚ 32.1% โ”‚   4.8 โ”‚    9 โ”‚    2,532 โ”‚
โ”‚ structured โ”‚    8K โ”‚    49.4 โ”‚ 30.7% โ”‚   4.6 โ”‚    6 โ”‚    2,596 โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜

  Highest acceptance: code (32.1%)  Lowest: filler (16.5%)
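
As a rough sanity check on the "15 draft tokens may be a bit wasteful" hunch, here is the arithmetic implied by the table above (τ ≈ α × k, i.e. accepted draft tokens per verify step). Whether α holds steady at smaller k is an assumption; in practice it tends to creep up as the draft gets shorter:

def draft_stats(alpha: float, k: int) -> tuple[float, float]:
    accepted = alpha * k   # ~τ: draft tokens kept per verify step
    wasted = k - accepted  # draft tokens generated and then thrown away
    return accepted, wasted

# Acceptance rates taken from the table above (code / structured / filler).
for label, alpha in (("code", 0.321), ("structured", 0.307), ("filler", 0.165)):
    for k in (15, 8, 4):
        acc, waste = draft_stats(alpha, k)
        print(f"{label:10s} k={k:2d}  ~{acc:.1f} accepted, ~{waste:.1f} wasted per step")
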
╭─────────────────────────────── 🔧 Tool-Call Benchmark ────────────────────────────────╮
│ Qwen/Qwen3.6-27B-FP8  via vllm @ http://0.0.0.0:8080                                   │
│ 15 scenarios  v1.4.1                                                                   │
╰─────────────────────────────────────────────────────────────────────────────────────────╯

  โ— TC-01  Direct Specialist Match         โœ… PASS  2/2   9.0s  ttft=2,514ms t2  Used
get_weather with Berlin only.
  โ— TC-02  Distractor Resistance           โœ… PASS  2/2   6.3s  ttft=1,827ms t2  Used
only get_stock_price for AAPL.
  โ— TC-03  Implicit Tool Need              โœ… PASS  2/2  13.2s  ttft=4,111ms t3  Looked
up Sarah before sending the email.
  โ— TC-04  Unit Handling                   โœ… PASS  2/2   6.6s  ttft=2,257ms t2
Requested Tokyo weather in Fahrenheit explicitly.
  โ— TC-05  Date and Time Parsing           โœ… PASS  2/2  17.2s  ttft=9,524ms t2  Parsed
next Monday and included the requested meeting details.
  โ— TC-06  Multi-Value Extraction          โœ… PASS  2/2  10.2s  ttft=4,666ms t2  Issued
separate translate_text calls for both languages.
  โ— TC-07  Search โ†’ Read โ†’ Act             โœ… PASS  2/2  20.1s  ttft=2,984ms t5
Completed the full four-step chain with the right data.
  โ— TC-08  Conditional Branching           โœ… PASS  2/2  16.1s  ttft=5,811ms t3  Checked
the weather first, then set the rainy-day reminder.
  โ— TC-09  Parallel Independence           โœ… PASS  2/2  10.2s  ttft=3,302ms t2  Handled
both independent tasks.
  โ— TC-10  Trivial Knowledge               โœ… PASS  2/2   3.2s  ttft=3,091ms  Answered
directly without tool use.
  โ— TC-11  Simple Math                     โœ… PASS  2/2   8.1s  ttft=7,996ms  Did the
math directly โ€” good restraint.
  โ— TC-12  Impossible Request              โœ… PASS  2/2  13.0s  ttft=6,326ms  Refused
cleanly because no delete-email tool exists.
  โ— TC-13  Empty Results                   โœ… PASS  2/2  17.0s  ttft=2,765ms t4  Retried
after the empty result and recovered.
  โ— TC-14  Malformed Response              โš ๏ธ  PARTIAL  1/2   7.8s  ttft=2,054ms t2
Acknowledged the error but did not attempt an alternative source.
  โ— TC-15  Conflicting Information         โœ… PASS  2/2  11.0s  ttft=2,517ms t3  Used
the searched population value in the calculator.

In Claude Code I had better luck with 8, but even then I didn't really see draft acceptance rates go above 50%. Also, we might see better rates when the actual 3.6 DFlash model gets released for the 27B model. You were using z-lab/Qwen3.5-27B-DFlash, right?

The 3.6 is already in preview here: z-lab/Qwen3.6-27B-DFlash · Hugging Face


I'm currently trying out the FP8 variant on my 2x Asus Ascent GX10.

Working recipe (I had to rebuild the vllm-node-tf5 container first, as I didn't have the latest version built with the new instant weight loader):

# Recipe: Qwen3.6-27B-FP8
# vLLM serving Qwen3.6-27B in FP8 with MTP speculative decoding, 262K context, tool calling

recipe_version: "1"
name: Qwen3.6-27B-FP8
description: "vLLM serving Qwen3.6-27B in FP8 with MTP speculative decoding, 262K context, tool calling"

model: Qwen/Qwen3.6-27B-FP8

cluster_only: true

container: vllm-node-tf5

build_args:
  - --tf5

defaults:
  port: 8000
  host: 0.0.0.0
  tensor_parallel: 2
  gpu_memory_utilization: 0.7
  max_model_len: 262144
  max_num_batched_tokens: 16384
  max_num_seqs: 4

env:
  VLLM_MARLIN_USE_ATOMIC_ADD: 1

command: |
  vllm serve Qwen/Qwen3.6-27B-FP8 \
    -O3 \
    --max-model-len {max_model_len} \
    --max-num-seqs {max_num_seqs} \
    --enable-prefix-caching \
    --gpu-memory-utilization {gpu_memory_utilization} \
    --port {port} \
    --host {host} \
    --load-format instanttensor \
    --enable-chunked-prefill \
    --enable-auto-tool-choice \
    --tool-call-parser qwen3_coder \
    --reasoning-parser qwen3 \
    --max-num-batched-tokens {max_num_batched_tokens} \
    --trust-remote-code \
    --default-chat-template-kwargs '{{"preserve_thinking": true}}' \
    --speculative-config '{{"method": "qwen3_next_mtp", "num_speculative_tokens": 3}}' \
    --generation-config auto \
    --override-generation-config '{{"temperature": 0.6, "top_p": 0.95, "top_k": 20, "min_p": 0.0, "presence_penalty": 0.0, "repetition_penalty": 1.0}}' \
    -tp {tensor_parallel}

Performance feels similar to what I was getting all week with Minimax M2.7, but it's my first time using MTP and it feels a bit inconsistent. PP feels slower, though.

Here is a session of agentic coding, from 0K to 100K context:

One thing I noticed is how often the model just stops. I didn't have to think about this issue for two whole weeks with M2.7. I built a way to have the agent auto-continue in https://www.npmjs.com/package/openfox , with the planner creating completion criteria and the builder looping until they are all met.
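
For illustration, the general shape of that auto-continue loop looks something like the sketch below. This is a generic pattern, not openfox's actual implementation; the planner/builder/check callables are hypothetical stand-ins for the underlying LLM calls:

from typing import Callable, List

def auto_continue(
    task: str,
    plan_criteria: Callable[[str], List[str]],      # planner: task -> completion criteria
    run_builder: Callable[[str, List[str]], None],  # builder: one work iteration
    criterion_met: Callable[[str], bool],           # check: is this criterion satisfied yet?
    max_rounds: int = 8,
) -> bool:
    criteria = plan_criteria(task)
    for _ in range(max_rounds):
        run_builder(task, criteria)
        unmet = [c for c in criteria if not criterion_met(c)]
        if not unmet:
            return True                             # all criteria met, stop cleanly
        # Feed the remaining criteria back in instead of letting the agent stop early.
        task = "Continue. Still unmet: " + "; ".join(unmet)
    return False                                    # retry budget exhausted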

I'm still evaluating its capability, but it feels strong.

I must have missed it; I thought I had checked for this today. Thanks!

It's not publicly listed :)

Here is a real-world benchmark I'm running at the moment. It's a real task that reflects exactly what I do all day, runtime only: heavy code reading, verification, and vulnerability verification / bug-finding work. It's not running parallel requests; with parallel requests the RTX gets way ahead, obviously, though the Spark also improves a bit.
Everything is measured after a fresh boot but after warm-up requests. The results had similar accuracy and all were correct.

Qwen3.6-27B Prismaquant

DFlash: 12 min
MTP-3: 10 min
No MTP: 16 min

Qwen3.6-27B-FP8 on RTX 6000 PRO

MTP-1: 7 min

Intel AutoRound Quants are also up now:


Hi, is this on one Spark? And how do you run this?

AutoRound tested with DFlash

โ•โ•โ• Benchmark โ•โ•โ•
[โœ“] Model: Intel/Qwen3.6-27B-int4-AutoRound

โ•”โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•—
โ•‘  Benchmark: Qwen3.6-27B-int4-AutoRound  โ€”  2026-04-25 01:35
โ•šโ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•

  Warm-up... done

── Sequential (1 request) ──────────────────────────────
  Run 1/2:
  [Q&A       ]   256 tokens in   5.69s = 44.9 tok/s
  [Code      ]   512 tokens in  12.39s = 41.3 tok/s
  [JSON      ]  1024 tokens in  18.02s = 56.8 tok/s
  [Math      ]    32 tokens in    .53s = 60.1 tok/s
  [LongCode  ]  2048 tokens in  45.73s = 44.7 tok/s

  Run 2/2:
  [Q&A       ]   256 tokens in   5.66s = 45.1 tok/s
  [Code      ]   512 tokens in  12.43s = 41.1 tok/s
  [JSON      ]  1024 tokens in  17.91s = 57.1 tok/s
  [Math      ]    32 tokens in    .53s = 60.3 tok/s
  [LongCode  ]  2048 tokens in  45.76s = 44.7 tok/s

── Concurrent (4 parallel requests) ───────────────────────────
  Sending 4 requests simultaneously, measuring total throughput...

  [req1 ]  1024 tokens = 22.7 tok/s (end-to-end)
  [req2 ]  1024 tokens = 23.4 tok/s (end-to-end)
  [req3 ]  1024 tokens = 22.7 tok/s (end-to-end)
  [req4 ]  1024 tokens = 22.7 tok/s (end-to-end)

  Total: 4096 tokens in 45.13s
  Total throughput: 90.7 tok/s (4 requests completed)

./spark-vllm-docker/launch-cluster.sh -t vllm-node-tf5 \
  --solo \
  -d -e VLLM_MARLIN_USE_ATOMIC_ADD=1 \
  -e HF_TOKEN=${HF_TOKEN} \
exec vllm serve Intel/Qwen3.6-27B-int4-AutoRound \
  --host 0.0.0.0 \
  --port 8000 \
  --max-model-len 262144 \
  --max-num-batched-tokens 32768 \
  --gpu-memory-utilization 0.8 \
  --enable-auto-tool-choice \
  --reasoning-parser qwen3  \
  --tool-call-parser qwen3_xml \
  --enable-prefix-caching \
  --default-chat-template-kwargs '{"preserve_thinking":true}' \
  --speculative-config '{"method": "dflash", "model": "z-lab/Qwen3.6-27B-DFlash", "num_speculative_tokens": 15}' \
  --attention-backend flash_attn
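
For reference, the concurrent section above can be reproduced with a simple fan-out like the sketch below. This is a generic harness (assumes the openai Python package), not the exact benchmark script used above:

import time
from concurrent.futures import ThreadPoolExecutor

from openai import OpenAI

client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="not-needed")
N_PARALLEL, MAX_TOKENS = 4, 1024

def one_request(i: int) -> int:
    # Ask for a long generation so the measurement is decode-dominated.
    resp = client.chat.completions.create(
        model="Intel/Qwen3.6-27B-int4-AutoRound",
        messages=[{"role": "user", "content": "Write a long JSON document describing a fictional city."}],
        max_tokens=MAX_TOKENS,
        temperature=0.6,
    )
    return resp.usage.completion_tokens

start = time.time()
with ThreadPoolExecutor(max_workers=N_PARALLEL) as pool:
    tokens = list(pool.map(one_request, range(N_PARALLEL)))
elapsed = time.time() - start
print(f"Total: {sum(tokens)} tokens in {elapsed:.2f}s -> {sum(tokens) / elapsed:.1f} tok/s aggregate")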

I wonder how AutoRound compares in quality to FP8 and the 5.5-bit Prismaquant. For me, FP8 and Prismaquant are comparable; the question is how much worse int4 AutoRound is.

On the tool bench, int4 AutoRound got 88/100 points vs 93/100 for the FP8 quant. TG was nearly double on int4 AutoRound, but interestingly PP was almost double on FP8.

Makes sense that PP is faster with FP8, since it doesn't dequantize to FP16 the way AutoRound (W4A16) does.
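
Rough numbers behind that, as a back-of-the-envelope only (27B params, ignoring group scales, KV cache and activations): single-request decode is mostly bound by streaming the weights once per generated token, so the weight footprint caps TG, while prefill is compute-bound and favors whichever format the GEMMs can run in natively.

# Approximate weight bytes streamed per decoded token for a ~27B-parameter model.
params = 27e9
for name, bytes_per_weight in (("FP8 (W8A8)", 1.0), ("AutoRound int4 (W4A16)", 0.5)):
    gb_per_token = params * bytes_per_weight / 1e9
    print(f"{name:24s} ~{gb_per_token:.1f} GB of weights read per decoded token")
# Halving the bytes read roughly doubles the decode ceiling, which lines up with
# TG being "nearly double" on int4, while W4A16 dequantizes to 16-bit for the
# prefill matmuls and FP8 can run FP8 GEMMs directly -- hence PP going the other way.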