Qwen3.6-27B is out!

serapis · May 26, 2026, 11:39am

The PR has not been merged into vLLM itself; a fork of vLLM apparently pulled and merged it.

azampatti · May 26, 2026, 3:17pm

That’s a solid tool benchmark. Which chat template are you using here?

My only observation is that the median turn of 4.3s might be a bit slow for some interactive work.

azampatti · May 26, 2026, 5:46pm

I rebuilt everything from scratch and re-tested.

My custom dFlash build:

  [Q&A] 256 tokens in 4.77s = 53.6 tok/s (prompt: 23)
  [Code] 512 tokens in 8.90s = 57.5 tok/s (prompt: 30)
  [JSON] 1024 tokens in 13.79s = 74.2 tok/s (prompt: 48)
  [Math] 64 tokens in 0.77s = 83.1 tok/s (prompt: 29)
  [LongCode] 2048 tokens in 31.41s = 65.2 tok/s (prompt: 37)

New --tf5 build with no extra PRs:

  [Q&A] 256 tokens in 4.58s = 55.8 tok/s (prompt: 23)
  [Code] 512 tokens in 8.60s = 59.5 tok/s (prompt: 30)
  [JSON] 1024 tokens in 13.49s = 75.9 tok/s (prompt: 48)
  [Math] 64 tokens in 1.18s = 54.2 tok/s (prompt: 29)
  [LongCode] 2048 tokens in 26.27s = 77.9 tok/s (prompt: 37)

Looks like we lose in Math and win in LongCode (I ran each of these three times, so it’s consistent).

I will test it in my Claude Code workflow, which is where DFlash normally falls over for me when I push the context to its limits.
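To put numbers on the gap between the two builds, here is a minimal sketch that computes the relative change between two tok/s readings. The `pct_change` helper is hypothetical and not part of the benchmark script; the inputs are the Math and LongCode figures from the two runs above.

```shell
# Relative change between two throughput readings, in percent.
# Hypothetical helper for illustration; not part of the benchmark script.
pct_change() {
  awk -v old="$1" -v new="$2" 'BEGIN { printf "%+.1f\n", (new - old) / old * 100 }'
}

pct_change 83.1 54.2   # Math:     -34.8
pct_change 65.2 77.9   # LongCode: +19.5
```

The other three categories move by only a few percent by the same calculation.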

bernisse · May 26, 2026, 6:10pm

This was my finding too. I rebuilt from source and found DFlash working well without the PR, which is when I checked the PR status and incorrectly read that it had been merged. Something changed, because before this DFlash did not work well without the PR pulled in. I am using v6 of the chat template discussed above; I’ll send the link later if you can’t find it. I have been using this build since yesterday and it’s been OK, but I have not pushed it hard. It sure is difficult to keep track of what is going on inside vLLM: one thing gets fixed and another regresses.

njzc · May 26, 2026, 8:31pm

Thanks.

The v6 chat template works and the score is 93.

peter99512 · May 27, 2026, 12:51am

Can you share your recipe and which model you are running?

zoliomarling · May 27, 2026, 1:02am

At the same time, the token generation speed of my Qwen3.6-27B reached 100.4 tokens per second.

Here are my parameters:

MODEL_PATH="/home/ai/Illusionna/Desktop/models/Qwen3.6-27B-INT4-AutoRound-Intel"
MODEL_NAME=$(basename "$MODEL_PATH")
VLLM_IMAGE="dgx-spark-gb10:vllm0.19.1.dev"

docker run \
    --rm -it \
    --gpus all \
    --name ${MODEL_NAME} \
    -v ${MODEL_PATH}:/models/${MODEL_NAME} \
    -p 30000:30000 \
    --ipc=host \
    --entrypoint vllm \
        --health-cmd 'curl -sf http://localhost:30000/health || exit 1' \
        --health-interval 10s \
        --health-timeout 5s \
        --health-retries 60 \
        --health-start-period 900s \
    ${VLLM_IMAGE} \
        serve /models/${MODEL_NAME} \
        --served-model-name ${MODEL_NAME} \
        --host 0.0.0.0 \
        --port 30000 \
        --gpu-memory-utilization 0.8 \
        --max-model-len 262144 \
        --max-num-seqs 4 \
        --max-num-batched-tokens 16384 \
        --speculative-config '{"method": "mtp", "num_speculative_tokens": 4}' \
        --reasoning-parser qwen3 \
        --tool-call-parser qwen3_coder \
        --load-format instanttensor \
        --attention-backend flashinfer \
        --kv-cache-dtype fp8 \
        --quantization gptq_marlin \
        --enable-prefix-caching \
        --enable-chunked-prefill \
        --enable-auto-tool-choice \
        --trust-remote-code
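If you start the container in the background, a small wait loop on the host can play the same role as the Docker health check. This is only a sketch: `wait_for_health` is a hypothetical helper, and its defaults (60 attempts, 10s apart, 5s timeout) simply mirror the `--health-*` values above. It assumes `curl` is installed on the host.

```shell
# Poll a health URL until it answers successfully or the attempts run out.
# Hypothetical helper; defaults mirror the --health-* flags in the recipe above.
wait_for_health() {
  url="$1"
  attempts="${2:-60}"   # like --health-retries
  interval="${3:-10}"   # seconds, like --health-interval
  i=0
  while [ "$i" -lt "$attempts" ]; do
    # -s: silent, -f: treat HTTP errors as failure, --max-time: per-try timeout
    if curl -sf --max-time 5 "$url" >/dev/null 2>&1; then
      return 0
    fi
    i=$((i + 1))
    sleep "$interval"
  done
  return 1
}

# Usage (not executed here):
#   wait_for_health http://localhost:30000/health && echo "server is ready"
```

Loading a 27B model can take several minutes, so sending requests before the endpoint answers will just produce connection errors.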

arctic.gus · May 27, 2026, 6:19am

The downside of INT4 AutoRound is slow prompt processing (PP) on the dense 27B. It seems fine on the 35B MoE, though: INT8 AutoRound hits 10k t/s PP on dual Sparks for me, and INT4 goes even higher. NVFP4 / ParoQuant on the 27B can hit 3000+ t/s on a single Spark with token generation (TG) similar to INT4.

azampatti · May 27, 2026, 7:09pm

My current BEST performing recipe is for Qwen3.6-35B-A3B with the latest vLLM (or the DFlash optimized, they trade blows in different tests) + Chat Template 3.6 v9 + Introducing vLLM-Tune — Kernel tuning CLI for vLLM on DGX Spark

I get a solid 93/100 score in Tool Eval Bench, with nice performance and parallelism.

mike136 · May 27, 2026, 8:18pm

What are you running to get this benchmark output?

[Q&A] 256 tokens in 4.77s = 53.6 tok/s (prompt: 23)
[Code] 512 tokens in 8.90s = 57.5 tok/s (prompt: 30)
[JSON] 1024 tokens in 13.79s = 74.2 tok/s (prompt: 48)
[Math] 64 tokens in 0.77s = 83.1 tok/s (prompt: 29)
[LongCode] 2048 tokens in 31.41s = 65.2 tok/s (prompt: 37)

azampatti · May 27, 2026, 8:30pm

I’m running a slightly modified version of @Albond’s benchmark script from the GitHub repo albond/DGX_Spark_Qwen3.5-122B-A10B-AR-INT4 ("Qwen3.5-122B-A10B on DGX Spark: 28.3 → 51 tok/s (+80%)").

I added parallelism and model detection, but the single-user results are exactly the same if you compare with that output :)

mike136 · May 27, 2026, 8:36pm

No, what I meant is: which benchmark is producing that output? I am using tool-eval-bench and have not seen that kind of output format.

azampatti · May 27, 2026, 8:46pm

I understood perfectly. Look at Albond’s repo; you’ll find the bench shell script that I’m using.

peter99512 · May 27, 2026, 10:45pm

I see, so you use fp8?

azampatti · May 27, 2026, 11:37pm

Yes. 99% of the time, yes :)

My exception is when I use albond’s hybrid qwen3.5-122b-a10b.

I’ve been trying PrismaQuant for both 27B and 35B, but the results in my workflow are ‘iffy’, so I always go back to 3.6-35B-A3B FP8. I do move a lot between DFlash and MTP. They trade the crown on my two main workflows, but overall they behave very similarly (MTP wins in long-context work, and DFlash in parallelism).

peter99512 · May 28, 2026, 1:49am

Where did you get Chat Template 3.6 v9? My FP8 score is not good compared to the 27B. I think my chat template is different from yours.

wingsgb89 · May 28, 2026, 2:49am

I am seeing great results during opencode sessions. Yesterday I ran multi-hour coding sessions without any issues for the first time.

I am using the default recipe and the v6 chat template.

Model:  Qwen/Qwen3.6-35B-A3B-FP8
Score:  97 / 100
Rating: ★★★★★ Excellent
Engine:       vLLM 0.20.1rc1.dev172+g3a6bf961c.d20260503
Quantization: FP8
Max context:  262,144 tokens

✅ 14 passed   ⚠️  1 partial   ❌ 0 failed                                                                       
Points: 29/30

Quality:        97/100                                                                                           
Responsiveness: 57/100  (median turn: 2.5s)                                                                      
Deployability:  85/100  (α=0.7)                                                                                  
Weakest: D Restraint & Refusal (83%)

Completed in 104.6s  │  tool-eval-bench v1.8.0

 Token Usage:
Total: 70,841 tokens  │  Efficiency: 0.4 pts/1K tokens
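For anyone reading the report, the headline numbers appear to fit together as follows. This is an inference from the output above, not something taken from the tool-eval-bench documentation: Quality looks like points earned over points possible, and Deployability looks like an α-weighted blend of Quality and Responsiveness. Both formulas and both function names are assumptions.

```shell
# Assumed formulas, inferred from the report above (not from tool-eval-bench docs).

# Points earned and points possible -> score out of 100.
quality() {
  awk -v got="$1" -v max="$2" 'BEGIN { printf "%.0f\n", got / max * 100 }'
}

# Quality, responsiveness, alpha -> alpha-weighted blend of the two.
deployability() {
  awk -v q="$1" -v r="$2" -v a="$3" 'BEGIN { printf "%.0f\n", a * q + (1 - a) * r }'
}

quality 29 30             # 97
deployability 97 57 0.7   # 85
```

If that reading is right, a slower median turn pulls Deployability down even when Quality stays the same.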

jaekil2 · May 28, 2026, 7:09am

Recipe please.

I saw 95 points once, but I’ve never seen a score that high.

arctic.gus · May 28, 2026, 7:16am

That 97 score is with the --short switch, so it covers only 15 tests. For a more accurate score, you would want to run the full 74 tests with --hardmode.

wingsgb89 · May 28, 2026, 8:56am

Correct, with --hardmode the bench score dropped to 92/100. I was mostly highlighting that with the latest updates my sessions are becoming noticeably more reliable. Previously I struggled with wrong or failed tool calls and endless loops.

From hands-on use, I’ve noticed that output quality starts to decline once the session context reaches 70-80k+ tokens.
