Qwen3.6-27B is out!

Qwen3.6-27B is out! I hope for a 122B now.

This looks awesome. I was already impressed by the 3.5 27B model. By the stats, it’s a clean step forward everywhere except a teeny regression in one STEM benchmark (probably too close to call, honestly).

FP8 with MTP on the 3.5 version was running about 12 tok/s for me. Will be interesting to see if the MTP has improved on 27B like it seems to have on 35B-A3B.

A 27B model going up against Opus 4.5. As Jensen already said, the next two decades will be incredible.

I’ve tested 3.6 on both single and dual Sparks; the speed is the same compared to 3.5: TPOP (290, 137) in ms.

I tried the FP8 version on my dual-node cluster. This one will benefit greatly from the typical Intel or cyankiwi quantization treatment:

vllm serve Qwen/Qwen3.6-27B-FP8 \
    --host 0.0.0.0 \
    --port 8080 \
    --gpu-memory-utilization 0.8 \
    --max-model-len 262144 \
    --max-num-batched-tokens 16384 \
    --enable-prefix-caching \
    --enable-chunked-prefill \
    --max-num-seqs 4 \
    --load-format instanttensor \
    --attention-backend flashinfer \
    --dtype auto \
    --kv-cache-dtype fp8 \
    --trust-remote-code \
    --enable-auto-tool-choice \
    --served-model-name Qwen3.6-27B \
    --tool-call-parser qwen3_coder \
    --reasoning-parser qwen3 \
    --override-generation-config "{\"temperature\": 0.6, \"top_p\": 0.95, \"top_k\": 20, \"min_p\": 0.0, \"presence_penalty\": 0.0, \"repetition_penalty\": 1.0}" \
    --default-chat-template-kwargs '{"preserve_thinking": true}' \
    --tensor-parallel-size 2 \
    --distributed-executor-backend ray
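
The escaped JSON in `--override-generation-config` is easy to break when it gets retyped or pasted through a forum. A minimal Python sketch, assuming you launch vLLM from a script: it builds the two JSON-valued flags with `json.dumps` and lets `shlex.join` do the shell quoting. Only flags from the command above are used; `build_vllm_args` is a hypothetical helper name, and the remaining flags are left out for brevity.

```python
import json
import shlex

# Sampling defaults copied from the command above.
generation_config = {
    "temperature": 0.6,
    "top_p": 0.95,
    "top_k": 20,
    "min_p": 0.0,
    "presence_penalty": 0.0,
    "repetition_penalty": 1.0,
}
chat_template_kwargs = {"preserve_thinking": True}


def build_vllm_args(model: str) -> list[str]:
    """Return an argv list; only the two JSON-valued flags are shown."""
    return [
        "vllm", "serve", model,
        "--override-generation-config", json.dumps(generation_config),
        "--default-chat-template-kwargs", json.dumps(chat_template_kwargs),
    ]


args = build_vllm_args("Qwen/Qwen3.6-27B-FP8")
# shlex.join quotes each argument so the line can be pasted into a POSIX shell.
print(shlex.join(args))
```

Passing `args` straight to `subprocess.run` would skip the shell entirely, which sidesteps the quoting question altogether.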

llama-benchy Results

┏━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Test                   ┃   c   ┃     pp t/s ┃     tg t/s ┃   TTFT (ms) ┃  Total (ms) ┃     Tokens ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ pp2048 tg128 @ d0      │  c1   │      3,067 │       14.4 │         761 │       9,551 │   2048+128 │
│ pp2048 tg128 @ d0      │  c2   │      2,007 │       25.8 │       1,584 │      10,904 │   2048+128 │
│ pp2048 tg128 @ d0      │  c4   │      1,036 │       41.2 │       7,560 │      17,632 │   2048+128 │
│ pp2048 tg128 @ d4096   │  c1   │      1,628 │       14.4 │       3,868 │      12,666 │   2048+128 │
│ pp2048 tg128 @ d4096   │  c2   │        920 │       13.7 │       8,619 │      20,701 │   2048+128 │
│ pp2048 tg128 @ d4096   │  c4   │        895 │       16.5 │      19,063 │      32,263 │   2048+128 │
│ pp2048 tg128 @ d8192   │  c1   │      1,590 │       14.3 │       6,535 │      15,384 │   2048+128 │
│ pp2048 tg128 @ d8192   │  c2   │        861 │        9.6 │      15,172 │      28,682 │   2048+128 │
│ pp2048 tg128 @ d8192   │  c4   │        757 │       10.7 │      44,931 │      57,299 │   2048+128 │
└────────────────────────┴───────┴────────────┴────────────┴─────────────┴─────────────┴────────────┘
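
A rough way to read the concurrency rows, assuming the tg t/s that llama-benchy reports at c2 and c4 is the aggregate across all streams (that is my assumption, check the llama-benchy docs): divide it by the c1 rate times the number of streams to see how close decode gets to ideal scaling. The numbers are copied from the table above; `scaling_efficiency` is just an illustrative helper.

```python
# tg t/s from the table above, keyed by concurrency level.
tg_d0 = {1: 14.4, 2: 25.8, 4: 41.2}
tg_d8192 = {1: 14.3, 2: 9.6, 4: 10.7}


def scaling_efficiency(tg_by_concurrency):
    """Aggregate tg t/s divided by (streams x single-stream tg t/s)."""
    base = tg_by_concurrency[1]
    return {c: tg / (c * base) for c, tg in tg_by_concurrency.items()}


for label, table in (("d0", tg_d0), ("d8192", tg_d8192)):
    for c, eff in scaling_efficiency(table).items():
        per_stream = table[c] / c
        print(f"{label} c{c}: {per_stream:.1f} t/s per stream, {eff:.0%} of ideal")
```

Read this way, decode scales reasonably at depth 0 but falls apart once the context is 8K deep, where two or four streams together produce fewer tokens per second than one stream alone.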

tool-eval-bench Results

Category Breakdown
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┓
┃ Category                         ┃     Score     ┃ Bar                              ┃   Earned    ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━┩
│ Tool Selection                   │     100%      │ ████████████████████             │     6/6     │
│ Parameter Precision              │     100%      │ ████████████████████             │     6/6     │
│ Multi-Step Chains                │     100%      │ ████████████████████             │     6/6     │
│ Restraint & Refusal              │     100%      │ ████████████████████             │     6/6     │
│ Error Recovery                   │     100%      │ ████████████████████             │     6/6     │
└──────────────────────────────────┴───────────────┴──────────────────────────────────┴─────────────┘

── 🏆 Benchmark Complete ──

   Model:  Qwen/Qwen3.6-27B-FP8
   Score:  100 / 100
   Rating: ★★★★★ Excellent

   ✅ 15 passed   ⚠️  0 partial   ❌ 0 failed
   Points: 30/30

   Quality:        100/100
   Responsiveness: 17/100  (median turn: 8.7s)
   Deployability:  75/100  (α=0.7)

   Completed in 398.5s

   📊 Token Usage:
   Total: 37,754 tokens  │  Efficiency: 0.8 pts/1K tokens

   ⚡ Throughput:
   Single:  3,067 pp t/s  │  14.4 tg t/s  │  TTFT 761ms
   c2:      2,007 pp t/s  │  25.8 tg t/s
   c4:      1,036 pp t/s  │  41.2 tg t/s

   ── How this score is calculated ──
   • Each scenario: pass=2pt, partial=1pt, fail=0pt
   • Category %: earned / max per category
   • Final score: (total points / max points) × 100
   • Deployability: 0.7×quality + 0.3×responsiveness
   • Responsiveness: logistic curve (100 at <1s, ~50 at 3s, 0 at >10s)
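
The Deployability line can be checked by hand from the formula printed in the panel. A small sketch; the function name `deployability` is mine, the 0.7/0.3 weights and the input scores are the ones shown in the output above.

```python
def deployability(quality: float, responsiveness: float, alpha: float = 0.7) -> float:
    """Weighted blend from the panel: alpha*quality + (1 - alpha)*responsiveness."""
    return alpha * quality + (1 - alpha) * responsiveness


# Quality 100 and Responsiveness 17 are the values reported in this run.
score = deployability(quality=100, responsiveness=17)
print(f"{score:.1f} -> reported as {round(score)}/100")
```

With Quality pinned at 100, the score can only move through Responsiveness, so every point of Responsiveness is worth 0.3 points of Deployability.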

You’ll really want to enable the built-in MTP.

In the end we’ll have VRAM left over.

With a single Spark:

docker run -d \
    --privileged --name qwen3.6-27B-FP8 \
    --gpus all \
    --network host --ipc=host \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    vllm-node \
    vllm serve Qwen/Qwen3.6-27B-FP8 \
    --host 0.0.0.0 \
    --port 8080 \
    --tensor-parallel-size 1 \
    --gpu-memory-utilization 0.75 \
    --max-model-len 32768 \
    --max-num-batched-tokens 16384 \
    --enable-prefix-caching \
    --enable-chunked-prefill \
    --max-num-seqs 4 \
    --load-format auto \
    --attention-backend flashinfer \
    --dtype auto \
    --kv-cache-dtype fp8 \
    --trust-remote-code \
    --enable-auto-tool-choice \
    --served-model-name Qwen3.6-27B-FP8 \
    --tool-call-parser qwen3_coder \
    --reasoning-parser qwen3 \
    --override-generation-config '{"temperature": 0.6, "top_p": 0.95, "top_k": 20, "min_p": 0.0}' \
    --default-chat-template-kwargs '{"preserve_thinking": true}'

tool-eval-bench Results

What is the maximum usable context with a single unit?
At the moment I am running qwen3-next-coder with 256K of context, which uses 106 GB; the 3.6 27B should free up a lot of RAM…

Dual node with MTP. I am still testing this, but it seems each speculative token adds an allgather across the inter-node link. With num_speculative_tokens=2, that’s 2 extra cross-node round trips per decode step on top of the normal allreduce, likely 2-3× the communication overhead, which eats any speedup from speculation. @eugr may have smart ideas on how that could be tackled:

vllm serve Qwen/Qwen3.6-27B-FP8 \
    --host 0.0.0.0 \
    --port 8080 \
    --gpu-memory-utilization 0.8 \
    --max-model-len 262144 \
    --max-num-batched-tokens 16384 \
    --enable-prefix-caching \
    --enable-chunked-prefill \
    --max-num-seqs 4 \
    --load-format instanttensor \
    --attention-backend flashinfer \
    --dtype auto \
    --kv-cache-dtype fp8 \
    --trust-remote-code \
    --enable-auto-tool-choice \
    --served-model-name Qwen3.6-27B \
    --tool-call-parser qwen3_coder \
    --reasoning-parser qwen3 \
    --override-generation-config "{\"temperature\": 0.6, \"top_p\": 0.95, \"top_k\": 20, \"min_p\": 0.0, \"presence_penalty\": 0.0, \"repetition_penalty\": 1.0}" \
    --default-chat-template-kwargs '{"preserve_thinking": true}' \
    --speculative-config.method mtp \
    --speculative-config.num_speculative_tokens 2 \
    --tensor-parallel-size 2 \
    --distributed-executor-backend ray
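
To put the communication worry into rough numbers, here is a back-of-the-envelope model, not a measurement. It assumes each decode step yields one verified token plus the accepted draft tokens, and that the whole step gets slower by a single multiplier (the 2-3× guess above, treating the step as communication-bound). `net_speedup` and the sample values for accepted drafts are hypothetical.

```python
def net_speedup(accepted_drafts: float, step_cost_multiplier: float) -> float:
    """Tokens per step with speculation, divided by the relative step cost."""
    tokens_per_step = 1.0 + accepted_drafts  # one verified token plus accepted drafts
    return tokens_per_step / step_cost_multiplier


# With num_speculative_tokens=2, at most 2 drafts can be accepted per step.
for accepted in (1.0, 1.5, 2.0):
    for multiplier in (2.0, 3.0):
        speedup = net_speedup(accepted, multiplier)
        print(f"accepted={accepted:.1f}  step cost x{multiplier:.0f}  net {speedup:.2f}x")
```

Under these assumptions speculation only pays off while the step-cost multiplier stays below 1 + accepted drafts, so with two speculative tokens the break-even sits at 3× in the best case.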

Results:

🔧 Tool-Call Benchmark
  Server: http://0.0.0.0:8080
  Querying http://0.0.0.0:8080/v1/models … ✓ Qwen/Qwen3.6-27B-FP8 (alias: Qwen3.6-27B)

  ✓ Warm-up complete (3280 ms)

── ⚡ llama-benchy Throughput Benchmark ──
Qwen/Qwen3.6-27B-FP8
pp=[2048]  tg=[128]  depth=[0, 4096, 8192]  concurrency=[1, 2, 4]  runs=3  latency=generation

  ✓ Complete ━━━━━━━━━━ 27/27 0:09:46

  llama-benchy 0.3.5
  Estimated latency: 181.1 ms

llama-benchy Results
┏━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Test                   ┃   c   ┃     pp t/s ┃     tg t/s ┃   TTFT (ms) ┃  Total (ms) ┃     Tokens ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ pp2048 tg128 @ d0      │  c1   │        902 │        7.2 │       2,454 │      20,093 │   2048+128 │
│ pp2048 tg128 @ d0      │  c2   │      1,111 │       12.7 │       3,870 │      22,416 │   2048+128 │
│ pp2048 tg128 @ d0      │  c4   │      1,213 │       20.9 │       7,117 │      29,162 │   2048+128 │
│ pp2048 tg128 @ d4096   │  c1   │      1,364 │        7.8 │       4,688 │      20,977 │   2048+128 │
│ pp2048 tg128 @ d4096   │  c2   │        920 │       11.1 │      12,133 │      31,592 │   2048+128 │
│ pp2048 tg128 @ d4096   │  c4   │        808 │       20.2 │      27,813 │      48,387 │   2048+128 │
│ pp2048 tg128 @ d8192   │  c1   │      1,392 │        7.7 │       7,630 │      24,175 │   2048+128 │
│ pp2048 tg128 @ d8192   │  c2   │        933 │        7.2 │      17,000 │      40,788 │   2048+128 │
│ pp2048 tg128 @ d8192   │  c4   │        681 │        4.7 │      47,587 │      78,185 │   2048+128 │
└────────────────────────┴───────┴────────────┴────────────┴─────────────┴─────────────┴────────────┘

  ℹ Metrics sourced from llama-benchy - see https://github.com/eugr/llama-benchy for methodology.

── 🔮 Speculative Decoding Benchmark ──
Qwen/Qwen3.6-27B-FP8
tg=128  depth=[0, 4096, 8192]  prompts=['filler', 'code', 'structured']  method=auto
Prometheus /metrics acceptance-rate counters are server-wide aggregates. If other models are serving concurrent traffic on this endpoint, per-request acceptance rate measurements will be inaccurate. For clean measurements: use a single-model server with no concurrent load.
  ✓     filler @ d0     17.0 eff t/s  16.9 stream t/s  α=88.0%  τ=1.8
  ✓       code @ d0     19.2 eff t/s  19.1 stream t/s  α=94.3%  τ=1.9
  ✓ structured @ d0     18.1 eff t/s  17.9 stream t/s  α=85.1%  τ=1.7
  ✓     filler @ d4096  11.8 eff t/s  11.7 stream t/s  α=90.2%  τ=1.8
  ✓       code @ d4096  20.4 eff t/s  20.2 stream t/s  α=94.3%  τ=1.9
  ✓ structured @ d4096  18.8 eff t/s  18.7 stream t/s  α=85.1%  τ=1.7
  ✓     filler @ d8192  10.1 eff t/s  10.0 stream t/s  α=85.1%  τ=1.7
  ✓       code @ d8192  20.9 eff t/s  20.8 stream t/s  α=94.3%  τ=1.9
  ✓ structured @ d8192  19.1 eff t/s  19.0 stream t/s  α=85.1%  τ=1.7

Speculative Decoding Results
┏━━━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━┳━━━━━━━┳━━━━━━━┳━━━━━━┳━━━━━━━━━━┓
┃ Prompt     ┃ Depth ┃ Eff t/s ┃   α % ┃ τ len ┃ TTFT ┃ Total ms ┃
┡━━━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━╇━━━━━━━╇━━━━━━━╇━━━━━━╇━━━━━━━━━━┩
│ filler     │     0 │    17.0 │ 88.0% │   1.8 │   10 │    7,530 │
│ code       │     0 │    19.2 │ 94.3% │   1.9 │    7 │    6,661 │
│ structured │     0 │    18.1 │ 85.1% │   1.7 │    8 │    7,084 │
│ filler     │    4K │    11.8 │ 90.2% │   1.8 │   20 │   10,898 │
│ code       │    4K │    20.4 │ 94.3% │   1.9 │    6 │    6,279 │
│ structured │    4K │    18.8 │ 85.1% │   1.7 │    8 │    6,815 │
│ filler     │    8K │    10.1 │ 85.1% │   1.7 │   19 │   12,680 │
│ code       │    8K │    20.9 │ 94.3% │   1.9 │    7 │    6,127 │
│ structured │    8K │    19.1 │ 85.1% │   1.7 │    8 │    6,699 │
└────────────┴───────┴─────────┴───────┴───────┴──────┴──────────┘

  Highest acceptance: code (94.3%)  Lowest: structured (85.1%)

  📄 Report saved to /home/tim/.local/share/uv/tools/tool-eval-bench/lib/python3.12/runs/2026/04/2026-04-22T19-39-05Z_86b657.md
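
A quick sanity check on how α and τ relate in the speculative decoding results above: with num_speculative_tokens=2, τ looks like α × 2, i.e. the mean number of accepted draft tokens per step. That reading is my inference from the numbers, not something the tool's output states. The values below are the depth-0 rows.

```python
NUM_SPECULATIVE_TOKENS = 2  # from --speculative-config.num_speculative_tokens

# (acceptance rate alpha, reported tau) per prompt type at depth 0.
reported = {
    "filler": (0.880, 1.8),
    "code": (0.943, 1.9),
    "structured": (0.851, 1.7),
}

for prompt, (alpha, tau) in reported.items():
    predicted = alpha * NUM_SPECULATIVE_TOKENS
    print(f"{prompt:<10} alpha*k = {predicted:.2f}  reported tau = {tau}")
```

If that reading holds, acceptance is already close to the ceiling of 2 drafts per step, so the drafts themselves are not the weak spot.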


── 🔧 Tool-Call Benchmark ──
Qwen/Qwen3.6-27B-FP8  via vllm @ http://0.0.0.0:8080
15 scenarios

  ● TC-01  Direct Specialist Match         ✅ PASS  2/2  15.8s  ttft=3,097ms t2  Used get_weather with Berlin only.
  ● TC-02  Distractor Resistance           ✅ PASS  2/2  11.4s  ttft=3,943ms t2  Used only get_stock_price for AAPL.
  ● TC-03  Implicit Tool Need              ✅ PASS  2/2  20.0s  ttft=6,678ms t3  Looked up Sarah before sending the email.
  ● TC-04  Unit Handling                   ✅ PASS  2/2   9.0s  ttft=2,993ms t2  Requested Tokyo weather in Fahrenheit explicitly.
  ● TC-05  Date and Time Parsing           ✅ PASS  2/2  36.3s  ttft=13,289ms t3  Parsed next Monday and included the requested meeting details.
  ● TC-06  Multi-Value Extraction          ✅ PASS  2/2  47.9s  ttft=33,088ms t3  Issued separate translate_text calls for both languages.
  ● TC-07  Search → Read → Act             ✅ PASS  2/2  37.5s  ttft=6,606ms t5  Completed the full four-step chain with the right data.
  ● TC-08  Conditional Branching           ✅ PASS  2/2  30.4s  ttft=10,393ms t3  Checked the weather first, then set the rainy-day reminder.
  ● TC-09  Parallel Independence           ✅ PASS  2/2  23.1s  ttft=5,234ms t2  Handled both independent tasks.
  ● TC-10  Trivial Knowledge               ✅ PASS  2/2  10.0s  ttft=7,546ms  Answered directly without tool use.
  ● TC-11  Simple Math                     ✅ PASS  2/2  24.4s  ttft=23,737ms  Did the math directly.
  ● TC-12  Impossible Request              ✅ PASS  2/2  15.9s  ttft=8,466ms  Refused cleanly because no delete-email tool exists.
  ● TC-13  Empty Results                   ✅ PASS  2/2  15.5s  ttft=2,961ms t3  Retried after the empty result and recovered.
  ● TC-14  Malformed Response              ✅ PASS  2/2  11.5s  ttft=2,998ms t2  Acknowledged the stock tool failure and handled it gracefully.
  ● TC-15  Conflicting Information         ✅ PASS  2/2  23.3s  ttft=3,870ms t3  Used the searched population value in the calculator.

Category Breakdown
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┓
┃ Category                         ┃     Score     ┃ Bar                              ┃   Earned    ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━┩
│ Tool Selection                   │     100%      │ ████████████████████             │     6/6     │
│ Parameter Precision              │     100%      │ ████████████████████             │     6/6     │
│ Multi-Step Chains                │     100%      │ ████████████████████             │     6/6     │
│ Restraint & Refusal              │     100%      │ ████████████████████             │     6/6     │
│ Error Recovery                   │     100%      │ ████████████████████             │     6/6     │
└──────────────────────────────────┴───────────────┴──────────────────────────────────┴─────────────┘

── 🏆 Benchmark Complete ──

   Model:  Qwen/Qwen3.6-27B-FP8
   Score:  100 / 100
   Rating: ★★★★★ Excellent

   ✅ 15 passed   ⚠️  0 partial   ❌ 0 failed
   Points: 30/30

   Quality:        100/100
   Responsiveness: 20/100  (median turn: 7.4s)
   Deployability:  76/100  (α=0.7)

   Completed in 332.0s

   📊 Token Usage:
   Total: 40,561 tokens  │  Efficiency: 0.7 pts/1K tokens

   ⚡ Throughput:
   Single:  1,392 pp t/s  │  7.8 tg t/s  │  TTFT 4,688ms
   c2:      1,111 pp t/s  │  12.7 tg t/s
   c4:      1,213 pp t/s  │  20.9 tg t/s

   ── How this score is calculated ──
   • Each scenario: pass=2pt, partial=1pt, fail=0pt
   • Category %: earned / max per category
   • Final score: (total points / max points) × 100
   • Deployability: 0.7×quality + 0.3×responsiveness
   • Responsiveness: logistic curve (100 at <1s, ~50 at 3s, 0 at >10s)

I ran the same subset of AgentBench with the FP8 version, and amazingly it was faster than the MoE (also FP8) version. I can only assume it went in fewer circles or generated fewer errors when calling tools.

(I also can't explain why the FP8 MoE beat the bf16 MoE, but I ran them both multiple times, and each run took the mean of 3 epochs. The results were oddly consistent.)

I'll try to kick off the bf16 version soon.

I think that, although there are improvements over 3.5, the model overall just thinks too much and gets bogged down without ever reaching an actual result.

Testing with cyankiwi/Qwen3.6-27B-AWQ-INT4 I get decent responses from fairly simple prompts, but when I throw it at a real-world complex coding problem, it fails.

I have quite a complex graphics program running, with an obvious bug that needs fixing. Qwen 3.6 attacked the problem, generating tokens at a good speed. But with the endless thinking, it became obvious that the problem was too complex for it to deal with. I gave it a good amount of time to get somewhere, but gave up after 20 minutes or so.

On the other hand, Minimax M2.7 tackled the problem with a decent amount of thinking time and came up with a solution, tested it with Playwright, found an error, and then finished with a working system with the bug resolved.

The Qwen 3.6 models may be getting great benchmark scores, but I'm not seeing this translate into being useful on complex coding problems.

Which coding agent do you use? Claude Code? OpenCode? VS Code CoPilot?

Opencode

I posted the updated results here:

I just tried to run a similar llama-benchy test to the one posted by @serapis, for both Qwen3.6-35B-A3B and -27B.

Running llama-cpp b8672 on two NVLinked RTX 3090s.
Using Unsloth's UD-Q4_K_XL quantizations for both models.

Qwen3.6-27B

llama-benchy --base-url "http://127.0.0.1:8080/v1" --model "Qwen/Qwen3.6-27B" --pp 2048 --tg 128 --depth 0 4096 8192 --concurrency 1 2 4 --latency-mode generation --runs 1
| model | test | t/s (total) | t/s (req) | peak t/s | peak t/s (req) | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |
|---|---|---|---|---|---|---|---|---|
| Qwen/Qwen3.6-27B | pp2048 (c1) | 1023.91 ± 0.00 | 1023.91 ± 0.00 | | | 2194.61 ± 0.00 | 2000.18 ± 0.00 | 2194.68 ± 0.00 |
| Qwen/Qwen3.6-27B | tg128 (c1) | 37.95 ± 0.00 | 37.95 ± 0.00 | 39.00 ± 0.00 | 39.00 ± 0.00 | | | |
| Qwen/Qwen3.6-27B | pp2048 (c2) | 992.41 ± 0.00 | 520.82 ± 0.02 | | | 4128.59 ± 0.15 | 3934.15 ± 0.15 | 4128.64 ± 0.16 |
| Qwen/Qwen3.6-27B | tg128 (c2) | 59.30 ± 0.00 | 29.65 ± 0.00 | 60.00 ± 0.00 | 30.00 ± 0.00 | | | |
| Qwen/Qwen3.6-27B | pp2048 (c4) | 621.70 ± 0.00 | 316.32 ± 158.54 | | | 8843.05 ± 4334.70 | 8648.61 ± 4334.70 | 8843.10 ± 4334.69 |
| Qwen/Qwen3.6-27B | tg128 (c4) | 39.08 ± 0.00 | 29.45 ± 0.10 | 60.00 ± 0.00 | 30.00 ± 0.00 | | | |
| Qwen/Qwen3.6-27B | pp2048 @ d4096 (c1) | 1274.32 ± 0.00 | 1274.32 ± 0.00 | | | 5016.60 ± 0.00 | 4822.17 ± 0.00 | 5016.67 ± 0.00 |
| Qwen/Qwen3.6-27B | tg128 @ d4096 (c1) | 36.94 ± 0.00 | 36.94 ± 0.00 | 38.00 ± 0.00 | 38.00 ± 0.00 | | | |
| Qwen/Qwen3.6-27B | pp2048 @ d4096 (c2) | 1107.23 ± 0.00 | 644.19 ± 80.65 | | | 9885.44 ± 1213.32 | 9691.00 ± 1213.32 | 9885.50 ± 1213.33 |
| Qwen/Qwen3.6-27B | tg128 @ d4096 (c2) | 37.01 ± 0.00 | 23.70 ± 4.92 | 58.00 ± 0.00 | 29.00 ± 0.00 | | | |
| Qwen/Qwen3.6-27B | pp2048 @ d4096 (c4) | 861.25 ± 0.00 | 427.59 ± 204.70 | | | 18563.00 ± 8499.76 | 18368.56 ± 8499.76 | 18563.06 ± 8499.76 |
| Qwen/Qwen3.6-27B | tg128 @ d4096 (c4) | 20.96 ± 0.00 | 18.92 ± 5.19 | 58.00 ± 0.00 | 29.00 ± 0.00 | | | |
| Qwen/Qwen3.6-27B | pp2048 @ d8192 (c1) | 1007.22 ± 0.00 | 1007.22 ± 0.00 | | | 10361.01 ± 0.00 | 10166.57 ± 0.00 | 10361.09 ± 0.00 |
| Qwen/Qwen3.6-27B | tg128 @ d8192 (c1) | 35.31 ± 0.00 | 35.31 ± 0.00 | 37.00 ± 0.00 | 37.00 ± 0.00 | | | |
| Qwen/Qwen3.6-27B | pp2048 @ d8192 (c2) | 882.77 ± 0.00 | 550.09 ± 105.00 | | | 19515.04 ± 3686.18 | 19320.60 ± 3686.18 | 19515.11 ± 3686.18 |
| Qwen/Qwen3.6-27B | tg128 @ d8192 (c2) | 20.11 ± 0.00 | 17.16 ± 6.99 | 56.00 ± 0.00 | 28.00 ± 0.00 | | | |
| Qwen/Qwen3.6-27B | pp2048 @ d8192 (c4) | 769.58 ± 0.00 | 378.33 ± 183.56 | | | 34393.08 ± 15234.32 | 34198.64 ± 15234.32 | 34393.14 ± 15234.32 |
| Qwen/Qwen3.6-27B | tg128 @ d8192 (c4) | 12.01 ± 0.00 | 13.15 ± 7.12 | 56.00 ± 0.00 | 28.00 ± 0.00 | | | |

Qwen3.6-35B-A3B

llama-benchy --base-url "http://127.0.0.1:8080/v1" --model "Qwen/Qwen3.6-35B-A3B"   --pp 2048 --tg 128 --depth 0 4096 8192 --concurrency 1 2 4 --latency-mode generation --runs 1
| model | test | t/s (total) | t/s (req) | peak t/s | peak t/s (req) | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |
|---|---|---|---|---|---|---|---|---|
| Qwen/Qwen3.6-35B-A3B | pp2048 (c1) | 2938.56 ± 0.00 | 2938.56 ± 0.00 | | | 802.33 ± 0.00 | 697.28 ± 0.00 | 802.43 ± 0.00 |
| Qwen/Qwen3.6-35B-A3B | tg128 (c1) | 98.33 ± 0.00 | 98.33 ± 0.00 | 100.00 ± 0.00 | 100.00 ± 0.00 | | | |
| Qwen/Qwen3.6-35B-A3B | pp2048 (c2) | 2589.89 ± 0.00 | 1388.68 ± 1.29 | | | 1580.91 ± 1.73 | 1475.86 ± 1.73 | 1580.97 ± 1.72 |
| Qwen/Qwen3.6-35B-A3B | tg128 (c2) | 127.89 ± 0.00 | 64.02 ± 0.00 | 132.00 ± 0.00 | 66.00 ± 0.00 | | | |
| Qwen/Qwen3.6-35B-A3B | pp2048 (c4) | 1487.73 ± 0.00 | 804.26 ± 424.88 | | | 3639.67 ± 1867.59 | 3534.62 ± 1867.59 | 3639.74 ± 1867.59 |
| Qwen/Qwen3.6-35B-A3B | tg128 (c4) | 88.74 ± 0.00 | 64.05 ± 0.14 | 132.00 ± 0.00 | 65.50 ± 0.50 | | | |
| Qwen/Qwen3.6-35B-A3B | pp2048 @ d4096 (c1) | 2984.48 ± 0.00 | 2984.48 ± 0.00 | | | 2163.70 ± 0.00 | 2058.65 ± 0.00 | 2163.77 ± 0.00 |
| Qwen/Qwen3.6-35B-A3B | tg128 @ d4096 (c1) | 95.57 ± 0.00 | 95.57 ± 0.00 | 97.00 ± 0.00 | 97.00 ± 0.00 | | | |
| Qwen/Qwen3.6-35B-A3B | pp2048 @ d4096 (c2) | 2732.11 ± 0.00 | 1556.36 ± 157.31 | | | 4093.76 ± 402.84 | 3988.71 ± 402.84 | 4093.83 ± 402.83 |
| Qwen/Qwen3.6-35B-A3B | tg128 @ d4096 (c2) | 89.30 ± 0.00 | 53.87 ± 8.44 | 128.00 ± 0.00 | 64.00 ± 0.00 | | | |
| Qwen/Qwen3.6-35B-A3B | pp2048 @ d4096 (c4) | 2190.29 ± 0.00 | 1047.01 ± 474.48 | | | 7428.85 ± 3249.46 | 7323.80 ± 3249.46 | 7428.90 ± 3249.46 |
| Qwen/Qwen3.6-35B-A3B | tg128 @ d4096 (c4) | 53.94 ± 0.00 | 45.63 ± 11.13 | 128.00 ± 0.00 | 64.00 ± 0.00 | | | |
| Qwen/Qwen3.6-35B-A3B | pp2048 @ d8192 (c1) | 2878.66 ± 0.00 | 2878.66 ± 0.00 | | | 3662.95 ± 0.00 | 3557.90 ± 0.00 | 3663.02 ± 0.00 |
| Qwen/Qwen3.6-35B-A3B | tg128 @ d8192 (c1) | 95.05 ± 0.00 | 95.05 ± 0.00 | 96.00 ± 0.00 | 96.00 ± 0.00 | | | |
| Qwen/Qwen3.6-35B-A3B | pp2048 @ d8192 (c2) | 2696.19 ± 0.00 | 1657.18 ± 289.89 | | | 6480.25 ± 1115.52 | 6375.20 ± 1115.52 | 6480.32 ± 1115.52 |
| Qwen/Qwen3.6-35B-A3B | tg128 @ d8192 (c2) | 61.23 ± 0.00 | 48.67 ± 17.57 | 134.00 ± 0.00 | 67.00 ± 0.00 | | | |
| Qwen/Qwen3.6-35B-A3B | pp2048 @ d8192 (c4) | 2382.20 ± 0.00 | 1148.05 ± 536.10 | | | 11214.65 ± 4836.95 | 11109.59 ± 4836.95 | 11214.72 ± 4836.95 |
| Qwen/Qwen3.6-35B-A3B | tg128 @ d8192 (c4) | 37.13 ± 0.00 | 37.69 ± 16.92 | 135.00 ± 0.00 | 67.75 ± 0.43 | | | |
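
To make the two tables easier to compare, here is a small Python sketch that computes how much faster the 35B-A3B MoE is than the dense 27B for a single request (c1). The numbers are the `t/s (total)` values copied from the c1 rows above; the dictionary layout and the `speedup` helper are just illustrative names, and with `--runs 1` the ratios are a rough indication rather than a measurement with error bars.

```python
# Single-request (c1) throughput in tokens/s, copied from the two tables above.
# Keys are the context depths passed via --depth.
c1 = {
    "Qwen3.6-27B": {
        "pp": {0: 1023.91, 4096: 1274.32, 8192: 1007.22},
        "tg": {0: 37.95, 4096: 36.94, 8192: 35.31},
    },
    "Qwen3.6-35B-A3B": {
        "pp": {0: 2938.56, 4096: 2984.48, 8192: 2878.66},
        "tg": {0: 98.33, 4096: 95.57, 8192: 95.05},
    },
}


def speedup(test, depth):
    """Ratio of MoE throughput to dense throughput for one test and depth."""
    return c1["Qwen3.6-35B-A3B"][test][depth] / c1["Qwen3.6-27B"][test][depth]


for test in ("pp", "tg"):
    for depth in (0, 4096, 8192):
        print(f"{test} @ d{depth}: {speedup(test, depth):.2f}x")
```

On this setup the MoE comes out roughly 2.3x to 2.9x faster on prompt processing and about 2.6x to 2.7x faster on generation at c1.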

Just saw this post by @iotcoi on X:
Qwen3.6-27B-FP8 + Dflash + DDTree, 256k context, 10 agents ~200 tokens/sec max decode 136t/s average on a single tiny GB10 GPU at 49W power

I will not post the link so that I am not flagged here. He was the first to make DDTree work on Spark but, it seems, is not openly sharing it.

Yep, DDTree looks very promising; it just needs someone with better technical skillz than mine to bolt it into vLLM.

He said he will clean it up and put it here:

It is empty now, but he might upload the code soon. I think most people will share if you ask. Getting more people to test the idea will help everyone progress faster.

Single-node test here. Still need a lot more, but it does work!

── Run 1/2 ──────────────────────────────────────
  [Q&A] 256 tokens in 13.47s = 19.0 tok/s (prompt: 23)
  [Code] 512 tokens in 26.60s = 19.2 tok/s (prompt: 30)
  [JSON] 1024 tokens in 47.20s = 21.6 tok/s (prompt: 48)
  [Math] 64 tokens in 3.20s = 20.0 tok/s (prompt: 29)
  [LongCode] 2048 tokens in 99.56s = 20.5 tok/s (prompt: 37)

── Run 2/2 ──────────────────────────────────────
  [Q&A] 256 tokens in 13.57s = 18.8 tok/s (prompt: 23)
  [Code] 512 tokens in 26.74s = 19.1 tok/s (prompt: 30)
  [JSON] 1024 tokens in 47.12s = 21.7 tok/s (prompt: 48)
  [Math] 64 tokens in 3.20s = 20.0 tok/s (prompt: 29)
  [LongCode] 2048 tokens in 99.51s = 20.5 tok/s (prompt: 37)
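
For a single headline number per run, the per-prompt lines can be combined into an aggregate decode rate (total tokens divided by total seconds). A minimal Python sketch, with the token counts and timings copied from the log above; the `runs` structure and the helper name are illustrative only:

```python
# (generated tokens, seconds) per prompt, copied from the log above,
# in the order Q&A, Code, JSON, Math, LongCode.
runs = {
    "Run 1": [(256, 13.47), (512, 26.60), (1024, 47.20), (64, 3.20), (2048, 99.56)],
    "Run 2": [(256, 13.57), (512, 26.74), (1024, 47.12), (64, 3.20), (2048, 99.51)],
}


def aggregate_tok_per_s(samples):
    """Total generated tokens divided by total wall time for one run."""
    total_tokens = sum(tokens for tokens, _ in samples)
    total_seconds = sum(seconds for _, seconds in samples)
    return total_tokens / total_seconds


for name, samples in runs.items():
    print(f"{name}: {aggregate_tok_per_s(samples):.1f} tok/s")
```

Both runs land at about 20.5 tok/s, so the run-to-run variation is small. The aggregate is dominated by the LongCode prompt, which accounts for more than half of the generated tokens.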

Recipe

name: Qwen3.6-27B-FP8
recipe_version: "1"
description: "vLLM serving Qwen3.6-27B in FP8 with MTP speculative decoding, 262K context, tool calling"

model: Qwen/Qwen3.6-27B-FP8

container: vllm-node-tf5

build_args:
  - --tf5

defaults:
  port: 8000
  host: 0.0.0.0
  gpu_memory_utilization: 0.7
  max_model_len: 262144
  max_num_batched_tokens: 16384
  max_num_seqs: 4

env:
  VLLM_MARLIN_USE_ATOMIC_ADD: 1

command: |
  vllm serve Qwen/Qwen3.6-27B-FP8 \
    -O3 \
    --max-model-len {max_model_len} \
    --max-num-seqs {max_num_seqs} \
    --enable-prefix-caching \
    --gpu-memory-utilization {gpu_memory_utilization} \
    --port {port} \
    --host {host} \
    --load-format instanttensor \
    --enable-chunked-prefill \
    --enable-auto-tool-choice \
    --tool-call-parser qwen3_coder \
    --reasoning-parser qwen3 \
    --max-num-batched-tokens {max_num_batched_tokens} \
    --trust-remote-code \
    --default-chat-template-kwargs '{{"preserve_thinking": true}}' \
    --speculative-config '{{"method": "qwen3_next_mtp", "num_speculative_tokens": 3}}' \
    --generation-config auto \
    --override-generation-config '{{"temperature": 0.6, "top_p": 0.95, "top_k": 20, "min_p": 0.0, "presence_penalty": 0.0, "repetition_penalty": 1.0}}'
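
A note on the doubled braces in the recipe: the `command` block mixes `{name}` placeholders with `{{ ... }}` around the JSON arguments. The sketch below assumes the recipe runner fills the template with Python `str.format` semantics, which the recipe itself does not state; under that assumption `{max_model_len}` and friends are replaced by the values under `defaults`, and `{{`/`}}` collapse to literal braces so the JSON reaches vLLM intact. The `render` helper and the shortened template are hypothetical.

```python
# Assumption: placeholders are filled with str.format semantics, so "{name}"
# is substituted and "{{" / "}}" become literal "{" / "}".
defaults = {
    "port": 8000,
    "host": "0.0.0.0",
    "gpu_memory_utilization": 0.7,
    "max_model_len": 262144,
    "max_num_batched_tokens": 16384,
    "max_num_seqs": 4,
}

# Shortened copy of the recipe's command template (not the full flag list).
template = (
    "vllm serve Qwen/Qwen3.6-27B-FP8 "
    "--max-model-len {max_model_len} "
    "--max-num-seqs {max_num_seqs} "
    "--gpu-memory-utilization {gpu_memory_utilization} "
    "--port {port} --host {host} "
    "--default-chat-template-kwargs '{{\"preserve_thinking\": true}}'"
)


def render(template, overrides=None):
    """Fill the template from the defaults, letting overrides win."""
    values = {**defaults, **(overrides or {})}
    return template.format(**values)


command = render(template)
print(command)
print(render(template, {"max_num_seqs": 8, "port": 8080}))
```

If the runner used plain string replacement instead, the doubled braces would be passed through literally and the JSON arguments would fail to parse, so it is worth checking the rendered command before launching.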