Qwen3.6-27B is out!

Qwen3.6-27B is out! I hope for a 122B now.

7 Likes

This looks awesome. I was already impressed by the 3.5 27B model. By the stats, it's a clean step forward everywhere except a teeny regression in one STEM benchmark (probably too close to call, honestly).

FP8 with MTP on the 3.5 version was running at about 12 tok/s for me. It will be interesting to see whether MTP has improved on 27B the way it seems to have on 35B-A3B.

2 Likes

A 27B model holding its own against Opus 4.5: as Jensen already said, the next two decades will be incredible.

I've tested 3.6 on both single and dual Sparks; the speed is the same compared to 3.5: TPOT of 290 ms (single) and 137 ms (dual).

I tried the FP8 version on my dual-node cluster. This one will benefit greatly from the typical Intel or cyankiwi treatment:

vllm serve Qwen/Qwen3.6-27B-FP8 \
    --host 0.0.0.0 \
    --port 8080 \
    --gpu-memory-utilization 0.8 \
    --max-model-len 262144 \
    --max-num-batched-tokens 16384 \
    --enable-prefix-caching \
    --enable-chunked-prefill \
    --max-num-seqs 4 \
    --load-format instanttensor \
    --attention-backend flashinfer \
    --dtype auto \
    --kv-cache-dtype fp8 \
    --trust-remote-code \
    --enable-auto-tool-choice \
    --served-model-name Qwen3.6-27B \
    --tool-call-parser qwen3_coder \
    --reasoning-parser qwen3 \
    --override-generation-config "{\"temperature\": 0.6, \"top_p\": 0.95, \"top_k\": 20, \"min_p\": 0.0, \"presence_penalty\": 0.0, \"repetition_penalty\": 1.0}" \
    --default-chat-template-kwargs '{"preserve_thinking": true}' \
    --tensor-parallel-size 2 \
    --distributed-executor-backend ray
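The escaped JSON in `--override-generation-config` is easy to mangle when copy-pasting between shells, so it's worth sanity-checking it parses before launch. A minimal sketch that validates the two JSON arguments from the command above:

```python
import json

# The two JSON payloads from the command above, after shell unescaping.
gen_config = ('{"temperature": 0.6, "top_p": 0.95, "top_k": 20, '
              '"min_p": 0.0, "presence_penalty": 0.0, "repetition_penalty": 1.0}')
chat_kwargs = '{"preserve_thinking": true}'

cfg = json.loads(gen_config)       # raises ValueError if the quoting got mangled
kwargs = json.loads(chat_kwargs)

assert cfg["temperature"] == 0.6 and cfg["top_k"] == 20
assert kwargs["preserve_thinking"] is True
print("generation-config keys:", sorted(cfg))
```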

llama-benchy Results


โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”ณโ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”ณโ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”ณโ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”ณโ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”ณโ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”ณโ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”“
โ”ƒ Test                                       โ”ƒ     c      โ”ƒ               pp t/s โ”ƒ               tg t/s โ”ƒ              TTFT (ms) โ”ƒ             Total (ms) โ”ƒ                Tokens โ”ƒ
โ”กโ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ•‡โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ•‡โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ•‡โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ•‡โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ•‡โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ•‡โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”ฉ
โ”‚ pp2048 tg128 @ d0                          โ”‚     c1     โ”‚                3,067 โ”‚                 14.4 โ”‚                    761 โ”‚                  9,551 โ”‚              2048+128 โ”‚
โ”‚ pp2048 tg128 @ d0                          โ”‚     c2     โ”‚                2,007 โ”‚                 25.8 โ”‚                  1,584 โ”‚                 10,904 โ”‚              2048+128 โ”‚
โ”‚ pp2048 tg128 @ d0                          โ”‚     c4     โ”‚                1,036 โ”‚                 41.2 โ”‚                  7,560 โ”‚                 17,632 โ”‚              2048+128 โ”‚
โ”‚ pp2048 tg128 @ d4096                       โ”‚     c1     โ”‚                1,628 โ”‚                 14.4 โ”‚                  3,868 โ”‚                 12,666 โ”‚              2048+128 โ”‚
โ”‚ pp2048 tg128 @ d4096                       โ”‚     c2     โ”‚                  920 โ”‚                 13.7 โ”‚                  8,619 โ”‚                 20,701 โ”‚              2048+128 โ”‚
โ”‚ pp2048 tg128 @ d4096                       โ”‚     c4     โ”‚                  895 โ”‚                 16.5 โ”‚                 19,063 โ”‚                 32,263 โ”‚              2048+128 โ”‚
โ”‚ pp2048 tg128 @ d8192                       โ”‚     c1     โ”‚                1,590 โ”‚                 14.3 โ”‚                  6,535 โ”‚                 15,384 โ”‚              2048+128 โ”‚
โ”‚ pp2048 tg128 @ d8192                       โ”‚     c2     โ”‚                  861 โ”‚                  9.6 โ”‚                 15,172 โ”‚                 28,682 โ”‚              2048+128 โ”‚
โ”‚ pp2048 tg128 @ d8192                       โ”‚     c4     โ”‚                  757 โ”‚                 10.7 โ”‚                 44,931 โ”‚                 57,299 โ”‚              2048+128 โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
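At c1, TTFT in this table is dominated by prefill, so it should roughly equal (pp + depth) tokens divided by the measured pp rate. A quick sketch checking that against the c1 @ d4096 row (assuming the depth tokens are actually reprocessed rather than served from the prefix cache):

```python
def prefill_ttft_ms(pp_tokens: int, depth: int, pp_tps: float) -> float:
    # Time to push prompt + depth tokens through prefill at the measured rate.
    return (pp_tokens + depth) / pp_tps * 1000.0

# c1 @ d4096 row: measured pp 1,628 t/s, measured TTFT 3,868 ms
print(round(prefill_ttft_ms(2048, 4096, 1628)))  # ~3774, close to the measured 3,868
```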

tool-eval-bench Results

Category Breakdown

| Category | Score | Earned |
|----------|------:|-------:|
| Tool Selection | 100% | 6/6 |
| Parameter Precision | 100% | 6/6 |
| Multi-Step Chains | 100% | 6/6 |
| Restraint & Refusal | 100% | 6/6 |
| Error Recovery | 100% | 6/6 |

🏆 Benchmark Complete

    Model:  Qwen/Qwen3.6-27B-FP8
    Score:  100 / 100
    Rating: ★★★★★ Excellent

    ✅ 15 passed   ⚠️ 0 partial   ❌ 0 failed
    Points: 30/30

    Quality:        100/100
    Responsiveness: 17/100  (median turn: 8.7s)
    Deployability:  75/100  (α=0.7)

    Completed in 398.5s

    📊 Token Usage:
    Total: 37,754 tokens  |  Efficiency: 0.8 pts/1K tokens

    ⚡ Throughput:
    Single:  3,067 pp t/s  |  14.4 tg t/s  |  TTFT 761ms
    c2:      2,007 pp t/s  |  25.8 tg t/s
    c4:      1,036 pp t/s  |  41.2 tg t/s

    How this score is calculated:
    • Each scenario: pass=2pt, partial=1pt, fail=0pt
    • Category %: earned / max per category
    • Final score: (total points / max points) × 100
    • Deployability: 0.7×quality + 0.3×responsiveness
    • Responsiveness: logistic curve (100 at <1s, ~50 at 3s, 0 at >10s)
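The scoring rules in the panel are simple enough to reproduce. A sketch of the formulas as stated; note the logistic steepness `k` is my guess, since the tool only pins down the midpoint (~50 at 3 s) and the endpoints, so the exact responsiveness value won't match the reported 17/100:

```python
import math

def quality(results: list) -> float:
    # Each scenario: pass=2pt, partial=1pt, fail=0pt; final = points/max * 100.
    pts = {"pass": 2, "partial": 1, "fail": 0}
    return sum(pts[r] for r in results) / (2 * len(results)) * 100

def responsiveness(median_turn_s: float, k: float = 2.0) -> float:
    # Logistic curve: ~100 below 1 s, 50 at 3 s, ~0 above 10 s (k is assumed).
    return 100.0 / (1.0 + math.exp(k * (median_turn_s - 3.0)))

def deployability(q: float, r: float, alpha: float = 0.7) -> float:
    # 0.7*quality + 0.3*responsiveness, per the panel.
    return alpha * q + (1 - alpha) * r

q = quality(["pass"] * 15)   # 100.0, as in the run above
r = responsiveness(8.7)      # near zero with this assumed k
print(q, round(r, 2), round(deployability(q, r), 1))
```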
1 Like

You'll really want to enable the built-in MTP.

In the end we'll have VRAM left over.

With one single Spark:

docker run -d \
    --privileged --name qwen3.6-27B-FP8 \
    --gpus all \
    --network host --ipc=host \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    vllm-node \
    vllm serve Qwen/Qwen3.6-27B-FP8 \
    --host 0.0.0.0 \
    --port 8080 \
    --tensor-parallel-size 1 \
    --gpu-memory-utilization 0.75 \
    --max-model-len 32768 \
    --max-num-batched-tokens 16384 \
    --enable-prefix-caching \
    --enable-chunked-prefill \
    --max-num-seqs 4 \
    --load-format auto \
    --attention-backend flashinfer \
    --dtype auto \
    --kv-cache-dtype fp8 \
    --trust-remote-code \
    --enable-auto-tool-choice \
    --served-model-name Qwen3.6-27B-FP8 \
    --tool-call-parser qwen3_coder \
    --reasoning-parser qwen3 \
    --override-generation-config '{"temperature": 0.6, "top_p": 0.95, "top_k": 20, "min_p": 0.0}' \
    --default-chat-template-kwargs '{"preserve_thinking": true}'

tool-eval-bench Results

What is the maximum usable context with a single unit?
At the moment I am using qwen3-next-coder with 256K of context using 106 GB; the 3.6 27B should free up a lot of RAM…

Dual node with MTP. I am still testing this, but it seems each speculative token adds an allgather across the inter-node link. With num_speculative_tokens=2, that's 2 extra cross-node round trips per decode step on top of the normal allreduce, likely 2-3× the communication overhead, eating any speedup from speculation. @eugr may have smart ideas on how that could be tackled:

vllm serve Qwen/Qwen3.6-27B-FP8 \
    --host 0.0.0.0 \
    --port 8080 \
    --gpu-memory-utilization 0.8 \
    --max-model-len 262144 \
    --max-num-batched-tokens 16384 \
    --enable-prefix-caching \
    --enable-chunked-prefill \
    --max-num-seqs 4 \
    --load-format instanttensor \
    --attention-backend flashinfer \
    --dtype auto \
    --kv-cache-dtype fp8 \
    --trust-remote-code \
    --enable-auto-tool-choice \
    --served-model-name Qwen3.6-27B \
    --tool-call-parser qwen3_coder \
    --reasoning-parser qwen3 \
    --override-generation-config "{\"temperature\": 0.6, \"top_p\": 0.95, \"top_k\": 20, \"min_p\": 0.0, \"presence_penalty\": 0.0, \"repetition_penalty\": 1.0}" \
    --default-chat-template-kwargs '{"preserve_thinking": true}' \
    --speculative-config.method mtp \
    --speculative-config.num_speculative_tokens 2 \
    --tensor-parallel-size 2 \
    --distributed-executor-backend ray
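A back-of-envelope model of that trade-off: speculation multiplies tokens per decode step by the expected accepted draft length, while each speculative token adds a cross-node round trip to the step latency. All latency numbers below are placeholders, not measurements:

```python
def expected_tokens_per_step(alpha: float, k: int) -> float:
    # With k draft tokens and per-token acceptance probability alpha,
    # expected emitted tokens per step is 1 + alpha + ... + alpha**k.
    return sum(alpha ** i for i in range(k + 1))

def decode_tps(base_step_ms: float, rtt_ms: float, alpha: float, k: int) -> float:
    # Assumes each speculative token costs one extra inter-node round trip.
    return expected_tokens_per_step(alpha, k) / (base_step_ms + k * rtt_ms) * 1000.0

baseline = decode_tps(60, 0, 0.9, 0)   # no speculation, placeholder 60 ms step
for rtt in (1.0, 10.0, 30.0):          # placeholder link round-trip latencies
    print(rtt, round(decode_tps(60, rtt, 0.9, 2) / baseline, 2))
```

With α around 0.9, in line with the acceptance rates measured in the runs here, the speculative speedup erodes quickly as the per-token round trip grows.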

Results:

🔧 Tool-Call Benchmark
  Server: http://0.0.0.0:8080
  Querying http://0.0.0.0:8080/v1/models … ✓ Qwen/Qwen3.6-27B-FP8 (alias: Qwen3.6-27B)

  ✓ Warm-up complete (3280 ms)

⚡ llama-benchy Throughput Benchmark
Qwen/Qwen3.6-27B-FP8
pp=[2048]  tg=[128]  depth=[0, 4096, 8192]  concurrency=[1, 2, 4]  runs=3  latency=generation

  ✓ Complete 27/27 0:09:46

  llama-benchy 0.3.5
  Estimated latency: 181.1 ms

llama-benchy Results

| Test | c | pp t/s | tg t/s | TTFT (ms) | Total (ms) | Tokens |
|------|---|-------:|-------:|----------:|-----------:|-------:|
| pp2048 tg128 @ d0 | c1 | 902 | 7.2 | 2,454 | 20,093 | 2048+128 |
| pp2048 tg128 @ d0 | c2 | 1,111 | 12.7 | 3,870 | 22,416 | 2048+128 |
| pp2048 tg128 @ d0 | c4 | 1,213 | 20.9 | 7,117 | 29,162 | 2048+128 |
| pp2048 tg128 @ d4096 | c1 | 1,364 | 7.8 | 4,688 | 20,977 | 2048+128 |
| pp2048 tg128 @ d4096 | c2 | 920 | 11.1 | 12,133 | 31,592 | 2048+128 |
| pp2048 tg128 @ d4096 | c4 | 808 | 20.2 | 27,813 | 48,387 | 2048+128 |
| pp2048 tg128 @ d8192 | c1 | 1,392 | 7.7 | 7,630 | 24,175 | 2048+128 |
| pp2048 tg128 @ d8192 | c2 | 933 | 7.2 | 17,000 | 40,788 | 2048+128 |
| pp2048 tg128 @ d8192 | c4 | 681 | 4.7 | 47,587 | 78,185 | 2048+128 |

  ℹ Metrics sourced from llama-benchy; see https://github.com/eugr/llama-benchy for methodology.


🔮 Speculative Decoding Benchmark
Qwen/Qwen3.6-27B-FP8
tg=128  depth=[0, 4096, 8192]  prompts=['filler', 'code', 'structured']  method=auto
Prometheus /metrics acceptance-rate counters are server-wide aggregates. If other models are serving concurrent traffic on this endpoint, per-request acceptance rate measurements will be inaccurate. For clean measurements: use a single-model server with no concurrent load.
  ✓     filler @ d0  17.0 eff t/s  16.9 stream t/s  α=88.0%  τ=1.8
  ✓       code @ d0  19.2 eff t/s  19.1 stream t/s  α=94.3%  τ=1.9
  ✓ structured @ d0  18.1 eff t/s  17.9 stream t/s  α=85.1%  τ=1.7
  ✓     filler @ d4096  11.8 eff t/s  11.7 stream t/s  α=90.2%  τ=1.8
  ✓       code @ d4096  20.4 eff t/s  20.2 stream t/s  α=94.3%  τ=1.9
  ✓ structured @ d4096  18.8 eff t/s  18.7 stream t/s  α=85.1%  τ=1.7
  ✓     filler @ d8192  10.1 eff t/s  10.0 stream t/s  α=85.1%  τ=1.7
  ✓       code @ d8192  20.9 eff t/s  20.8 stream t/s  α=94.3%  τ=1.9
  ✓ structured @ d8192  19.1 eff t/s  19.0 stream t/s  α=85.1%  τ=1.7

Speculative Decoding Results

| Prompt | Depth | Eff t/s | α % | τ len | TTFT | Total ms |
|--------|------:|--------:|----:|------:|-----:|---------:|
| filler | 0 | 17.0 | 88.0% | 1.8 | 10 | 7,530 |
| code | 0 | 19.2 | 94.3% | 1.9 | 7 | 6,661 |
| structured | 0 | 18.1 | 85.1% | 1.7 | 8 | 7,084 |
| filler | 4K | 11.8 | 90.2% | 1.8 | 20 | 10,898 |
| code | 4K | 20.4 | 94.3% | 1.9 | 6 | 6,279 |
| structured | 4K | 18.8 | 85.1% | 1.7 | 8 | 6,815 |
| filler | 8K | 10.1 | 85.1% | 1.7 | 19 | 12,680 |
| code | 8K | 20.9 | 94.3% | 1.9 | 7 | 6,127 |
| structured | 8K | 19.1 | 85.1% | 1.7 | 8 | 6,699 |

  Highest acceptance: code (94.3%)  Lowest: structured (85.1%)

  📄 Report saved to /home/tim/.local/share/uv/tools/tool-eval-bench/lib/python3.12/runs/2026/04/2026-04-22T19-39-05Z_86b657.md
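Given the server-wide-aggregate caveat above, a per-run acceptance rate can still be recovered by diffing /metrics counter snapshots taken before and after the run (valid only with no concurrent traffic). A sketch; the metric names are assumptions modeled on vLLM's spec-decode counters, so check your server's actual /metrics output:

```python
ACCEPTED = "vllm:spec_decode_num_accepted_tokens_total"  # assumed metric name
DRAFTED = "vllm:spec_decode_num_draft_tokens_total"      # assumed metric name

def acceptance_rate(before: dict, after: dict) -> float:
    # Counter deltas isolate this run's tokens from the server-wide totals.
    d_accepted = after[ACCEPTED] - before[ACCEPTED]
    d_drafted = after[DRAFTED] - before[DRAFTED]
    return d_accepted / d_drafted if d_drafted else 0.0

# Made-up snapshot values for illustration:
before = {ACCEPTED: 1000, DRAFTED: 1200}
after = {ACCEPTED: 1850, DRAFTED: 2200}
print(round(acceptance_rate(before, after), 3))  # 0.85
```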


🔧 Tool-Call Benchmark
Qwen/Qwen3.6-27B-FP8  via vllm @ http://0.0.0.0:8080
15 scenarios

  โ— TC-01  Direct Specialist Match         โœ… PASS  2/2  15.8s  ttft=3,097ms t2  Used get_weather
with Berlin only.
  โ— TC-02  Distractor Resistance           โœ… PASS  2/2  11.4s  ttft=3,943ms t2  Used only
get_stock_price for AAPL.
  โ— TC-03  Implicit Tool Need              โœ… PASS  2/2  20.0s  ttft=6,678ms t3  Looked up Sarah
before sending the email.
  โ— TC-04  Unit Handling                   โœ… PASS  2/2   9.0s  ttft=2,993ms t2  Requested Tokyo
weather in Fahrenheit explicitly.
  โ— TC-05  Date and Time Parsing           โœ… PASS  2/2  36.3s  ttft=13,289ms t3  Parsed next Monday
and included the requested meeting details.
  โ— TC-06  Multi-Value Extraction          โœ… PASS  2/2  47.9s  ttft=33,088ms t3  Issued separate
translate_text calls for both languages.
  โ— TC-07  Search โ†’ Read โ†’ Act             โœ… PASS  2/2  37.5s  ttft=6,606ms t5  Completed the full
four-step chain with the right data.
  โ— TC-08  Conditional Branching           โœ… PASS  2/2  30.4s  ttft=10,393ms t3  Checked the weather
first, then set the rainy-day reminder.
  โ— TC-09  Parallel Independence           โœ… PASS  2/2  23.1s  ttft=5,234ms t2  Handled both
independent tasks.
  โ— TC-10  Trivial Knowledge               โœ… PASS  2/2  10.0s  ttft=7,546ms  Answered directly
without tool use.
  โ— TC-11  Simple Math                     โœ… PASS  2/2  24.4s  ttft=23,737ms  Did the math directly.
  โ— TC-12  Impossible Request              โœ… PASS  2/2  15.9s  ttft=8,466ms  Refused cleanly because
no delete-email tool exists.
  โ— TC-13  Empty Results                   โœ… PASS  2/2  15.5s  ttft=2,961ms t3  Retried after the
empty result and recovered.
  โ— TC-14  Malformed Response              โœ… PASS  2/2  11.5s  ttft=2,998ms t2  Acknowledged the
stock tool failure and handled it gracefully.
  โ— TC-15  Conflicting Information         โœ… PASS  2/2  23.3s  ttft=3,870ms t3  Used the searched
population value in the calculator.

Category Breakdown

| Category | Score | Earned |
|----------|------:|-------:|
| Tool Selection | 100% | 6/6 |
| Parameter Precision | 100% | 6/6 |
| Multi-Step Chains | 100% | 6/6 |
| Restraint & Refusal | 100% | 6/6 |
| Error Recovery | 100% | 6/6 |

🏆 Benchmark Complete

    Model:  Qwen/Qwen3.6-27B-FP8
    Score:  100 / 100
    Rating: ★★★★★ Excellent

    ✅ 15 passed   ⚠️ 0 partial   ❌ 0 failed
    Points: 30/30

    Quality:        100/100
    Responsiveness: 20/100  (median turn: 7.4s)
    Deployability:  76/100  (α=0.7)

    Completed in 332.0s

    📊 Token Usage:
    Total: 40,561 tokens  |  Efficiency: 0.7 pts/1K tokens

    ⚡ Throughput:
    Single:  1,392 pp t/s  |  7.8 tg t/s  |  TTFT 4,688ms
    c2:      1,111 pp t/s  |  12.7 tg t/s
    c4:      1,213 pp t/s  |  20.9 tg t/s

    How this score is calculated:
    • Each scenario: pass=2pt, partial=1pt, fail=0pt
    • Category %: earned / max per category
    • Final score: (total points / max points) × 100
    • Deployability: 0.7×quality + 0.3×responsiveness
    • Responsiveness: logistic curve (100 at <1s, ~50 at 3s, 0 at >10s)
1 Like

I ran the same subset of AgentBench with the FP8 version, and amazingly it was faster than the MoE (also FP8) version. I can only assume it went in fewer circles or generated fewer errors calling tools.

(I also can't explain why the FP8 MoE beat the bf16 MoE, but I ran them both multiple times, and each run took the mean of 3 epochs; the results were oddly consistent.)

I'll try to kick off the bf16 version soon.

1 Like

I'm thinking that, although there are improvements over 3.5, overall the model is just thinking too much and getting bogged down without ever reaching an actual result.

Testing with cyankiwi/Qwen3.6-27B-AWQ-INT4, I get decent responses from fairly simple prompts, but when I throw it at a real-world complex coding problem, it fails.

I have quite a complex graphics program running, with an obvious bug that needs fixing. Qwen 3.6 attacked the problem, generating tokens at a good speed, but it became obvious from the endless thinking that the problem was too complex for it to deal with. I gave it a good amount of time to get somewhere, but after 20 minutes or so I gave up.

On the other hand, Minimax M2.7 tackled the problem with a decent amount of thinking time, but came up with a solution, tested it with Playwright, found an error and then finished with a working system with the bug resolved.

The Qwen 3.6 models may be getting great benchmark scores, but I'm not seeing that translate into usefulness on complex coding problems.

1 Like