Qwen/Qwen3.6-35B-A3B (and FP8) has landed

Here's the revised setup for 2× DGX Spark performance:

```bash
vllm serve Qwen/Qwen3.6-35B-A3B-FP8 \
    --host 0.0.0.0 \
    --port 8080 \
    --gpu-memory-utilization 0.8 \
    --max-model-len 262144 \
    --max-num-batched-tokens 8192 \
    --max-num-seqs 4 \
    --enable-prefix-caching \
    --enable-chunked-prefill \
    --attention-backend flashinfer \
    --load-format instanttensor \
    --trust-remote-code \
    --dtype auto \
    --kv-cache-dtype fp8 \
    --enable-auto-tool-choice \
    --tool-call-parser qwen3_coder \
    --reasoning-parser qwen3 \
    --tensor-parallel-size 2 \
    --distributed-executor-backend ray
```
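
Once it's up, a quick smoke test against the OpenAI-compatible endpoint (standard vLLM `/v1` routes on the host/port above; the prompt is just a placeholder):

```bash
# Minimal sanity check: one short chat completion against the server above.
curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3.6-35B-A3B-FP8",
    "messages": [{"role": "user", "content": "Reply with OK."}],
    "max_tokens": 16
  }'
```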

Benchmarks: 100% successful completions on ToolCall-15 (a minimal tool-call request is sketched below), with the full llama-benchy numbers in the table that follows.
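
Since the server runs with `--enable-auto-tool-choice` and the `qwen3_coder` parser, tool calls go through standard OpenAI `tools` syntax. A minimal sketch (the `get_weather` tool is a made-up illustration, not part of ToolCall-15):

```bash
# Hypothetical get_weather tool, only to show the request shape.
curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3.6-35B-A3B-FP8",
    "messages": [{"role": "user", "content": "What is the weather in Paris?"}],
    "tools": [{
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Get current weather for a city",
        "parameters": {
          "type": "object",
          "properties": {"city": {"type": "string"}},
          "required": ["city"]
        }
      }
    }],
    "tool_choice": "auto"
  }'
```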

| model                    |             test |              t/s |     peak t/s |        ttfr (ms) |     est_ppt (ms) |    e2e_ttft (ms) |
|:-------------------------|-----------------:|-----------------:|-------------:|-----------------:|-----------------:|-----------------:|
| Qwen/Qwen3.6-35B-A3B-FP8 |           pp2048 | 7824.25 ± 162.29 |              |    263.59 ± 5.42 |    261.95 ± 5.42 |    263.65 ± 5.42 |
| Qwen/Qwen3.6-35B-A3B-FP8 |            tg128 |     77.74 ± 0.44 | 78.33 ± 0.47 |                  |                  |                  |
| Qwen/Qwen3.6-35B-A3B-FP8 |   pp2048 @ d4096 |  8496.23 ± 73.66 |              |    724.88 ± 6.36 |    723.24 ± 6.36 |    724.95 ± 6.36 |
| Qwen/Qwen3.6-35B-A3B-FP8 |    tg128 @ d4096 |     76.44 ± 0.09 | 77.00 ± 0.00 |                  |                  |                  |
| Qwen/Qwen3.6-35B-A3B-FP8 |   pp2048 @ d8192 |  8403.24 ± 38.07 |              |   1220.28 ± 5.59 |   1218.64 ± 5.59 |   1220.35 ± 5.59 |
| Qwen/Qwen3.6-35B-A3B-FP8 |    tg128 @ d8192 |     75.76 ± 0.07 | 76.00 ± 0.00 |                  |                  |                  |
| Qwen/Qwen3.6-35B-A3B-FP8 |  pp2048 @ d16384 |  8217.19 ± 12.29 |              |   2244.87 ± 3.36 |   2243.23 ± 3.36 |   2244.93 ± 3.37 |
| Qwen/Qwen3.6-35B-A3B-FP8 |   tg128 @ d16384 |     74.79 ± 0.08 | 75.33 ± 0.47 |                  |                  |                  |
| Qwen/Qwen3.6-35B-A3B-FP8 |  pp2048 @ d32768 |   7433.69 ± 7.82 |              |   4685.37 ± 4.98 |   4683.73 ± 4.98 |   4685.42 ± 4.97 |
| Qwen/Qwen3.6-35B-A3B-FP8 |   tg128 @ d32768 |     73.40 ± 0.07 | 74.00 ± 0.00 |                  |                  |                  |
| Qwen/Qwen3.6-35B-A3B-FP8 |  pp2048 @ d65536 |   6310.26 ± 8.14 |              | 10712.00 ± 13.83 | 10710.35 ± 13.83 | 10712.06 ± 13.84 |
| Qwen/Qwen3.6-35B-A3B-FP8 |   tg128 @ d65536 |     69.90 ± 0.04 | 71.00 ± 0.00 |                  |                  |                  |
| Qwen/Qwen3.6-35B-A3B-FP8 | pp2048 @ d131072 |  4672.69 ± 15.40 |              | 28491.11 ± 93.91 | 28489.47 ± 93.91 | 28491.18 ± 93.92 |
| Qwen/Qwen3.6-35B-A3B-FP8 |  tg128 @ d131072 |     64.28 ± 0.41 | 65.33 ± 0.47 |                  |                  |                  |

llama-benchy (0.3.5)
date: 2026-04-16 17:59:04 | latency mode: api