Here we go: the revised launch command for ~2x DGX Spark performance (tensor-parallel over two nodes via Ray):
```shell
vllm serve Qwen/Qwen3.6-35B-A3B-FP8 \
  --host 0.0.0.0 \
  --port 8080 \
  --gpu-memory-utilization 0.8 \
  --max-model-len 262144 \
  --max-num-batched-tokens 8192 \
  --max-num-seqs 4 \
  --enable-prefix-caching \
  --enable-chunked-prefill \
  --attention-backend flashinfer \
  --load-format instanttensor \
  --trust-remote-code \
  --dtype auto \
  --kv-cache-dtype fp8 \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --reasoning-parser qwen3 \
  --tensor-parallel-size 2 \
  --distributed-executor-backend ray
```
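Once the server is up it exposes the usual OpenAI-compatible API, so tool calling (the `--enable-auto-tool-choice` / `--tool-call-parser qwen3_coder` pair above) can be smoke-tested with a plain chat-completions request. A minimal sketch of such a request body — the `get_weather` tool here is a made-up example, and host/port/model are taken from the command above:

```python
import json

# Chat-completions payload exercising auto tool choice; POST this as JSON to
# http://<spark-host>:8080/v1/chat/completions (Content-Type: application/json).
payload = {
    "model": "Qwen/Qwen3.6-35B-A3B-FP8",
    "messages": [
        {"role": "user", "content": "What's the weather in Berlin right now?"}
    ],
    # Hypothetical tool definition purely for illustration.
    "tools": [{
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Look up current weather for a city",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }],
    "tool_choice": "auto",  # lets the qwen3_coder parser emit tool_calls
}

body = json.dumps(payload).encode()
print(len(body) > 0)
```

If the parser is wired up correctly, the response's first choice should carry a `tool_calls` entry rather than plain content.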
Benchmarks:
100% successful completions on ToolCall-15.
Column key: pp2048 = 2048-token prompt processing, tg128 = 128-token generation, @ dN = with N tokens of prior context depth; t/s = throughput in tokens/s; ttfr = time to first response, est_ppt = estimated prompt-processing time, e2e_ttft = end-to-end time to first token.
| model | test | t/s | peak t/s | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |
|:-------------------------|-----------------:|-----------------:|-------------:|-----------------:|-----------------:|-----------------:|
| Qwen/Qwen3.6-35B-A3B-FP8 | pp2048 | 7824.25 ± 162.29 | | 263.59 ± 5.42 | 261.95 ± 5.42 | 263.65 ± 5.42 |
| Qwen/Qwen3.6-35B-A3B-FP8 | tg128 | 77.74 ± 0.44 | 78.33 ± 0.47 | | | |
| Qwen/Qwen3.6-35B-A3B-FP8 | pp2048 @ d4096 | 8496.23 ± 73.66 | | 724.88 ± 6.36 | 723.24 ± 6.36 | 724.95 ± 6.36 |
| Qwen/Qwen3.6-35B-A3B-FP8 | tg128 @ d4096 | 76.44 ± 0.09 | 77.00 ± 0.00 | | | |
| Qwen/Qwen3.6-35B-A3B-FP8 | pp2048 @ d8192 | 8403.24 ± 38.07 | | 1220.28 ± 5.59 | 1218.64 ± 5.59 | 1220.35 ± 5.59 |
| Qwen/Qwen3.6-35B-A3B-FP8 | tg128 @ d8192 | 75.76 ± 0.07 | 76.00 ± 0.00 | | | |
| Qwen/Qwen3.6-35B-A3B-FP8 | pp2048 @ d16384 | 8217.19 ± 12.29 | | 2244.87 ± 3.36 | 2243.23 ± 3.36 | 2244.93 ± 3.37 |
| Qwen/Qwen3.6-35B-A3B-FP8 | tg128 @ d16384 | 74.79 ± 0.08 | 75.33 ± 0.47 | | | |
| Qwen/Qwen3.6-35B-A3B-FP8 | pp2048 @ d32768 | 7433.69 ± 7.82 | | 4685.37 ± 4.98 | 4683.73 ± 4.98 | 4685.42 ± 4.97 |
| Qwen/Qwen3.6-35B-A3B-FP8 | tg128 @ d32768 | 73.40 ± 0.07 | 74.00 ± 0.00 | | | |
| Qwen/Qwen3.6-35B-A3B-FP8 | pp2048 @ d65536 | 6310.26 ± 8.14 | | 10712.00 ± 13.83 | 10710.35 ± 13.83 | 10712.06 ± 13.84 |
| Qwen/Qwen3.6-35B-A3B-FP8 | tg128 @ d65536 | 69.90 ± 0.04 | 71.00 ± 0.00 | | | |
| Qwen/Qwen3.6-35B-A3B-FP8 | pp2048 @ d131072 | 4672.69 ± 15.40 | | 28491.11 ± 93.91 | 28489.47 ± 93.91 | 28491.18 ± 93.92 |
| Qwen/Qwen3.6-35B-A3B-FP8 | tg128 @ d131072 | 64.28 ± 0.41 | 65.33 ± 0.47 | | | |
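Reading the table: decode throughput degrades gracefully with context depth. A quick sketch of the relative slowdown, using the tg128 mean t/s values from the table above:

```python
# tg128 mean throughput (t/s) by context depth, from the benchmark table.
tg = {0: 77.74, 4096: 76.44, 8192: 75.76, 16384: 74.79,
      32768: 73.40, 65536: 69.90, 131072: 64.28}

base = tg[0]
# Percent slowdown relative to the empty-context baseline.
drop = {d: round(100 * (1 - v / base), 1) for d, v in tg.items()}
print(drop[131072])  # → 17.3
```

So even at the full 128K depth, decode speed only falls about 17% from baseline.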
Measured with llama-benchy 0.3.5 (latency mode: api), 2026-04-16 17:59:04.