Qwen/Qwen3.6-35B-A3B (and FP8) has landed

If you have 8 Sparks ;-)

1 Like
  • Single GPU (no distributed inference): if the model fits on a single GPU, distributed inference is probably unnecessary. Run inference on that GPU.
  • Single-node multi-GPU using tensor parallel inference: if the model is too large for a single GPU but fits on a single node with multiple GPUs, use tensor parallelism. For example, set tensor_parallel_size=4 when using a node with 4 GPUs.

The TP value equals the number of GPUs you are using. You don’t have to specify that argument when you are using only one Spark.
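The decision above can be sketched as a small helper. This is a hypothetical heuristic, not part of vLLM: it assumes you know the model's weight footprint and per-GPU memory, and real capacity also depends on KV cache and activation overhead.

```python
def pick_tensor_parallel_size(model_gib: float, gpu_gib: float,
                              gpus_per_node: int, headroom: float = 0.8) -> int:
    """Smallest GPU count whose combined usable memory fits the weights.

    headroom leaves room for KV cache and activations; measure actual
    memory use before relying on this.
    """
    usable = gpu_gib * headroom
    for tp in (1, 2, 4, 8):
        if tp > gpus_per_node:
            break
        if model_gib <= usable * tp:
            return tp
    raise ValueError("does not fit on one node; consider pipeline parallelism")

# e.g. ~35 GiB of FP8 weights on 24 GiB GPUs needs TP > 1
print(pick_tensor_parallel_size(35, 24, gpus_per_node=4))
```

The returned value is what you would pass as `tensor_parallel_size` (or `--tensor-parallel-size` to `vllm serve`); on a single Spark it is 1 and can be omitted.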

If you share the vLLM version, recipe, or command with the arguments you used, and maybe even the vLLM log output, someone here might be able to help you. ;-)

1 Like

I've been doing some test runs serving with MTP enabled, and frankly it does seem to work:

── Run 1/2 ──────────────────────────────────────
  [Q&A] 256 tokens in 4.34s = 58.9 tok/s (prompt: 23)
  [Code] 512 tokens in 8.11s = 63.1 tok/s (prompt: 30)
  [JSON] 1024 tokens in 15.84s = 64.6 tok/s (prompt: 48)
  [Math] 64 tokens in 1.02s = 62.7 tok/s (prompt: 29)
  [LongCode] 2048 tokens in 31.00s = 66.0 tok/s (prompt: 37)

── Run 2/2 ──────────────────────────────────────
  [Q&A] 256 tokens in 4.35s = 58.8 tok/s (prompt: 23)
  [Code] 512 tokens in 8.14s = 62.8 tok/s (prompt: 30)
  [JSON] 1024 tokens in 15.64s = 65.4 tok/s (prompt: 48)
  [Math] 64 tokens in 1.02s = 62.7 tok/s (prompt: 29)
  [LongCode] 2048 tokens in 31.03s = 66.0 tok/s (prompt: 37)

The recipe I’m testing is:

# Recipe: Qwen/Qwen3.6-35B-A3B-FP8
# Qwen/Qwen3.6-35B-A3B model in native FP8 format


recipe_version: "1"
name: Qwen35-35B-A3B
description: vLLM serving Qwen3.6-35B-A3B-FP8

# HuggingFace model to download (optional, for --download-model)
model: Qwen/Qwen3.6-35B-A3B-FP8

solo_only: true

# Container image to use
container: vllm-node

# Mods
mods:
  - mods/fix-qwen3.5-chat-template

# Default settings (can be overridden via CLI)
defaults:
  port: 8000
  host: 0.0.0.0
  tensor_parallel: 1
  gpu_memory_utilization: 0.8
  max_model_len: 262144
  max_num_batched_tokens: 32768

# Environment variables
env:
  VLLM_MARLIN_USE_ATOMIC_ADD: 1

# The vLLM serve command template
command: |
  vllm serve Qwen/Qwen3.6-35B-A3B-FP8 \
    --host {host} \
    --port {port} \
    --max-model-len {max_model_len} \
    --max-num-batched-tokens {max_num_batched_tokens} \
    --gpu-memory-utilization {gpu_memory_utilization} \
    --enable-auto-tool-choice \
    --generation-config auto \
    --tool-call-parser qwen3_coder \
    --reasoning-parser qwen3 \
    --kv-cache-dtype fp8 \
    --load-format fastsafetensors \
    --attention-backend flashinfer \
    --enable-prefix-caching \
    --enable-chunked-prefill \
    --speculative-config '{{"method":"qwen3_next_mtp","num_speculative_tokens":2}}' \
    --override-generation-config '{{"temperature": 0.6, "top_p": 0.95, "top_k": 20, "min_p": 0.0, "presence_penalty": 0.0, "repetition_penalty": 1.0}}'
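
For reference, the `{placeholder}` fields in the `command` template are filled from `defaults` via ordinary string formatting, and the doubled braces around the speculative config are what survive as literal JSON braces. A minimal sketch of that substitution (the recipe runner's actual implementation may differ):

```python
defaults = {"host": "0.0.0.0", "port": 8000}

# Shortened version of the recipe's command template; {{ }} escapes literal braces.
template = (
    "vllm serve Qwen/Qwen3.6-35B-A3B-FP8 --host {host} --port {port} "
    '--speculative-config \'{{"method":"qwen3_next_mtp","num_speculative_tokens":2}}\''
)

command = template.format(**defaults)
print(command)  # the doubled braces collapse to single ones, yielding valid JSON
```

If the braces were not doubled, `str.format` would try to interpret `"method"` as a placeholder name and raise a `KeyError`.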
1 Like

My testing shows that 3.6's tool-call stability matches 3.5 with the fixes applied. Very promising results.

Can you let us know which fixes you applied?

I just applied the recipe from the other thread: the XML parser + the new template.

(Translated by Gemini)

1 Like

Single Spark, vLLM FP8 + MTP-3: concurrency scaling under pressure

Thanks to everyone in this thread, running the stack cosinus and Turrican described (eugr/spark-vllm-docker, vLLM 0.19.1rc1.dev337+g17d87168d, Qwen/Qwen3.6-35B-A3B-FP8).

Config (only interesting flags):

  • max-model-len 262144 --max-num-batched-tokens 16384
  • gpu-memory-utilization 0.7
  • kv-cache-dtype fp8 --load-format fastsafetensors
  • attention-backend flashinfer --enable-prefix-caching
  • speculative-config '{"method":"mtp","num_speculative_tokens":3}'

I tried num_speculative_tokens 2, 3, and 4; 3 is the sweet spot. At 4 the acceptance rate collapses and throughput drops back below baseline. At 3, acceptance length stays ~2.77 across the whole concurrency sweep.
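That acceptance length is consistent with a simple model of speculative decoding: if each draft token is accepted with probability p (roughly independently), the expected tokens emitted per verification step with k draft tokens is (1 − p^(k+1)) / (1 − p), counting the bonus token. A quick sketch, where p is a fitted assumption rather than a measured value:

```python
def expected_tokens_per_step(k: int, p: float) -> float:
    """Expected tokens emitted per verify step with k draft tokens and
    per-token acceptance probability p (truncated geometric + bonus token)."""
    return (1 - p ** (k + 1)) / (1 - p)

# p ~ 0.76 reproduces the observed ~2.77 acceptance length at k=3
print(round(expected_tokens_per_step(3, 0.76), 2))  # → 2.78
```

Under this model, going to k=4 would only help if p held up; the observed collapse suggests draft quality degrades past 3 tokens, which matches the throughput drop below.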

Single-client (5-prompt coding suite, T=0, 512 tok each):

| Config       | avg tok/s | peak tok/s |
|:-------------|----------:|-----------:|
| FP8 baseline |      51.2 |       51.4 |
| FP8 + MTP-2  |      58.6 |       63.0 |
| FP8 + MTP-3  |      63.9 |       67.8 |
| FP8 + MTP-4  |      52.9 |       61.5 |

My 51.2 baseline lines up with cosinus’s 52.7, so same ballpark.

Random dataset, tg128:

| Test                     | Agg out tok/s        | TPOT mean | TTFT mean | Accept len |
|:-------------------------|---------------------:|----------:|----------:|-----------:|
| pp2048 c=1               | 5020 t/s total       |         — |    410 ms |          — |
| tg128 @ d8192 c=1        | 32.6                 |   16.1 ms |   1878 ms |          — |
| tg128 c=2                | 78.7                 |   21.2 ms |    551 ms |       2.78 |
| tg128 c=4                | 106.4                |   27.4 ms |   1157 ms |       2.72 |
| tg128 c=8                | 196.7                |   36.4 ms |    344 ms |       2.73 |
| tg128 c=16               | 286.9                |   49.6 ms |    500 ms |       2.71 |
| tg128 c=32               | 411.6                |   67.9 ms |    633 ms |       2.77 |
| mixed 1024in/256out c=16 | 250 out / 1264 total |   54.8 ms |   1417 ms |       2.77 |

Takeaways (AI generated):

  1. MTP-3 keeps working under load — acceptance rate stays a stable 57–59% from c=1 to c=32. Not just a single-client trick.
  2. Aggregate output scales ~5× from c=2 → c=32 (for 16× more concurrency). Saturation point looks like ~c=32 at P99 TPOT 93 ms.
  3. pp2048 is unchanged at 5020 t/s with MTP-3 on — no prefill penalty.
  4. At 8k context, decode drops ~40% (52 → 33 t/s). Prefix caching recovers most of that on multi-turn chat.
  5. For serving multiple users on one Spark, the realistic mixed workload (1024 in / 256 out @ c=16) gives ~1.26k total tok/s — very usable.
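The ~5× claim in point 2 can be checked directly against the tg128 rows above:

```python
# Aggregate output tok/s per concurrency level, taken from the table above.
agg_tok_s = {2: 78.7, 4: 106.4, 8: 196.7, 16: 286.9, 32: 411.6}

scaling = agg_tok_s[32] / agg_tok_s[2]
efficiency = scaling / (32 / 2)  # fraction of ideal linear scaling
print(f"{scaling:.1f}x throughput for 16x concurrency ({efficiency:.0%} of linear)")
```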

Serapis: your tg128 of 76 on dual Sparks made me curious: is that decode TPOT from vllm bench latency? Aggregate output per client at c=1 caps around 52 tok/s, but if I measure TPOT the same way (128 tok / (e2e latency − TTFT)), I also get ~55–60 tok/s, which is closer to your number.
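For comparing numbers, the TPOT-style decode rate meant here is just output tokens over post-first-token time. A small sketch with made-up latencies (not benchmark data):

```python
def decode_tok_per_s(n_out: int, e2e_s: float, ttft_s: float) -> float:
    """Decode-only rate: output tokens per second of time after the first token.

    Excludes prefill, so it is higher than aggregate output / total time.
    """
    return n_out / (e2e_s - ttft_s)

# hypothetical: 128 output tokens, 4.00 s end to end, 1.88 s TTFT
print(round(decode_tok_per_s(128, 4.00, 1.88), 1))  # → 60.4
```

The two conventions (aggregate tok/s vs decode-only tok/s) can easily differ by 10–20% at short outputs, which would explain the gap.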

2 Likes

I just tested mmangkad/Qwen3.6-35B-A3B-NVFP4 with a few known flag variations:

--trust-remote-code \
--quantization fp4 \
--moe-backend marlin \
--async-scheduling \

Results were ~35 t/s, down from ~52 t/s with the FP8 script above.

With Qwen/Qwen3.6-35B-A3B-FP8 and the following flag (or any variation of the speculative-config above), I averaged ~20–25 t/s:

--speculative-config '{{"method":"mtp","num_speculative_tokens":3}}' \

Could we have nailed it on the 1st try?

| model                    |             test |              t/s |     peak t/s |        ttfr (ms) |     est_ppt (ms) |    e2e_ttft (ms) |
|:-------------------------|-----------------:|-----------------:|-------------:|-----------------:|-----------------:|-----------------:|
| Qwen/Qwen3.6-35B-A3B-FP8 |           pp2048 |  6219.08 ± 85.05 |              |    407.92 ± 4.52 |    329.59 ± 4.52 |    408.02 ± 4.53 |
| Qwen/Qwen3.6-35B-A3B-FP8 |             tg32 |     51.86 ± 0.08 | 53.54 ± 0.08 |                  |                  |                  |
| Qwen/Qwen3.6-35B-A3B-FP8 |   ctx_pp @ d4096 | 5455.55 ± 520.28 |              |   836.76 ± 77.46 |   758.43 ± 77.46 |   836.84 ± 77.46 |
| Qwen/Qwen3.6-35B-A3B-FP8 |   ctx_tg @ d4096 |     51.59 ± 0.23 | 53.26 ± 0.23 |                  |                  |                  |
| Qwen/Qwen3.6-35B-A3B-FP8 |   pp2048 @ d4096 |  2798.01 ± 21.56 |              |    810.33 ± 5.61 |    731.99 ± 5.61 |    810.40 ± 5.62 |
| Qwen/Qwen3.6-35B-A3B-FP8 |     tg32 @ d4096 |     51.87 ± 0.29 | 53.54 ± 0.30 |                  |                  |                  |
| Qwen/Qwen3.6-35B-A3B-FP8 |   ctx_pp @ d8192 |  6381.58 ± 47.97 |              |   1362.26 ± 9.69 |   1283.92 ± 9.69 |   1362.35 ± 9.68 |
| Qwen/Qwen3.6-35B-A3B-FP8 |   ctx_tg @ d8192 |     51.58 ± 0.31 | 53.25 ± 0.32 |                  |                  |                  |
| Qwen/Qwen3.6-35B-A3B-FP8 |   pp2048 @ d8192 |   2645.55 ± 6.98 |              |    852.47 ± 2.04 |    774.14 ± 2.04 |    852.56 ± 2.04 |
| Qwen/Qwen3.6-35B-A3B-FP8 |     tg32 @ d8192 |     51.30 ± 0.06 | 52.96 ± 0.06 |                  |                  |                  |
| Qwen/Qwen3.6-35B-A3B-FP8 |  ctx_pp @ d16384 |  5693.38 ± 19.62 |              |   2956.33 ± 9.89 |   2878.00 ± 9.89 |   2956.39 ± 9.90 |
| Qwen/Qwen3.6-35B-A3B-FP8 |  ctx_tg @ d16384 |     51.07 ± 0.25 | 52.72 ± 0.26 |                  |                  |                  |
| Qwen/Qwen3.6-35B-A3B-FP8 |  pp2048 @ d16384 |   2398.90 ± 8.09 |              |    932.07 ± 2.88 |    853.73 ± 2.88 |    932.16 ± 2.88 |
| Qwen/Qwen3.6-35B-A3B-FP8 |    tg32 @ d16384 |     50.49 ± 0.04 | 52.13 ± 0.04 |                  |                  |                  |
| Qwen/Qwen3.6-35B-A3B-FP8 |  ctx_pp @ d32768 |   4955.89 ± 9.58 |              |  6690.50 ± 12.93 |  6612.16 ± 12.93 |  6690.58 ± 12.94 |
| Qwen/Qwen3.6-35B-A3B-FP8 |  ctx_tg @ d32768 |     50.65 ± 0.11 | 52.29 ± 0.12 |                  |                  |                  |
| Qwen/Qwen3.6-35B-A3B-FP8 |  pp2048 @ d32768 |  2077.41 ± 13.80 |              |   1064.22 ± 6.52 |    985.89 ± 6.52 |   1064.30 ± 6.52 |
| Qwen/Qwen3.6-35B-A3B-FP8 |    tg32 @ d32768 |     49.89 ± 0.08 | 51.50 ± 0.08 |                  |                  |                  |
| Qwen/Qwen3.6-35B-A3B-FP8 |  ctx_pp @ d65535 |   3999.49 ± 2.21 |              |  16464.43 ± 9.06 |  16386.10 ± 9.06 |  16464.51 ± 9.05 |
| Qwen/Qwen3.6-35B-A3B-FP8 |  ctx_tg @ d65535 |     45.99 ± 0.33 | 47.56 ± 0.32 |                  |                  |                  |
| Qwen/Qwen3.6-35B-A3B-FP8 |  pp2048 @ d65535 |  1774.20 ± 13.42 |              |   1232.72 ± 8.70 |   1154.39 ± 8.70 |   1232.80 ± 8.69 |
| Qwen/Qwen3.6-35B-A3B-FP8 |    tg32 @ d65535 |     46.04 ± 0.35 | 47.62 ± 0.33 |                  |                  |                  |
| Qwen/Qwen3.6-35B-A3B-FP8 | ctx_pp @ d100000 |   3289.23 ± 1.78 |              | 30481.02 ± 16.28 | 30402.69 ± 16.28 | 30481.09 ± 16.28 |
| Qwen/Qwen3.6-35B-A3B-FP8 | ctx_tg @ d100000 |     43.67 ± 0.20 | 45.18 ± 0.20 |                  |                  |                  |
| Qwen/Qwen3.6-35B-A3B-FP8 | pp2048 @ d100000 |   1053.47 ± 3.67 |              |   2022.41 ± 6.76 |   1944.07 ± 6.76 |   2022.46 ± 6.78 |
| Qwen/Qwen3.6-35B-A3B-FP8 |   tg32 @ d100000 |     43.47 ± 0.28 | 45.01 ± 0.27 |                  |                  |                  |

It’s not bad at all: Qwen3.6 created these HTML games with working controls. Pretty good!

3 Likes
  • KV cache: 1,530,704 tokens (vs ~400K with FP8 KV cache)

TurboQuant hybrid (PR 39931) is working!

Do you mind posting a llama-benchy run, too?

I get pretty good results without MTP but as soon as I add MTP my token generation drops to 20-22 t/s.

I tried a couple of different variants including just a single Spark and was not able to get MTP to work properly.

@eugr I remember you mentioned challenges with llama-benchy and MTP. Could this be what I'm running into? Do you have a recommendation for how to compare performance between speculative and non-speculative decoding setups?

1 Like

Qwen3.6 DFlash has been published.

Tested on a single Spark with DFlash:

═══ Benchmark ═══

[✓] Model: Qwen/Qwen3.6-35B-A3B-FP8

╔══════════════════════════════════════════════════════╗
║ Benchmark: Qwen3.6-35B-A3B-FP8 — 2026-04-17 15:08    ║
╚══════════════════════════════════════════════════════╝

Warm-up… done

── Sequential (1 request) ──────────────────────────────

Run 1/2:
  [Q&A     ]  256 tokens in  4.01s =  63.8 tok/s
  [Code    ]  512 tokens in  6.15s =  83.2 tok/s
  [JSON    ] 1024 tokens in 10.06s = 101.7 tok/s
  [Math    ]   32 tokens in  0.38s =  84.2 tok/s
  [LongCode] 2048 tokens in 25.16s =  81.3 tok/s

Run 2/2:
  [Q&A     ]  256 tokens in  3.96s =  64.5 tok/s
  [Code    ]  512 tokens in  6.16s =  83.0 tok/s
  [JSON    ] 1024 tokens in 10.09s = 101.4 tok/s
  [Math    ]   32 tokens in  0.38s =  83.7 tok/s
  [LongCode] 2048 tokens in 25.07s =  81.6 tok/s

── Concurrent (4 parallel requests) ───────────────────────────

Sending 4 requests simultaneously, measuring total throughput…
  [req1] 1024 tokens = 49.8 tok/s (end-to-end)
  [req2] 1024 tokens = 49.8 tok/s (end-to-end)
  [req3] 1024 tokens = 46.3 tok/s (end-to-end)
  [req4] 1024 tokens = 46.3 tok/s (end-to-end)

Total: 4096 tokens in 22.14s
Total throughput: 184.9 tok/s (4 requests completed)

1 Like

Gauging throughput with MTP right now comes down to somewhat more trivial benchmarks and vibe checks.

llama-benchy will report only the raw token count, so the number will be lower than without MTP and won't reflect reality.

Measuring throughput with MTP requires a different approach, and it is on eugr’s radar.
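One way to see why the raw number misleads: with speculative decoding, each forward (verification) step can emit several tokens, so a benchmark that counts steps rather than emitted tokens under-reports by roughly the acceptance length. A sketch of the correction, with hypothetical numbers:

```python
def effective_tok_per_s(step_rate: float, acceptance_len: float) -> float:
    """Tokens actually emitted per second = verify steps/s * avg tokens per step."""
    return step_rate * acceptance_len

# e.g. ~23 verify steps/s combined with the ~2.77 acceptance length
# reported earlier in this thread
print(round(effective_tok_per_s(23.0, 2.77), 1))  # → 63.7
```

The more robust measurement is to count tokens client-side from the streaming response over wall-clock time, which is what the per-request tok/s numbers in this thread already do.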

1 Like

This is tested on DGX Spark? A single one?

I’m working on that. It is somewhat trivial for fixed prompts, but much more difficult to implement at varying context lengths - I have a few ideas and will try to work on them next week once I’m done with my current backlog.

2 Likes

Yes, one Spark, solo.

Do you have a recipe for it? I would like to try running it on mine too.

The DFlash model requires submitting an access request before you can download it.

./spark-vllm-docker/launch-cluster.sh -t vllm-node-tf5 \
  --solo \
  --apply-mod ./spark-vllm-docker/mods/fix-qwen3.5-chat-template \
  -d -e VLLM_MARLIN_USE_ATOMIC_ADD=1 \
  -e HF_TOKEN=${HF_TOKEN} \
  exec vllm serve Qwen/Qwen3.6-35B-A3B-FP8 \
    --host 0.0.0.0 \
    --port 8000 \
    --max-model-len 262144 \
    --max-num-batched-tokens 32768 \
    --gpu-memory-utilization 0.6 \
    --enable-auto-tool-choice \
    --tool-call-parser qwen3_xml \
    --load-format fastsafetensors \
    --enable-prefix-caching \
    --chat-template unsloth.jinja \
    --speculative-config '{"method": "dflash", "model": "z-lab/Qwen3.6-35B-A3B-DFlash", "num_speculative_tokens": 15}' \
    --attention-backend flash_attn

Impressive! Can you provide a llama-benchy run so we are all comparing the same thing?

uvx llama-benchy --base-url http://localhost:8000/v1 --model Qwen/Qwen3.6-35B-A3B-FP8 --depth 0 4096 8192 16384 32768 65535 100000 --adapt-prompt --latency-mode generation --enable-prefix-caching  

(even if it's not perfect for MTP testing yet…)

1 Like

Hello, do you also run into a lot of issues with AI for web development? For me, about half the time it gets the tool parameters wrong. And when it doesn't (which isn't often), it writes out complete scripts of several hundred lines, only to realize afterward, “This is too complex, I'll do it differently…” This can happen multiple times in a row… That said, the code quality itself is much better than 3.5 with the same parameters, but it still struggles with logic. I'm using vLLM 0.19.1 dev with a patched reasoning/tool parser because qwen3_xml is still buggy.