If you have 8 Sparks ;-)
- Single GPU (no distributed inference): if the model fits on a single GPU, distributed inference is probably unnecessary. Run inference on that GPU.
- Single-node multi-GPU using tensor parallel inference: if the model is too large for a single GPU but fits on a single node with multiple GPUs, use tensor parallelism. For example, set tensor_parallel_size=4 when using a node with 4 GPUs.
The TP value equals the number of GPUs you are using. You don't have to specify that argument when you are using only one Spark.
If you share the vLLM version, recipe, or command with the arguments you used, and maybe even the vLLM log output, someone here might be able to help you. ;-)
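As a back-of-the-envelope companion to the sizing rule above, here is a rough sketch of picking a TP size from a VRAM-fit check. The function name, the 30% overhead factor, and the example figures are all illustrative; this is not a vLLM API, just the arithmetic behind "if it fits on one GPU, use one GPU":

```python
def pick_tp_size(model_params_b: float, bytes_per_param: float,
                 vram_gb_per_gpu: float, gpus_available: int,
                 overhead: float = 1.3) -> int:
    """Pick the smallest power-of-two TP size whose combined VRAM holds
    the weights plus ~30% headroom for KV cache and activations.
    Rough heuristic only; the real fit also depends on max_model_len,
    gpu_memory_utilization, and the attention backend."""
    need_gb = model_params_b * bytes_per_param * overhead  # billions of params * bytes/param ~= GB
    tp = 1
    while tp <= gpus_available:
        if tp * vram_gb_per_gpu >= need_gb:
            return tp
        tp *= 2
    raise ValueError("model does not fit even with all GPUs")

# e.g. a 35B model in FP8 (1 byte/param) on four hypothetical 96 GB GPUs
print(pick_tp_size(35, 1.0, 96, 4))  # -> 1 (fits on a single GPU, no TP needed)
```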
I've been doing some tests serving with MTP enabled, and frankly it does seem to work:
── Run 1/2 ──────────────────────────────────
[Q&A] 256 tokens in 4.34s = 58.9 tok/s (prompt: 23)
[Code] 512 tokens in 8.11s = 63.1 tok/s (prompt: 30)
[JSON] 1024 tokens in 15.84s = 64.6 tok/s (prompt: 48)
[Math] 64 tokens in 1.02s = 62.7 tok/s (prompt: 29)
[LongCode] 2048 tokens in 31.00s = 66.0 tok/s (prompt: 37)
── Run 2/2 ──────────────────────────────────
[Q&A] 256 tokens in 4.35s = 58.8 tok/s (prompt: 23)
[Code] 512 tokens in 8.14s = 62.8 tok/s (prompt: 30)
[JSON] 1024 tokens in 15.64s = 65.4 tok/s (prompt: 48)
[Math] 64 tokens in 1.02s = 62.7 tok/s (prompt: 29)
[LongCode] 2048 tokens in 31.03s = 66.0 tok/s (prompt: 37)
The recipe I'm testing is:
# Recipe: Qwen/Qwen3.6-35B-A3B-FP8
# Qwen/Qwen3.6-35B-A3B model in native FP8 format
recipe_version: "1"
name: Qwen35-35B-A3B
description: vLLM serving Qwen3.6-35B-A3B-FP8
# HuggingFace model to download (optional, for --download-model)
model: Qwen/Qwen3.6-35B-A3B-FP8
solo_only: true
# Container image to use
container: vllm-node
# Mods
mods:
  - mods/fix-qwen3.5-chat-template
# Default settings (can be overridden via CLI)
defaults:
  port: 8000
  host: 0.0.0.0
  tensor_parallel: 1
  gpu_memory_utilization: 0.8
  max_model_len: 262144
  max_num_batched_tokens: 32768
# Environment variables
env:
  VLLM_MARLIN_USE_ATOMIC_ADD: 1
# The vLLM serve command template
command: |
  vllm serve Qwen/Qwen3.6-35B-A3B-FP8 \
    --host {host} \
    --port {port} \
    --max-model-len {max_model_len} \
    --max-num-batched-tokens {max_num_batched_tokens} \
    --gpu-memory-utilization {gpu_memory_utilization} \
    --enable-auto-tool-choice \
    --generation-config auto \
    --tool-call-parser qwen3_coder \
    --reasoning-parser qwen3 \
    --kv-cache-dtype fp8 \
    --load-format fastsafetensors \
    --attention-backend flashinfer \
    --enable-prefix-caching \
    --enable-chunked-prefill \
    --speculative-config '{{"method":"qwen3_next_mtp","num_speculative_tokens":2}}' \
    --override-generation-config '{{"temperature": 0.6, "top_p": 0.95, "top_k": 20, "min_p": 0.0, "presence_penalty": 0.0, "repetition_penalty": 1.0}}'
My testing shows that 3.6 tool-call stability matches 3.5 with the fixes applied. Very promising results.
Can you let us know which fixes you applied?
I just applied the recipe from the other thread: the XML parser + the new template.
(Translated by Gemini)
Single Spark, vLLM FP8 + MTP-3: concurrency scaling under pressure
Thanks to everyone in this thread; I'm running the stack cosinus and Turrican described (eugr/spark-vllm-docker, vLLM 0.19.1rc1.dev337+g17d87168d, Qwen/Qwen3.6-35B-A3B-FP8).
Config (only interesting flags):
- max-model-len 262144 --max-num-batched-tokens 16384
- gpu-memory-utilization 0.7
- kv-cache-dtype fp8 --load-format fastsafetensors
- attention-backend flashinfer --enable-prefix-caching
- speculative-config '{"method":"mtp","num_speculative_tokens":3}'
I tried num_speculative_tokens 2, 3, and 4; 3 is the sweet spot. At 4 the acceptance rate collapses and throughput drops back below baseline. At 3, acceptance length stays ~2.77 across the whole concurrency sweep.
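To put the sweet spot in numbers: comparing observed speedup against the idealized upper bound set by the acceptance length shows how much the draft head's own cost eats into the gain. The helper below is my own illustrative sketch, not part of vLLM; the figures come from the runs above:

```python
def mtp_speedup(baseline_tok_s: float, mtp_tok_s: float,
                accept_len: float) -> tuple[float, float]:
    """Observed speedup vs. the idealized upper bound.
    accept_len = mean tokens emitted per verify step (bonus token
    included). The ideal bound assumes the draft head is free, so
    the observed number is always lower in practice."""
    observed = mtp_tok_s / baseline_tok_s
    ideal = accept_len  # one verify pass emits accept_len tokens on average
    return observed, ideal

# MTP-3 numbers from this post: 51.2 baseline, 63.9 with MTP, accept len ~2.77
obs, ideal = mtp_speedup(51.2, 63.9, 2.77)
print(f"observed {obs:.2f}x vs ideal {ideal:.2f}x")
```

The gap (roughly 1.25x observed against a 2.77x ideal) is consistent with the draft head and verification overhead being a large fraction of each decode step on this hardware.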
Single-client (5-prompt coding suite, T=0, 512 tok each):
| Config       | avg tok/s | peak tok/s |
|--------------|-----------|------------|
| FP8 baseline | 51.2      | 51.4       |
| FP8 + MTP-2  | 58.6      | 63.0       |
| FP8 + MTP-3  | 63.9      | 67.8       |
| FP8 + MTP-4  | 52.9      | 61.5       |
My 51.2 baseline lines up with cosinus's 52.7, so same ballpark.
Random dataset, tg128:
| Test                     | Agg out tok/s        | TPOT mean | TTFT mean | Accept len |
|--------------------------|----------------------|-----------|-----------|------------|
| pp2048 c=1               | 5020 t/s total       | –         | 410 ms    | –          |
| tg128 @ d8192 c=1        | 32.6                 | 16.1 ms   | 1878 ms   | –          |
| tg128 c=2                | 78.7                 | 21.2 ms   | 551 ms    | 2.78       |
| tg128 c=4                | 106.4                | 27.4 ms   | 1157 ms   | 2.72       |
| tg128 c=8                | 196.7                | 36.4 ms   | 344 ms    | 2.73       |
| tg128 c=16               | 286.9                | 49.6 ms   | 500 ms    | 2.71       |
| tg128 c=32               | 411.6                | 67.9 ms   | 633 ms    | 2.77       |
| mixed 1024in/256out c=16 | 250 out / 1264 total | 54.8 ms   | 1417 ms   | 2.77       |
Takeaways (AI generated):
- MTP-3 keeps working under load: acceptance rate stays a stable 57-59% from c=1 to c=32. Not just a single-client trick.
- Aggregate output scales ~5× from c=2 → c=32 (for 16× concurrency). The saturation point looks like ~c=32 at a P99 TPOT of 93 ms.
- pp2048 is unchanged at 5020 t/s with MTP-3 on, so no prefill penalty.
- At 8k context, decode drops ~40% (52 → 33 t/s). Prefix caching recovers most of that on multi-turn chat.
- For serving multiple users on one Spark, the realistic mixed workload (1024 in / 256 out @ c=16) gives ~1.26k total tok/s, which is very usable.
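The scaling claim in the second takeaway falls straight out of the tg128 table. A small sketch (the function is mine; the data is copied from the table above):

```python
def scaling_efficiency(tok_s: dict[int, float]) -> dict[int, float]:
    """Aggregate-throughput scaling relative to the lowest concurrency,
    normalized by the concurrency ratio (1.0 = perfect linear scaling)."""
    base_c = min(tok_s)
    base = tok_s[base_c]
    return {c: (v / base) / (c / base_c) for c, v in tok_s.items()}

# tg128 aggregate output tok/s at each concurrency level, from the table
agg = {2: 78.7, 4: 106.4, 8: 196.7, 16: 286.9, 32: 411.6}
for c, eff in scaling_efficiency(agg).items():
    print(f"c={c:2d}: {eff:.0%} of linear")
```

At c=32 this works out to roughly a third of linear scaling, i.e. the ~5× aggregate gain for 16× more clients quoted above.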
Serapis: your tg128 of 76 on dual Spark made me curious. Is that decode TPOT from vllm bench latency? Aggregate output per client at c=1 caps around 52, but if I measure TPOT the same way (128 tok / (e2el - ttft)), I also get ~55-60 tok/s, which is closer to your number.
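For reference, the TPOT convention in that question can be sketched as follows (illustrative helper of my own, not a vllm bench function; the example latencies are made up):

```python
def decode_tok_s(n_tokens: int, e2el_s: float, ttft_s: float) -> float:
    """Decode-only throughput: generated tokens divided by generation
    time, i.e. end-to-end latency minus time-to-first-token. This is
    the '128 tok / (e2el - ttft)' convention, which excludes prefill
    and therefore reads higher than aggregate output per client."""
    return n_tokens / (e2el_s - ttft_s)

# e.g. 128 tokens, 2.8 s end-to-end, 0.5 s TTFT (hypothetical values)
print(f"{decode_tok_s(128, 2.8, 0.5):.1f} tok/s")
```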
I just tested mmangkad/Qwen3.6-35B-A3B-NVFP4 with a few known flag variations:
--trust-remote-code \
--quantization fp4 \
--moe-backend marlin \
--async-scheduling \
Results were ~35 t/s, down from ~52 t/s with the FP8 script above.
With Qwen/Qwen3.6-35B-A3B-FP8 and this flag (or any variation of the speculative-config above), it averaged ~20-25 t/s:
--speculative-config '{{"method":"mtp","num_speculative_tokens":3}}' \
Could we have nailed it on the 1st try?
| model | test | t/s | peak t/s | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |
|:-------------------------|-----------------:|------------------:|--------------:|------------------:|------------------:|------------------:|
| Qwen/Qwen3.6-35B-A3B-FP8 | pp2048 | 6219.08 ± 85.05 | | 407.92 ± 4.52 | 329.59 ± 4.52 | 408.02 ± 4.53 |
| Qwen/Qwen3.6-35B-A3B-FP8 | tg32 | 51.86 ± 0.08 | 53.54 ± 0.08 | | | |
| Qwen/Qwen3.6-35B-A3B-FP8 | ctx_pp @ d4096 | 5455.55 ± 520.28 | | 836.76 ± 77.46 | 758.43 ± 77.46 | 836.84 ± 77.46 |
| Qwen/Qwen3.6-35B-A3B-FP8 | ctx_tg @ d4096 | 51.59 ± 0.23 | 53.26 ± 0.23 | | | |
| Qwen/Qwen3.6-35B-A3B-FP8 | pp2048 @ d4096 | 2798.01 ± 21.56 | | 810.33 ± 5.61 | 731.99 ± 5.61 | 810.40 ± 5.62 |
| Qwen/Qwen3.6-35B-A3B-FP8 | tg32 @ d4096 | 51.87 ± 0.29 | 53.54 ± 0.30 | | | |
| Qwen/Qwen3.6-35B-A3B-FP8 | ctx_pp @ d8192 | 6381.58 ± 47.97 | | 1362.26 ± 9.69 | 1283.92 ± 9.69 | 1362.35 ± 9.68 |
| Qwen/Qwen3.6-35B-A3B-FP8 | ctx_tg @ d8192 | 51.58 ± 0.31 | 53.25 ± 0.32 | | | |
| Qwen/Qwen3.6-35B-A3B-FP8 | pp2048 @ d8192 | 2645.55 ± 6.98 | | 852.47 ± 2.04 | 774.14 ± 2.04 | 852.56 ± 2.04 |
| Qwen/Qwen3.6-35B-A3B-FP8 | tg32 @ d8192 | 51.30 ± 0.06 | 52.96 ± 0.06 | | | |
| Qwen/Qwen3.6-35B-A3B-FP8 | ctx_pp @ d16384 | 5693.38 ± 19.62 | | 2956.33 ± 9.89 | 2878.00 ± 9.89 | 2956.39 ± 9.90 |
| Qwen/Qwen3.6-35B-A3B-FP8 | ctx_tg @ d16384 | 51.07 ± 0.25 | 52.72 ± 0.26 | | | |
| Qwen/Qwen3.6-35B-A3B-FP8 | pp2048 @ d16384 | 2398.90 ± 8.09 | | 932.07 ± 2.88 | 853.73 ± 2.88 | 932.16 ± 2.88 |
| Qwen/Qwen3.6-35B-A3B-FP8 | tg32 @ d16384 | 50.49 ± 0.04 | 52.13 ± 0.04 | | | |
| Qwen/Qwen3.6-35B-A3B-FP8 | ctx_pp @ d32768 | 4955.89 ± 9.58 | | 6690.50 ± 12.93 | 6612.16 ± 12.93 | 6690.58 ± 12.94 |
| Qwen/Qwen3.6-35B-A3B-FP8 | ctx_tg @ d32768 | 50.65 ± 0.11 | 52.29 ± 0.12 | | | |
| Qwen/Qwen3.6-35B-A3B-FP8 | pp2048 @ d32768 | 2077.41 ± 13.80 | | 1064.22 ± 6.52 | 985.89 ± 6.52 | 1064.30 ± 6.52 |
| Qwen/Qwen3.6-35B-A3B-FP8 | tg32 @ d32768 | 49.89 ± 0.08 | 51.50 ± 0.08 | | | |
| Qwen/Qwen3.6-35B-A3B-FP8 | ctx_pp @ d65535 | 3999.49 ± 2.21 | | 16464.43 ± 9.06 | 16386.10 ± 9.06 | 16464.51 ± 9.05 |
| Qwen/Qwen3.6-35B-A3B-FP8 | ctx_tg @ d65535 | 45.99 ± 0.33 | 47.56 ± 0.32 | | | |
| Qwen/Qwen3.6-35B-A3B-FP8 | pp2048 @ d65535 | 1774.20 ± 13.42 | | 1232.72 ± 8.70 | 1154.39 ± 8.70 | 1232.80 ± 8.69 |
| Qwen/Qwen3.6-35B-A3B-FP8 | tg32 @ d65535 | 46.04 ± 0.35 | 47.62 ± 0.33 | | | |
| Qwen/Qwen3.6-35B-A3B-FP8 | ctx_pp @ d100000 | 3289.23 ± 1.78 | | 30481.02 ± 16.28 | 30402.69 ± 16.28 | 30481.09 ± 16.28 |
| Qwen/Qwen3.6-35B-A3B-FP8 | ctx_tg @ d100000 | 43.67 ± 0.20 | 45.18 ± 0.20 | | | |
| Qwen/Qwen3.6-35B-A3B-FP8 | pp2048 @ d100000 | 1053.47 ± 3.67 | | 2022.41 ± 6.76 | 1944.07 ± 6.76 | 2022.46 ± 6.78 |
| Qwen/Qwen3.6-35B-A3B-FP8 | tg32 @ d100000 | 43.47 ± 0.28 | 45.01 ± 0.27 | | | |
It's not bad at all. Qwen3.6 created these HTML games with working controls, pretty good!
- KV cache: 1,530,704 tokens (vs ~400K with FP8 KV cache)
TurboQuant hybrid (PR 39931) is working!
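The ~4× jump in KV-cache capacity is roughly what you would expect from shrinking the per-element KV footprint, since capacity scales inversely with bytes per element. A rough sketch of that arithmetic; the layer/head/dim numbers below are placeholders for illustration, not the actual Qwen3.6 config:

```python
def kv_cache_tokens(budget_gib: float, n_layers: int, n_kv_heads: int,
                    head_dim: int, bytes_per_elem: float) -> int:
    """Rough KV-cache capacity: each token stores a K and a V vector of
    n_kv_heads * head_dim elements per layer. Placeholder shapes only;
    hybrid-attention models cache far less than this per token."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return int(budget_gib * (1 << 30) / per_token)

# halving bytes/elem (e.g. FP8 -> a ~4-bit hybrid scheme) doubles capacity
print(kv_cache_tokens(20, 48, 8, 128, 1.0))   # FP8 KV cache
print(kv_cache_tokens(20, 48, 8, 128, 0.5))   # ~4-bit KV cache
```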
Do you mind posting a llama-benchy run, too?
I get pretty good results without MTP but as soon as I add MTP my token generation drops to 20-22 t/s.
I tried a couple of different variants including just a single Spark and was not able to get MTP to work properly.
@eugr I remember you shared some challenges with llama-benchy and MTP. Could this be what I ran into? Do you have a recommendation for how to measure performance between speculative and non-speculative decoding setups?
Qwen3.6 DFlash published.
Single Spark tested with DFlash:
─── Benchmark ───
[✓] Model: Qwen/Qwen3.6-35B-A3B-FP8
────────────────────────────────────────────────────────
  Benchmark: Qwen3.6-35B-A3B-FP8 | 2026-04-17 15:08
────────────────────────────────────────────────────────
Warm-up… done
── Sequential (1 request) ─────────────────────────────
Run 1/2:
[Q&A ] 256 tokens in 4.01s = 63.8 tok/s
[Code ] 512 tokens in 6.15s = 83.2 tok/s
[JSON ] 1024 tokens in 10.06s = 101.7 tok/s
[Math ] 32 tokens in 0.38s = 84.2 tok/s
[LongCode ] 2048 tokens in 25.16s = 81.3 tok/s
Run 2/2:
[Q&A ] 256 tokens in 3.96s = 64.5 tok/s
[Code ] 512 tokens in 6.16s = 83.0 tok/s
[JSON ] 1024 tokens in 10.09s = 101.4 tok/s
[Math ] 32 tokens in 0.38s = 83.7 tok/s
[LongCode ] 2048 tokens in 25.07s = 81.6 tok/s
── Concurrent (4 parallel requests) ──────────────────
Sending 4 requests simultaneously, measuring total throughput…
[req1 ] 1024 tokens = 49.8 tok/s (end-to-end)
[req2 ] 1024 tokens = 49.8 tok/s (end-to-end)
[req3 ] 1024 tokens = 46.3 tok/s (end-to-end)
[req4 ] 1024 tokens = 46.3 tok/s (end-to-end)
Total: 4096 tokens in 22.14s
Total throughput: 184.9 tok/s (4 requests completed)
Measuring throughput with MTP right now comes down to fairly simple benchmarks and vibe checks.
llama-benchy reports only the actual raw tokens; that number will be lower than without MTP and won't reflect reality.
Measuring throughput with MTP properly requires a different approach, and it is on eugr's radar.
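One approach along those lines is to trust the server-side token accounting rather than counting chunks on the client, since with speculative decoding the accepted draft tokens are included in the server's completion-token count. A minimal sketch against a canned OpenAI-style response body (the helper name and fake payload are mine, for illustration):

```python
import json

def measure_throughput(response_json: str, wall_seconds: float) -> float:
    """Tokens per second from the server-reported usage block of an
    OpenAI-compatible /v1/chat/completions response. completion_tokens
    counts everything emitted, including accepted speculative tokens."""
    usage = json.loads(response_json)["usage"]
    return usage["completion_tokens"] / wall_seconds

# canned response standing in for a real (non-streaming) server reply
fake = json.dumps({"usage": {"prompt_tokens": 37, "completion_tokens": 2048}})
print(f"{measure_throughput(fake, 31.0):.1f} tok/s")
```

The caveat eugr mentions still applies: for varying context lengths you also need to control prefill time, which this simple end-to-end division does not separate out.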
This is tested on DGX Spark? A single one?
I'm working on that. It is somewhat trivial for fixed prompts, but much more difficult to implement at varying context lengths. I have a few ideas and will try to work on them next week once I'm done with my current backlog.
Yes, one Spark solo.
Do you have a recipe for it? I would like to try running it on mine too.
The DFlash model requires submitting an access request before you can download it.
./spark-vllm-docker/launch-cluster.sh -t vllm-node-tf5 \
--solo \
--apply-mod ./spark-vllm-docker/mods/fix-qwen3.5-chat-template \
-d -e VLLM_MARLIN_USE_ATOMIC_ADD=1 \
-e HF_TOKEN=${HF_TOKEN} \
exec vllm serve Qwen/Qwen3.6-35B-A3B-FP8 \
--host 0.0.0.0 \
--port 8000 \
--max-model-len 262144 \
--max-num-batched-tokens 32768 \
--gpu-memory-utilization 0.6 \
--enable-auto-tool-choice \
--tool-call-parser qwen3_xml \
--load-format fastsafetensors \
--enable-prefix-caching \
--chat-template unsloth.jinja \
--speculative-config '{"method": "dflash", "model": "z-lab/Qwen3.6-35B-A3B-DFlash", "num_speculative_tokens": 15}' \
--attention-backend flash_attn
Impressive! Can you provide a llama-benchy run so we are all comparing the same thing?
uvx llama-benchy --base-url http://localhost:8000/v1 --model Qwen/Qwen3.6-35B-A3B-FP8 --depth 0 4096 8192 16384 32768 65535 100000 --adapt-prompt --latency-mode generation --enable-prefix-caching
(even if it's not perfect for MTP testing yet…)
Hello, do you also run into a lot of issues with AI for web development? For me, about half the time it gets the tool parameters wrong. And when it doesn't (which isn't often), it writes out complete scripts of several hundred lines, only to realize afterward, "This is too complex, I'll do it differently…" This can happen multiple times in a row… That said, the code quality itself is much better than 3.5 with the same parameters, but it still struggles with logic. I'm using vLLM 0.19.1 dev with a patched reasoning/tool parser because qwen3_xml is still buggy.