If you have 8 Sparks ;-)
- Single GPU (no distributed inference): if the model fits on a single GPU, distributed inference is probably unnecessary. Run inference on that GPU.
- Single-node multi-GPU using tensor parallel inference: if the model is too large for a single GPU but fits on a single node with multiple GPUs, use tensor parallelism. For example, set tensor_parallel_size=4 when using a node with 4 GPUs.
The TP value equals the number of GPUs you are using. You don't have to specify that argument when you are using only one Spark.
If you share the vLLM version, recipe, or command with the arguments you used, and maybe even the vLLM log output, someone here might be able to help you. ;-)
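As a back-of-the-envelope companion to the sizing rule above, here is a rough sketch of picking a TP size from a VRAM-fit check. The function name, the 30% overhead factor, and the example figures are all illustrative; this is not a vLLM API, just the arithmetic behind "if it fits on one GPU, use one GPU":

```python
def pick_tp_size(model_params_b: float, bytes_per_param: float,
                 vram_gb_per_gpu: float, gpus_available: int,
                 overhead: float = 1.3) -> int:
    """Pick the smallest power-of-two TP size whose combined VRAM holds
    the weights plus ~30% headroom for KV cache and activations.
    Rough heuristic only; the real fit also depends on max_model_len,
    gpu_memory_utilization, and the attention backend."""
    need_gb = model_params_b * bytes_per_param * overhead  # billions of params * bytes/param ~= GB
    tp = 1
    while tp <= gpus_available:
        if tp * vram_gb_per_gpu >= need_gb:
            return tp
        tp *= 2
    raise ValueError("model does not fit even with all GPUs")

# e.g. a 35B model in FP8 (1 byte/param) on four hypothetical 96 GB GPUs
print(pick_tp_size(35, 1.0, 96, 4))  # -> 1 (fits on a single GPU, no TP needed)
```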
I've been doing some tests serving with MTP enabled, and frankly it does seem to work:
── Run 1/2 ──────────────────────────────────
[Q&A] 256 tokens in 4.34s = 58.9 tok/s (prompt: 23)
[Code] 512 tokens in 8.11s = 63.1 tok/s (prompt: 30)
[JSON] 1024 tokens in 15.84s = 64.6 tok/s (prompt: 48)
[Math] 64 tokens in 1.02s = 62.7 tok/s (prompt: 29)
[LongCode] 2048 tokens in 31.00s = 66.0 tok/s (prompt: 37)
── Run 2/2 ──────────────────────────────────
[Q&A] 256 tokens in 4.35s = 58.8 tok/s (prompt: 23)
[Code] 512 tokens in 8.14s = 62.8 tok/s (prompt: 30)
[JSON] 1024 tokens in 15.64s = 65.4 tok/s (prompt: 48)
[Math] 64 tokens in 1.02s = 62.7 tok/s (prompt: 29)
[LongCode] 2048 tokens in 31.03s = 66.0 tok/s (prompt: 37)
The recipe I'm testing is:
# Recipe: Qwen/Qwen3.6-35B-A3B-FP8
# Qwen/Qwen3.6-35B-A3B model in native FP8 format
recipe_version: "1"
name: Qwen35-35B-A3B
description: vLLM serving Qwen3.6-35B-A3B-FP8
# HuggingFace model to download (optional, for --download-model)
model: Qwen/Qwen3.6-35B-A3B-FP8
solo_only: true
# Container image to use
container: vllm-node
# Mods
mods:
  - mods/fix-qwen3.5-chat-template
# Default settings (can be overridden via CLI)
defaults:
  port: 8000
  host: 0.0.0.0
  tensor_parallel: 1
  gpu_memory_utilization: 0.8
  max_model_len: 262144
  max_num_batched_tokens: 32768
# Environment variables
env:
  VLLM_MARLIN_USE_ATOMIC_ADD: 1
# The vLLM serve command template
command: |
  vllm serve Qwen/Qwen3.6-35B-A3B-FP8 \
    --host {host} \
    --port {port} \
    --max-model-len {max_model_len} \
    --max-num-batched-tokens {max_num_batched_tokens} \
    --gpu-memory-utilization {gpu_memory_utilization} \
    --enable-auto-tool-choice \
    --generation-config auto \
    --tool-call-parser qwen3_coder \
    --reasoning-parser qwen3 \
    --kv-cache-dtype fp8 \
    --load-format fastsafetensors \
    --attention-backend flashinfer \
    --enable-prefix-caching \
    --enable-chunked-prefill \
    --speculative-config '{{"method":"qwen3_next_mtp","num_speculative_tokens":2}}' \
    --override-generation-config '{{"temperature": 0.6, "top_p": 0.95, "top_k": 20, "min_p": 0.0, "presence_penalty": 0.0, "repetition_penalty": 1.0}}'
My testing shows that 3.6 tool-call stability matches 3.5 with the fixes applied. Very promising results.
Can you let us know which fixes you applied?
I just applied the recipe from the other thread: the XML parser + the new template.
(Translated by Gemini)
Single Spark, vLLM FP8 + MTP-3: concurrency scaling under pressure
Thanks to everyone in this thread; I'm running the stack cosinus and Turrican described (eugr/spark-vllm-docker, vLLM 0.19.1rc1.dev337+g17d87168d, Qwen/Qwen3.6-35B-A3B-FP8).
Config (only interesting flags):
- max-model-len 262144 --max-num-batched-tokens 16384
- gpu-memory-utilization 0.7
- kv-cache-dtype fp8 --load-format fastsafetensors
- attention-backend flashinfer --enable-prefix-caching
- speculative-config '{"method":"mtp","num_speculative_tokens":3}'
I tried num_speculative_tokens 2, 3, and 4; 3 is the sweet spot. At 4 the acceptance rate collapses and throughput drops back below baseline. At 3, acceptance length stays ~2.77 across the whole concurrency sweep.
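To put the sweet spot in numbers: comparing observed speedup against the idealized upper bound set by the acceptance length shows how much the draft head's own cost eats into the gain. The helper below is my own illustrative sketch, not part of vLLM; the figures come from the runs above:

```python
def mtp_speedup(baseline_tok_s: float, mtp_tok_s: float,
                accept_len: float) -> tuple[float, float]:
    """Observed speedup vs. the idealized upper bound.
    accept_len = mean tokens emitted per verify step (bonus token
    included). The ideal bound assumes the draft head is free, so
    the observed number is always lower in practice."""
    observed = mtp_tok_s / baseline_tok_s
    ideal = accept_len  # one verify pass emits accept_len tokens on average
    return observed, ideal

# MTP-3 numbers from this post: 51.2 baseline, 63.9 with MTP, accept len ~2.77
obs, ideal = mtp_speedup(51.2, 63.9, 2.77)
print(f"observed {obs:.2f}x vs ideal {ideal:.2f}x")
```

The gap (roughly 1.25x observed against a 2.77x ideal) is consistent with the draft head and verification overhead being a large fraction of each decode step on this hardware.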
Single-client (5-prompt coding suite, T=0, 512 tok each):
| Config       | avg tok/s | peak tok/s |
|--------------|-----------|------------|
| FP8 baseline | 51.2      | 51.4       |
| FP8 + MTP-2  | 58.6      | 63.0       |
| FP8 + MTP-3  | 63.9      | 67.8       |
| FP8 + MTP-4  | 52.9      | 61.5       |
My 51.2 baseline lines up with cosinus's 52.7, so same ballpark.
Random dataset, tg128:
| Test                     | Agg out tok/s        | TPOT mean | TTFT mean | Accept len |
|--------------------------|----------------------|-----------|-----------|------------|
| pp2048 c=1               | 5020 t/s total       | –         | 410 ms    | –          |
| tg128 @ d8192 c=1        | 32.6                 | 16.1 ms   | 1878 ms   | –          |
| tg128 c=2                | 78.7                 | 21.2 ms   | 551 ms    | 2.78       |
| tg128 c=4                | 106.4                | 27.4 ms   | 1157 ms   | 2.72       |
| tg128 c=8                | 196.7                | 36.4 ms   | 344 ms    | 2.73       |
| tg128 c=16               | 286.9                | 49.6 ms   | 500 ms    | 2.71       |
| tg128 c=32               | 411.6                | 67.9 ms   | 633 ms    | 2.77       |
| mixed 1024in/256out c=16 | 250 out / 1264 total | 54.8 ms   | 1417 ms   | 2.77       |
Takeaways (AI generated):
- MTP-3 keeps working under load: acceptance rate stays a stable 57-59% from c=1 to c=32. Not just a single-client trick.
- Aggregate output scales ~5× from c=2 → c=32 (for 16× concurrency). The saturation point looks like ~c=32 at a P99 TPOT of 93 ms.
- pp2048 is unchanged at 5020 t/s with MTP-3 on, so no prefill penalty.
- At 8k context, decode drops ~40% (52 → 33 t/s). Prefix caching recovers most of that on multi-turn chat.
- For serving multiple users on one Spark, the realistic mixed workload (1024 in / 256 out @ c=16) gives ~1.26k total tok/s, which is very usable.
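The scaling claim in the second takeaway falls straight out of the tg128 table. A small sketch (the function is mine; the data is copied from the table above):

```python
def scaling_efficiency(tok_s: dict[int, float]) -> dict[int, float]:
    """Aggregate-throughput scaling relative to the lowest concurrency,
    normalized by the concurrency ratio (1.0 = perfect linear scaling)."""
    base_c = min(tok_s)
    base = tok_s[base_c]
    return {c: (v / base) / (c / base_c) for c, v in tok_s.items()}

# tg128 aggregate output tok/s at each concurrency level, from the table
agg = {2: 78.7, 4: 106.4, 8: 196.7, 16: 286.9, 32: 411.6}
for c, eff in scaling_efficiency(agg).items():
    print(f"c={c:2d}: {eff:.0%} of linear")
```

At c=32 this works out to roughly a third of linear scaling, i.e. the ~5× aggregate gain for 16× more clients quoted above.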
Serapis: your tg128 of 76 on dual Spark made me curious. Is that decode TPOT from vllm bench latency? Aggregate output per client at c=1 caps around 52, but if I measure TPOT the same way (128 tok / (e2el - ttft)), I also get ~55-60 tok/s, which is closer to your number.
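For reference, the TPOT convention in that question can be sketched as follows (illustrative helper of my own, not a vllm bench function; the example latencies are made up):

```python
def decode_tok_s(n_tokens: int, e2el_s: float, ttft_s: float) -> float:
    """Decode-only throughput: generated tokens divided by generation
    time, i.e. end-to-end latency minus time-to-first-token. This is
    the '128 tok / (e2el - ttft)' convention, which excludes prefill
    and therefore reads higher than aggregate output per client."""
    return n_tokens / (e2el_s - ttft_s)

# e.g. 128 tokens, 2.8 s end-to-end, 0.5 s TTFT (hypothetical values)
print(f"{decode_tok_s(128, 2.8, 0.5):.1f} tok/s")
```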
I just tested mmangkad/Qwen3.6-35B-A3B-NVFP4 with a few known flag variations:
--trust-remote-code \
--quantization fp4 \
--moe-backend marlin \
--async-scheduling \
Results were ~35 t/s, down from ~52 t/s with the FP8 script above.
With Qwen/Qwen3.6-35B-A3B-FP8 and this flag (or any variation of the speculative-config above), it averaged ~20-25 t/s:
--speculative-config '{{"method":"mtp","num_speculative_tokens":3}}' \
Could we have nailed it on the 1st try?
| model | test | t/s | peak t/s | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |
|:-------------------------|-----------------:|------------------:|--------------:|------------------:|------------------:|------------------:|
| Qwen/Qwen3.6-35B-A3B-FP8 | pp2048 | 6219.08 ± 85.05 | | 407.92 ± 4.52 | 329.59 ± 4.52 | 408.02 ± 4.53 |
| Qwen/Qwen3.6-35B-A3B-FP8 | tg32 | 51.86 ± 0.08 | 53.54 ± 0.08 | | | |
| Qwen/Qwen3.6-35B-A3B-FP8 | ctx_pp @ d4096 | 5455.55 ± 520.28 | | 836.76 ± 77.46 | 758.43 ± 77.46 | 836.84 ± 77.46 |
| Qwen/Qwen3.6-35B-A3B-FP8 | ctx_tg @ d4096 | 51.59 ± 0.23 | 53.26 ± 0.23 | | | |
| Qwen/Qwen3.6-35B-A3B-FP8 | pp2048 @ d4096 | 2798.01 ± 21.56 | | 810.33 ± 5.61 | 731.99 ± 5.61 | 810.40 ± 5.62 |
| Qwen/Qwen3.6-35B-A3B-FP8 | tg32 @ d4096 | 51.87 ± 0.29 | 53.54 ± 0.30 | | | |
| Qwen/Qwen3.6-35B-A3B-FP8 | ctx_pp @ d8192 | 6381.58 ± 47.97 | | 1362.26 ± 9.69 | 1283.92 ± 9.69 | 1362.35 ± 9.68 |
| Qwen/Qwen3.6-35B-A3B-FP8 | ctx_tg @ d8192 | 51.58 ± 0.31 | 53.25 ± 0.32 | | | |
| Qwen/Qwen3.6-35B-A3B-FP8 | pp2048 @ d8192 | 2645.55 ± 6.98 | | 852.47 ± 2.04 | 774.14 ± 2.04 | 852.56 ± 2.04 |
| Qwen/Qwen3.6-35B-A3B-FP8 | tg32 @ d8192 | 51.30 ± 0.06 | 52.96 ± 0.06 | | | |
| Qwen/Qwen3.6-35B-A3B-FP8 | ctx_pp @ d16384 | 5693.38 ± 19.62 | | 2956.33 ± 9.89 | 2878.00 ± 9.89 | 2956.39 ± 9.90 |
| Qwen/Qwen3.6-35B-A3B-FP8 | ctx_tg @ d16384 | 51.07 ± 0.25 | 52.72 ± 0.26 | | | |
| Qwen/Qwen3.6-35B-A3B-FP8 | pp2048 @ d16384 | 2398.90 ± 8.09 | | 932.07 ± 2.88 | 853.73 ± 2.88 | 932.16 ± 2.88 |
| Qwen/Qwen3.6-35B-A3B-FP8 | tg32 @ d16384 | 50.49 ± 0.04 | 52.13 ± 0.04 | | | |
| Qwen/Qwen3.6-35B-A3B-FP8 | ctx_pp @ d32768 | 4955.89 ± 9.58 | | 6690.50 ± 12.93 | 6612.16 ± 12.93 | 6690.58 ± 12.94 |
| Qwen/Qwen3.6-35B-A3B-FP8 | ctx_tg @ d32768 | 50.65 ± 0.11 | 52.29 ± 0.12 | | | |
| Qwen/Qwen3.6-35B-A3B-FP8 | pp2048 @ d32768 | 2077.41 ± 13.80 | | 1064.22 ± 6.52 | 985.89 ± 6.52 | 1064.30 ± 6.52 |
| Qwen/Qwen3.6-35B-A3B-FP8 | tg32 @ d32768 | 49.89 ± 0.08 | 51.50 ± 0.08 | | | |
| Qwen/Qwen3.6-35B-A3B-FP8 | ctx_pp @ d65535 | 3999.49 ± 2.21 | | 16464.43 ± 9.06 | 16386.10 ± 9.06 | 16464.51 ± 9.05 |
| Qwen/Qwen3.6-35B-A3B-FP8 | ctx_tg @ d65535 | 45.99 ± 0.33 | 47.56 ± 0.32 | | | |
| Qwen/Qwen3.6-35B-A3B-FP8 | pp2048 @ d65535 | 1774.20 ± 13.42 | | 1232.72 ± 8.70 | 1154.39 ± 8.70 | 1232.80 ± 8.69 |
| Qwen/Qwen3.6-35B-A3B-FP8 | tg32 @ d65535 | 46.04 ± 0.35 | 47.62 ± 0.33 | | | |
| Qwen/Qwen3.6-35B-A3B-FP8 | ctx_pp @ d100000 | 3289.23 ± 1.78 | | 30481.02 ± 16.28 | 30402.69 ± 16.28 | 30481.09 ± 16.28 |
| Qwen/Qwen3.6-35B-A3B-FP8 | ctx_tg @ d100000 | 43.67 ± 0.20 | 45.18 ± 0.20 | | | |
| Qwen/Qwen3.6-35B-A3B-FP8 | pp2048 @ d100000 | 1053.47 ± 3.67 | | 2022.41 ± 6.76 | 1944.07 ± 6.76 | 2022.46 ± 6.78 |
| Qwen/Qwen3.6-35B-A3B-FP8 | tg32 @ d100000 | 43.47 ± 0.28 | 45.01 ± 0.27 | | | |
It's not bad at all. Qwen3.6 created these HTML games with working controls, pretty good!
- KV cache: 1,530,704 tokens (vs ~400K with FP8 KV cache)
TurboQuant hybrid (PR 39931) is working!
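The ~4× jump in KV-cache capacity is roughly what you would expect from shrinking the per-element KV footprint, since capacity scales inversely with bytes per element. A rough sketch of that arithmetic; the layer/head/dim numbers below are placeholders for illustration, not the actual Qwen3.6 config:

```python
def kv_cache_tokens(budget_gib: float, n_layers: int, n_kv_heads: int,
                    head_dim: int, bytes_per_elem: float) -> int:
    """Rough KV-cache capacity: each token stores a K and a V vector of
    n_kv_heads * head_dim elements per layer. Placeholder shapes only;
    hybrid-attention models cache far less than this per token."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return int(budget_gib * (1 << 30) / per_token)

# halving bytes/elem (e.g. FP8 -> a ~4-bit hybrid scheme) doubles capacity
print(kv_cache_tokens(20, 48, 8, 128, 1.0))   # FP8 KV cache
print(kv_cache_tokens(20, 48, 8, 128, 0.5))   # ~4-bit KV cache
```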
Do you mind posting a llama-benchy run, too?
I get pretty good results without MTP but as soon as I add MTP my token generation drops to 20-22 t/s.
I tried a couple of different variants including just a single Spark and was not able to get MTP to work properly.
@eugr I remember you shared some challenges with llama-benchy and MTP. Could this be what I ran into? Do you have a recommendation for how to measure performance between speculative and non-speculative decoding setups?
Qwen3.6 DFlash published.
Single Spark tested with DFlash:
─── Benchmark ───
[✓] Model: Qwen/Qwen3.6-35B-A3B-FP8
────────────────────────────────────────────────────────
  Benchmark: Qwen3.6-35B-A3B-FP8 | 2026-04-17 15:08
────────────────────────────────────────────────────────
Warm-up… done
── Sequential (1 request) ─────────────────────────────
Run 1/2:
[Q&A ] 256 tokens in 4.01s = 63.8 tok/s
[Code ] 512 tokens in 6.15s = 83.2 tok/s
[JSON ] 1024 tokens in 10.06s = 101.7 tok/s
[Math ] 32 tokens in 0.38s = 84.2 tok/s
[LongCode ] 2048 tokens in 25.16s = 81.3 tok/s
Run 2/2:
[Q&A ] 256 tokens in 3.96s = 64.5 tok/s
[Code ] 512 tokens in 6.16s = 83.0 tok/s
[JSON ] 1024 tokens in 10.09s = 101.4 tok/s
[Math ] 32 tokens in 0.38s = 83.7 tok/s
[LongCode ] 2048 tokens in 25.07s = 81.6 tok/s
── Concurrent (4 parallel requests) ──────────────────
Sending 4 requests simultaneously, measuring total throughput…
[req1 ] 1024 tokens = 49.8 tok/s (end-to-end)
[req2 ] 1024 tokens = 49.8 tok/s (end-to-end)
[req3 ] 1024 tokens = 46.3 tok/s (end-to-end)
[req4 ] 1024 tokens = 46.3 tok/s (end-to-end)
Total: 4096 tokens in 22.14s
Total throughput: 184.9 tok/s (4 requests completed)
Measuring throughput with MTP right now comes down to fairly simple benchmarks and vibe checks.
llama-benchy reports only the actual raw tokens; that number will be lower than without MTP and won't reflect reality.
Measuring throughput with MTP properly requires a different approach, and it is on eugr's radar.
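One approach along those lines is to trust the server-side token accounting rather than counting chunks on the client, since with speculative decoding the accepted draft tokens are included in the server's completion-token count. A minimal sketch against a canned OpenAI-style response body (the helper name and fake payload are mine, for illustration):

```python
import json

def measure_throughput(response_json: str, wall_seconds: float) -> float:
    """Tokens per second from the server-reported usage block of an
    OpenAI-compatible /v1/chat/completions response. completion_tokens
    counts everything emitted, including accepted speculative tokens."""
    usage = json.loads(response_json)["usage"]
    return usage["completion_tokens"] / wall_seconds

# canned response standing in for a real (non-streaming) server reply
fake = json.dumps({"usage": {"prompt_tokens": 37, "completion_tokens": 2048}})
print(f"{measure_throughput(fake, 31.0):.1f} tok/s")
```

The caveat eugr mentions still applies: for varying context lengths you also need to control prefill time, which this simple end-to-end division does not separate out.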
This is tested on DGX Spark? A single one?
I'm working on that. It is somewhat trivial for fixed prompts, but much more difficult to implement at varying context lengths. I have a few ideas and will try to work on them next week once I'm done with my current backlog.
Yes, one Spark solo.
Do you have a recipe for it? I would like to try running it on mine too.
The DFlash model requires submitting an access request before you can download it.
./spark-vllm-docker/launch-cluster.sh -t vllm-node-tf5 \
--solo \
--apply-mod ./spark-vllm-docker/mods/fix-qwen3.5-chat-template \
-d -e VLLM_MARLIN_USE_ATOMIC_ADD=1 \
-e HF_TOKEN=${HF_TOKEN} \
exec vllm serve Qwen/Qwen3.6-35B-A3B-FP8 \
--host 0.0.0.0 \
--port 8000 \
--max-model-len 262144 \
--max-num-batched-tokens 32768 \
--gpu-memory-utilization 0.6 \
--enable-auto-tool-choice \
--tool-call-parser qwen3_xml \
--load-format fastsafetensors \
--enable-prefix-caching \
--chat-template unsloth.jinja \
--speculative-config '{"method": "dflash", "model": "z-lab/Qwen3.6-35B-A3B-DFlash", "num_speculative_tokens": 15}' \
--attention-backend flash_attn
Impressive! Can you provide a llama-benchy run so we are all comparing the same thing?
uvx llama-benchy --base-url http://localhost:8000/v1 --model Qwen/Qwen3.6-35B-A3B-FP8 --depth 0 4096 8192 16384 32768 65535 100000 --adapt-prompt --latency-mode generation --enable-prefix-caching
(even if it's not perfect for MTP testing yet…)
Hello, do you also run into a lot of issues with AI for web development? For me, about half the time it gets the tool parameters wrong. And when it doesn't (which isn't often), it writes out complete scripts of several hundred lines, only to realize afterward, "This is too complex, I'll do it differently…" This can happen multiple times in a row… That said, the code quality itself is much better than 3.5 with the same parameters, but it still struggles with logic. I'm using vLLM 0.19.1 dev with a patched reasoning/tool parser because qwen3_xml is still buggy.