Got error:
vllm serve: error: argument --speculative-config/-sc: Value {"method":"dflash","model": "z-lab/Qwen3.6-35B-A3B-DFlash", "num_speculative_tokens":4} cannot be converted to <function loads at 0xed81d62993a0>.
What version of vLLM are you using with DFlash?
I'm running @eugr's community docker, updated with the latest from today.
vllm: 0.19.2rc1.dev213+g9558f4390.d20260426
I had to manually replace every " ", ' ' and – (curly quotes and dashes) every time I copy-paste from this forum.
Do that and it should work.
Could not see the rise in t/s… investigating the differences in configs…
Might want to scroll up and reread; DFlash configs and tests have been covered previously. :-)
Testing this config to get out of infinite thinking loops:
--override-generation-config '{"temperature": 0.6, "top_p": 0.95, "top_k": 20, "presence_penalty": 0.0, "repetition_penalty": 1.0, "min_p": 0.2}' \
Bad news is that min_p is not yet supported by speculative decoding.
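If speculative decoding stays on, one workaround (my suggestion, not something confirmed above) is to drop min_p from the override and keep the rest, with straight quotes so the JSON parses:
--override-generation-config '{"temperature": 0.6, "top_p": 0.95, "top_k": 20, "presence_penalty": 0.0, "repetition_penalty": 1.0}' \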
Hello, we use the latest NVIDIA vLLM container (26.04-py3) with Docker on a single Spark with a few parameters; at the moment it works fine at ~51 tokens/s. Next step is trying MTP 2/3:
--gpu-memory-utilization 0.8
--default-chat-template-kwargs '{"enable_thinking": false}'
--enable-auto-tool-choice
--tool-call-parser qwen3_coder
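For readers following along, those flags slot into a full vllm serve command roughly like this (the model path and port here are assumptions, not taken from the post above):
vllm serve Qwen/Qwen3.6-35B-A3B-FP8 \
  --host 0.0.0.0 --port 8000 \
  --gpu-memory-utilization 0.8 \
  --default-chat-template-kwargs '{"enable_thinking": false}' \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder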
Hey all, looking for some help from the dual-Spark crowd.
Running Qwen3.6-35B-A3B-FP8 on a 2× DGX Spark cluster (CX7 200G direct link, tp=2 over Ray) and consistently seeing ~67 tok/s single-stream decode. Saw @serapis quote 77.74 ± 0.44 tok/s on the same model + topology earlier in this thread, so I'm really hoping someone can help me figure out what I'm missing.
Numbers
| Metric | Mine | post #5 reference |
|---|---|---|
| Prefill (pp2048-class) | 7,920 tok/s | 7,824 ± 162 |
| Decode (tg128, dual-Spark) | 66.8 tok/s | 77.74 ± 0.44 |
| Decode (tg128, tp=1 single) | 53.6 tok/s | 75.1 (post #11) |
Methodology: 1.5K-token prompt + 128 decode tokens, ignore_eos=True, 5 runs, very tight error bars (~0.5 tok/s). Decode rate is inferred by subtracting short-decode time (tg8 vs tg128) to exclude prefill cost. Prefill matches the published number almost exactly; only decode is off.
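Concretely, the decode-rate inference is just this subtraction (the timings below are illustrative placeholders, not my measured values):
# decode tok/s = extra decode tokens / extra wall time between a tg128 and a tg8 run
# t_tg128 and t_tg8 are placeholder end-to-end times in seconds for the same 1.5K prompt
t_tg128=2.55
t_tg8=0.75
awk -v a="$t_tg128" -v b="$t_tg8" 'BEGIN { printf "%.1f tok/s\n", (128 - 8) / (a - b) }'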
Setup
- Hardware: 2× DGX Spark, CX7 200G direct link (RoCE, NCCL_IB_HCA=rocep1s0f1,roceP2p1s0f1)
- Container: eugr/spark-vllm-docker with the latest prebuilt wheel prebuilt-vllm-current from 2026-04-28 (vLLM 0.20.1rc1.dev23+gde3da0b97), flashinfer 0.6.9
- vLLM args (production recipe):
vllm serve Qwen/Qwen3.6-35B-A3B-FP8 \
--host 0.0.0.0 --port 8000 \
--max-model-len 262144 \
--max-num-batched-tokens 16384 \
--gpu-memory-utilization 0.7 \
--enable-auto-tool-choice \
--tool-call-parser qwen3_coder \
--reasoning-parser qwen3 \
--kv-cache-dtype fp8 \
--load-format fastsafetensors \
--attention-backend flashinfer \
--enable-prefix-caching \
--chat-template unsloth.jinja \
-tp 2 \
--distributed-executor-backend ray
What I've already tried (no decode change either way)
- Stripped config: dropped tool-parser, reasoning-parser, prefix-caching, --chat-template. Lowered --max-model-len to 32768, --max-num-batched-tokens to 8192, --load-format=instanttensor. Same 67 tok/s decode.
- vLLM bump: was on 0.19.2rc1.dev213+g9558f4390, rebuilt to 0.20.1rc1.dev23+gde3da0b97 (post-v0.20.0 stable). Same number.
- tp=1 isolation: bare config on a single Spark: 53.6 tok/s decode. That's ~22 tok/s below the post #11 single-Spark figure, so the gap isn't interconnect-related; it shows up on the single-node path too.
During steady-state decode I sampled nvidia-smi every 500 ms:
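(If anyone wants to reproduce the sampling, a query along these lines does it; the exact fields here are illustrative, not necessarily the ones I logged:)
nvidia-smi --query-gpu=clocks.sm,power.draw,utilization.gpu --format=csv -lms 500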
- GPU clock: 2405 MHz (locked at max boost, no throttling)
- Power draw: 21–33 W, average ~28 W (vs ~140 W TDP)
So the SMs are sitting idle most of the time waiting on memory loads; it's not a clock or thermal issue. Decode efficiency vs the theoretical ceiling (~91 tok/s from 273 GB/s ÷ 3 GB per A3B token):
- Mine: 73%
- Reference: 85%
That ~12% efficiency gap is real and I can't close it via config.
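For reference, the ceiling and efficiency figures above are just this arithmetic (273 GB/s memory bandwidth and ~3 GB read per A3B token are the assumptions already stated, so this is only a back-of-the-envelope check):
# bandwidth-bound decode ceiling and efficiency estimate
awk 'BEGIN {
  ceiling = 273 / 3.0                                # ~91 tok/s
  printf "ceiling: %.0f tok/s\n", ceiling
  printf "mine:    %.0f%%\n", 100 * 66.8  / ceiling  # ~73%
  printf "ref:     %.0f%%\n", 100 * 77.74 / ceiling  # ~85%
}'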
Question
What am I missing? Specifically:
- Container/image: for those of you hitting 75+ on single-Spark or ~78 on dual, what container did you actually run? The thread pins vLLM args but not the build. Is anyone on NGC vLLM (nvcr.io/nvidia/vllm:26.03.post1-py3)? A specific upstream vLLM commit? A custom flashinfer build?
- MoE kernel selection: anything I should check / pin / override? Any env vars beyond VLLM_MARLIN_USE_ATOMIC_ADD=1?
- Power / firmware: is there a Spark BIOS or driver tweak (power profile, persistence mode, etc.) that meaningfully changes the picture? (The generic checks I can think of are sketched after this list.)
- Anything else: happy to run further bench variations and share results if it helps narrow it down.
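On the power/firmware point, the only generic things I know to rule out are standard nvidia-smi checks (whether they matter at all on Spark is exactly what I'm asking):
# enable persistence mode and dump the driver's power readings/limits
sudo nvidia-smi -pm 1
nvidia-smi -q -d POWER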
Thanks in advance; really appreciate any pointers.
Intel released the fixed version: Intel/Qwen3.6-35B-A3B-int4-mixed-AutoRound · Hugging Face
Looks like a mistake... the FP8 version is 38 GB, while this one (int4) is 43 GB.
Lol, how many times is Intel gonna mess up this quant? It's been pulled again.
Hello,
I can't understand why no one is saying that Qwen3.6 or vLLM isn't stable at all. I've tried all the vLLM versions and almost all the Qwen3.6 35B models (FP8, INT4, NVFP4, with and without distillation), and they all have the same problem: endless repetition during reflections. The model starts creating a large file, then once it's finished, it decides, "Oh, actually, no, I don't like it, I'll do it differently," and this can go on several times. The same thing happens when it corrects a file; it will correct it multiple times. Sometimes it will even reflect, say something, use a tool, reflect, say the same thing again, use the same tool with the same parameters, and so on dozens of times, even indefinitely.
I tried every possible setting, starting with the recommended one. The last one that seemed stable was `{repetition_penalty: 1.1, temperature: 0.4}`, but ultimately, after a while, around 30k of context, it starts repeating again.
This happens with or without `preserve_thinking`, whether in Claude Code, VS Code, or even custom-built assistants.
My second ongoing issue, whether it's vLLM nightly, the latest stable vLLM 0.20, or the eugr vLLM build (which is based on the nightly), is tool calls. The tool calls end up as XML in the thinking_content output, forcing me to patch things everywhere. Qwen has been releasing templates for two years, and no one has been able to get vLLM working out of the box with this fix. I don't understand…
Could someone please create a ZIP file containing all the patches or commands needed to make it work properly? I'm starting to despair :(
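For what it's worth, the settings quoted above map onto the --override-generation-config flag mentioned earlier in the thread like this (same values, straight quotes so the JSON actually parses):
--override-generation-config '{"repetition_penalty": 1.1, "temperature": 0.4}' \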
Have you tried a different chat template? I do see what you describe in OpenWebUI when testing a few things, but I mostly use Claude Code (and OpenCode for minor things, though Claude Code is more stable for me than anything else, albeit slower) with the FP8 version of this model, and it works really well for me.
I use this model for speed and I double-check with 3.5 122b-a10b-Hybrid, which gives me half the speed but a bit better quality (though this 3.6-35B is actually pretty close for my workflow).
Here's my recipe:
# Recipe: Qwen/Qwen3.6-35B-A3B-FP8
# Qwen/Qwen3.6-35B-A3B model in native FP8 format
recipe_version: "1"
name: Qwen35-35B-A3B-Dflash
description: vLLM serving Qwen3.6-35B-A3B-FP8
# HuggingFace model to download (optional, for --download-model)
model: Qwen/Qwen3.6-35B-A3B-FP8
solo_only: true
# Container image to use
container: vllm-node-tf5
# Mods
mods:
  - mods/fix-qwen3.5-enhanced-chat-template
# Default settings (can be overridden via CLI)
defaults:
  port: 8000
  host: 0.0.0.0
  gpu_memory_utilization: 0.75
  max_model_len: 524288
  # max_model_len: 262144
  max_num_batched_tokens: 32768
  max_num_seqs: 8
# Environment variables
env:
  VLLM_MARLIN_USE_ATOMIC_ADD: 1
  VLLM_ALLOW_LONG_MAX_MODEL_LEN: 1
# The vLLM serve command template
command: |
  vllm serve Qwen/Qwen3.6-35B-A3B-FP8 \
    --served-model-name Qwen3.6-35B-A3B-FP8-DFlash \
    --host {host} \
    --port {port} \
    --max-model-len {max_model_len} \
    --max-num-batched-tokens {max_num_batched_tokens} \
    --gpu-memory-utilization {gpu_memory_utilization} \
    --enable-auto-tool-choice \
    --tool-call-parser qwen3_coder \
    --reasoning-parser qwen3 \
    --enable-prefix-caching \
    --enable-chunked-prefill \
    --load-format fastsafetensors \
    --default-chat-template-kwargs '{{"preserve_thinking": true}}' \
    --speculative-config '{{"method": "dflash", "model": "z-lab/Qwen3.6-35B-A3B-DFlash", "num_speculative_tokens": 4}}' \
    --attention-backend flash_attn
Look closely at the chat template sections and maybe try those parameters?
Update: after firmware updates and vLLM 0.20, about a ~6% increase in LLM speed, and tool usage is much faster.
@azampatti thanks for posting this. Would love to give it a try as I have a similar use-case, but running into the following error.
Warning: Mod path not found: ./spark-vllm-docker/mods/fix-qwen3.5-enhanced-chat-template
I grabbed the latest spark-vllm-docker.git, so I think this might be a custom implementation that isn't in there (yet?).
There are a few others built in, but I'm not sure if they're suitable replacements.
ls spark-vllm-docker/mods/ |grep -i qwen
fix-qwen3.5-autoround
fix-qwen3.5-chat-template
fix-qwen35-tp4-marlin
fix-qwen3-coder-next
fix-qwen3-next-autoround
Thanks
Not sure what you're comparing to, but if you're relying solely on 35b for everything then you should expect a bit of a bumpy experience. Maybe try using a larger model or even 27b for design/plan, and let 35b speedrun through the grunt work.
There's also another thread suggesting setting dtype to bfloat16. This might help with large context work.
Check @whpthomas's post here: Bfloat16 Quality = Speed? :) He shares the how-to to get this template up and running.
That's the recipe I am using with really good success so far with OpenCLaw.
How's this for some speed:
tool-eval-bench --base-url http://0.0.0.0:8000 --short --perf
Tool-Call Benchmark
Server: http://0.0.0.0:8000
Querying http://0.0.0.0:8000/v1/models … Intel/Qwen3.6-35B-A3B-int4-AutoRound
Warm-up complete (105 ms)
Engine: vLLM 0.20.1rc1.dev55+g3f1a4bb63.d20260429
llama-benchy Throughput Benchmark
Intel/Qwen3.6-35B-A3B-int4-AutoRound
pp=[2048] tg=[128] depth=[0, 4096, 8192] concurrency=[1, 2, 4] runs=3
latency=generation
Complete 27/27 in 0:01:48
llama-benchy 0.3.7
Estimated latency: 41.5 ms
llama-benchy Results
| Test | c | pp t/s | tg t/s | TTFT (ms) | Total (ms) | Tokens |
|---|---|---|---|---|---|---|
| pp2048 tg128 @ d0 | c1 | 5,982 | 71.9 | 360 | 2,100 | 2048+1… |
| pp2048 tg128 @ d0 | c2 | 5,852 | 123.1 | 616 | 2,637 | 2048+1… |
| pp2048 tg128 @ d0 | c4 | 6,258 | 188.3 | 1,123 | 3,747 | 2048+1… |
| pp2048 tg128 @ d4096 | c1 | 6,592 | 70.2 | 887 | 2,669 | 2048+1… |
| pp2048 tg128 @ d4096 | c2 | 6,651 | 120.0 | 1,662 | 3,738 | 2048+1… |
| pp2048 tg128 @ d4096 | c4 | 6,807 | 184.8 | 3,202 | 5,881 | 2048+1… |
| pp2048 tg128 @ d8192 | c1 | 6,668 | 68.4 | 1,439 | 3,269 | 2048+1… |
| pp2048 tg128 @ d8192 | c2 | 6,611 | 110.6 | 2,745 | 4,959 | 2048+1… |
| pp2048 tg128 @ d8192 | c4 | 6,665 | 164.7 | 5,515 | 8,398 | 2048+1… |
Benchmark Complete
Model: Intel/Qwen3.6-35B-A3B-int4-AutoRound
Score: 87 / 100
Rating: Good
Engine: vLLM 0.20.1rc1.dev55+g3f1a4bb63.d20260429
Quantization: INT4-AutoRound
Max context: 262,144 tokens
12 passed / 2 partial / 1 failed
Points: 26/30
Quality: 87/100
Responsiveness: 70/100 (median turn: 1.7s)
Deployability: 82/100 (α=0.7)
Weakest: A Tool Selection (67%)
Completed in 72.7s (tool-eval-bench v1.4.3.1)
Token Usage: 37,725 tokens total / Efficiency: 0.7 pts/1K tokens
Throughput:
Single: 6,668 pp t/s / 71.9 tg t/s / TTFT 360 ms
c2: 6,651 pp t/s / 123.1 tg t/s
c4: 6,807 pp t/s / 188.3 tg t/s
How this score is calculated:
- Each scenario: pass=2pt, partial=1pt, fail=0pt
- Category %: earned / max per category
- Final score: (total points / max points) × 100
- Deployability: 0.7×quality + 0.3×responsiveness
- Responsiveness: logistic curve (100 at <1s, ~50 at 3s, 0 at >10s)
I would like to use the 27b but unfortunately it is too slow :( I was talking about Qwen3.6 35b; I understand it is not as good as the 27b, but it is supposed to at least do something and not loop endlessly until the server crashes. I've already tried dtype bfloat16; I guess it's a little bit better, but after a long context it's still looping… I'm trying @azampatti's recipe, and I'm building the container from scratch to be certain there isn't a corrupted file, and we will see :)