The PR is not merged in vllm itself – a fork of vllm apparently pulled and merged the PR.
That’s a solid tool benchmark. Which chat template are you using here?
The only observation here is the median turn at 4.3s might be a bit slow for some interactive work.
I rebuilt everything from scratch and re-tested.
My custom dFlash build:
[Q&A] 256 tokens in 4.77s = 53.6 tok/s (prompt: 23)
[Code] 512 tokens in 8.90s = 57.5 tok/s (prompt: 30)
[JSON] 1024 tokens in 13.79s = 74.2 tok/s (prompt: 48)
[Math] 64 tokens in .77s = 83.1 tok/s (prompt: 29)
[LongCode] 2048 tokens in 31.41s = 65.2 tok/s (prompt: 37)
New –-tf5 build with no extra PRs:
[Q&A] 256 tokens in 4.58s = 55.8 tok/s (prompt: 23)
[Code] 512 tokens in 8.60s = 59.5 tok/s (prompt: 30)
[JSON] 1024 tokens in 13.49s = 75.9 tok/s (prompt: 48)
[Math] 64 tokens in 1.18s = 54.2 tok/s (prompt: 29)
[LongCode] 2048 tokens in 26.27s = 77.9 tok/s (prompt: 37)
Looks like we loose in Math, and we win in LongCode (I ran these three times each, so it’s consistent).
I will test out in my Claude Code workflow which is where normally DFlash falls to the floor for me when pushing the context to the limits.
This was my finding too. I rebuilt from source and found dflash working well without the PR so that’s when I checked the PR status and incorrectly read it had been merged. Something changed because before this dflash did not work well without the PR pulled in. I am using the v6 of the chat template discussed above. I’ll send the link later if you can’t find it. I have been using this since yesterday and it’s been ok but I have not pushed it hard. It sure is difficult to keep track of what is going on inside of VLLM. One thing gets fixed and another is regressed.
Thanks.
The v6 chat template works and the score is 93.
can you share your recipe and what model you are running?
At the same time, the token generation speed of my Qwen3.6-27B reached 100.4 tokens per second.
There are my parameters:
MODEL_PATH="/home/ai/Illusionna/Desktop/models/Qwen3.6-27B-INT4-AutoRound-Intel"
MODEL_NAME=$(basename "$MODEL_PATH")
VLLM_IMAGE="dgx-spark-gb10:vllm0.19.1.dev"
docker run \
--rm -it \
--gpus all \
--name ${MODEL_NAME} \
-v ${MODEL_PATH}:/models/${MODEL_NAME} \
-p 30000:30000 \
--ipc=host \
--entrypoint vllm \
--health-cmd 'curl -sf http://localhost:30000/health || exit 1' \
--health-interval 10s \
--health-timeout 5s \
--health-retries 60 \
--health-start-period 900s \
${VLLM_IMAGE} \
serve /models/${MODEL_NAME} \
--served-model-name ${MODEL_NAME} \
--host 0.0.0.0 \
--port 30000 \
--gpu-memory-utilization 0.8 \
--max-model-len 262144 \
--max-num-seqs 4 \
--max-num-batched-tokens 16384 \
--speculative-config '{"method": "mtp", "num_speculative_tokens": 4}' \
--reasoning-parser qwen3 \
--tool-call-parser qwen3_coder \
--load-format instanttensor \
--attention-backend flashinfer \
--kv-cache-dtype fp8 \
--quantization gptq_marlin \
--enable-prefix-caching \
--enable-chunked-prefill \
--enable-auto-tool-choice \
--trust-remote-code
Downside with INT4 Autoround is slow prompt processing speeds on 27b dense (seems fine on 35b MOE though, INT8 Autoround hits 10k t/s PP on dual sparks for me, INT4 goes even higher). NVFP4 / Paroquant on 27b can hit 3000+ t/s on a single spark while having similar TG to int4.
My current BEST performing recipe is for Qwen3.6-35B-A3B with the latest vLLM (or the DFlash optimized, they trade blows in different tests) + Chat Template 3.6 v9 + Introducing vLLM-Tune — Kernel tuning CLI for vLLM on DGX Spark
I get solid 93/100 score in Tool Eval Bench with nice performance and parallelism.
what are you running to get this benchmark output?
[Q&A] 256 tokens in 4.77s = 53.6 tok/s (prompt: 23)
[Code] 512 tokens in 8.90s = 57.5 tok/s (prompt: 30)
[JSON] 1024 tokens in 13.79s = 74.2 tok/s (prompt: 48)
[Math] 64 tokens in .77s = 83.1 tok/s (prompt: 29)
[LongCode] 2048 tokens in 31.41s = 65.2 tok/s (prompt: 37)
I’m running a slightly modified version from @Albond 's version from GitHub - albond/DGX_Spark_Qwen3.5-122B-A10B-AR-INT4: Qwen3.5-122B-A10B on DGX Spark: 28.3 → 51 tok/s (+80%) · GitHub
I added parallelism and model detection, but the single-user results are exactly the same if you compare with that output :)
No what I meant is what benchmark is producing that output.. I am using tool-eval-bench and have not seen that kind of output format
I understood perfectly. Look at Albond’s repo, you’ll find the bench shell script that I’m using
I see, so you use fp8?
Yes. 99% of the times, yes :)
My exception is when I use albond’s hybrid qwen3.5-122b-a10b.
I’ve been trying PrismaQuant for both 27B and 35B but the results in my workflow are ‘iffy’, so I always go back to 3.6-35B-A3B FP8. I do move a lot between DFlash and MTP. They both are trade the crown on my two main workflows, but overall they are very similar in terms of behavior (MTP wins in LONG context work, and DFlash in parallelism).
where you get the Chat Template 3.6 v9 +? my fp8 score is not good compare to 27b. I think my chat template is different than yours.
I am seeing great results during opencode sessions, i ran multi hour coding sessions yesterday without any issues for the first time.
I am using the default recipe and chat v6 template
Model: Qwen/Qwen3.6-35B-A3B-FP8
Score: 97 / 100
Rating: ★★★★★ Excellent
Engine: vLLM 0.20.1rc1.dev172+g3a6bf961c.d20260503
Quantization: FP8
Max context: 262,144 tokens
✅ 14 passed ⚠️ 1 partial ❌ 0 failed
Points: 29/30
Quality: 97/100
Responsiveness: 57/100 (median turn: 2.5s)
Deployability: 85/100 (α=0.7)
Weakest: D Restraint & Refusal (83%)
Completed in 104.6s │ tool-eval-bench v1.8.0
Token Usage:
Total: 70,841 tokens │ Efficiency: 0.4 pts/1K tokens
Recipe please.
I saw 95 point once. But I’ve never seen such point.
That 97 score is only with --short switch, so 15 tests only. To get a more accurate score you would want to run the full 74 tests with --hardmode.
correct, --hardmode bench score dropped 92/100. I was more highlighting that with the latest updates my session are becoming noticeably more reliable. Previously i have struggled with wrong or failed tool calls and endless loops.
From hands on use I’ve noticed when session context starts to reach 70-80k + tokens, i feel the decline in output quality.
