With DFlash, I'm seeing over 50 t/s on some coding tasks with a dual Spark.
Unfortunately one of my GX10s has suddenly started shutting down under load. I might open it up and repaste it, since it seems to be running a little hotter than the other one.
This is the first framework to support both DFlash and DDTree on GB10. I just got it working with the above. Benching is awkward because llama.cpp doesn't report metrics while speculative decoding is enabled. Here is a reference for everything at defaults:
────────────────────────────────────────────────────────
│ Benchmark: Qwen3.6-27B-Q4_K_M │ 2026-04-23 17:50
────────────────────────────────────────────────────────
Warm-up... done
── Sequential (1 request) ──────────────────────────────
Run 1/2:
[Q&A ] 256 tokens in 7.72s = 33.1 tok/s
[Code ] 512 tokens in 15.78s = 32.4 tok/s
[JSON ] 1024 tokens in 23.10s = 44.3 tok/s
[Math ] 32 tokens in 0.88s = 36.1 tok/s
[LongCode ] 2048 tokens in 50.58s = 40.4 tok/s
Run 2/2:
[Q&A ] 256 tokens in 7.56s = 33.8 tok/s
[Code ] 512 tokens in 15.68s = 32.6 tok/s
[JSON ] 1024 tokens in 22.57s = 45.3 tok/s
[Math ] 32 tokens in 0.89s = 35.6 tok/s
[LongCode ] 2048 tokens in 50.38s = 40.6 tok/s
Concurrency is nonexistent, prefill is poor (it hardcodes ubatch=192 somewhere), and it's llama.cpp under the hood. Spinning it up was a bit bumpy. But it does in fact serve Qwen3.6-27B (Q4_K_M in this case) at speeds never seen before on a single Spark.
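Since there are no server-side metrics to scrape with speculative decoding on, I just time the stream client-side. A rough sketch: the endpoint and model name are placeholders for whatever your local server reports, and chunk counts only approximate tokens.

import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="none")

def bench(prompt: str, max_tokens: int = 512) -> None:
    start = time.perf_counter()
    first = None
    chunks = 0
    stream = client.chat.completions.create(
        model="qwen3.6-27b",  # placeholder; match your served model
        messages=[{"role": "user", "content": prompt}],
        max_tokens=max_tokens,
        stream=True,
    )
    for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            if first is None:
                first = time.perf_counter()  # time to first token
            chunks += 1
    total = time.perf_counter() - start
    if chunks:
        # One content chunk is roughly one token, but with speculative decoding
        # a chunk can carry several, so treat these numbers as relative only.
        print(f"{chunks} chunks in {total:.2f}s = {chunks / total:.1f} tok/s, "
              f"ttft {(first - start) * 1000:.0f}ms")

bench("Write a quicksort in Python.")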
The gains are mostly real, too: for domain text and complex tasks I see 25-28 tok/s in practice.
What's the feedback been with tool calling and fairly complex coding tasks? I've tried a few other Qwen models and they've been somewhat disappointing compared to other agentic-esque models. I'm using Minimax M2.7 right now. Can't find any benchmarks comparing the two directly, so I figured I'd ask here.
Have you tried the Qwen models using the fixed template + qwen_xml tool parser? It seems to fix issues for a lot of folks, especially when using them in opencode.
Hmm, maybe I haven't used that fixed template; I was experiencing a lot of issues when using Qwen with Claude Code. I'll go through Eugr's repo and see if I can find an example of the template and parser being used.
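For a quick sanity check that the template + parser combo actually produces structured calls, something like this works against any OpenAI-compatible endpoint. The endpoint, model name, and tool schema here are placeholders, not taken from the benchmark below:

import json
from openai import OpenAI

client = OpenAI(base_url="http://0.0.0.0:8080/v1", api_key="none")

# Single dummy tool; the point is only to see whether the server turns the
# model's output into a structured call instead of raw XML/JSON leaking
# into the message content.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

resp = client.chat.completions.create(
    model="Qwen/Qwen3.6-27B-FP8",  # placeholder; match your served model
    messages=[{"role": "user", "content": "What's the weather in Berlin?"}],
    tools=tools,
)

msg = resp.choices[0].message
if msg.tool_calls:
    fn = msg.tool_calls[0].function
    print("parsed tool call:", fn.name, json.loads(fn.arguments))
else:
    print("no structured call; raw content:", msg.content)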
╭──────────────────────────── 🔧 Tool-Call Benchmark ────────────────────────────╮
│ Qwen/Qwen3.6-27B-FP8 via vllm @ http://0.0.0.0:8080                            │
│ 15 scenarios · v1.4.1                                                          │
╰─────────────────────────────────────────────────────────────────────────────────╯
✓  TC-01 Direct Specialist Match   PASS    2/2   9.0s  ttft=2,514ms  t2  Used get_weather with Berlin only.
✓  TC-02 Distractor Resistance     PASS    2/2   6.3s  ttft=1,827ms  t2  Used only get_stock_price for AAPL.
✓  TC-03 Implicit Tool Need        PASS    2/2  13.2s  ttft=4,111ms  t3  Looked up Sarah before sending the email.
✓  TC-04 Unit Handling             PASS    2/2   6.6s  ttft=2,257ms  t2  Requested Tokyo weather in Fahrenheit explicitly.
✓  TC-05 Date and Time Parsing     PASS    2/2  17.2s  ttft=9,524ms  t2  Parsed next Monday and included the requested meeting details.
✓  TC-06 Multi-Value Extraction    PASS    2/2  10.2s  ttft=4,666ms  t2  Issued separate translate_text calls for both languages.
✓  TC-07 Search → Read → Act       PASS    2/2  20.1s  ttft=2,984ms  t5  Completed the full four-step chain with the right data.
✓  TC-08 Conditional Branching     PASS    2/2  16.1s  ttft=5,811ms  t3  Checked the weather first, then set the rainy-day reminder.
✓  TC-09 Parallel Independence     PASS    2/2  10.2s  ttft=3,302ms  t2  Handled both independent tasks.
✓  TC-10 Trivial Knowledge         PASS    2/2   3.2s  ttft=3,091ms      Answered directly without tool use.
✓  TC-11 Simple Math               PASS    2/2   8.1s  ttft=7,996ms      Did the math directly; good restraint.
✓  TC-12 Impossible Request        PASS    2/2  13.0s  ttft=6,326ms      Refused cleanly because no delete-email tool exists.
✓  TC-13 Empty Results             PASS    2/2  17.0s  ttft=2,765ms  t4  Retried after the empty result and recovered.
⚠️ TC-14 Malformed Response        PARTIAL 1/2   7.8s  ttft=2,054ms  t2  Acknowledged the error but did not attempt an alternative source.
✓  TC-15 Conflicting Information   PASS    2/2  11.0s  ttft=2,517ms  t3  Used the searched population value in the calculator.
In Claude Code I had better luck with 8, but even then I didn't really see draft rates go above 50%. Also, we might see better rates when the actual 3.6 DFlash model gets released for the 27B model. You were using z-lab/Qwen3.5-27B-DFlash, right?
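For intuition on why ~50% acceptance caps the gains: under the usual i.i.d. acceptance simplification from the speculative decoding literature, the expected number of tokens committed per target-model pass with draft length k and per-token acceptance rate alpha is (1 - alpha^(k+1)) / (1 - alpha). A quick sketch:

# Expected tokens committed per target-model pass under speculative decoding,
# assuming each draft token is accepted i.i.d. with probability alpha.
def expected_tokens(alpha: float, k: int) -> float:
    return (1 - alpha ** (k + 1)) / (1 - alpha)

for alpha in (0.5, 0.7, 0.9):
    print(f"alpha={alpha}: k=4 -> {expected_tokens(alpha, 4):.2f}, "
          f"k=8 -> {expected_tokens(alpha, 8):.2f} tokens/pass")

At 50% acceptance this tops out just under 2 tokens per pass no matter how long the draft, and the draft model's own cost eats into even that, which lines up with the modest speedups.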
Performance feels similar to what I was getting all week with Minimax M2.7, but it's my first time using MTP and it feels a bit inconsistent. PP feels slower, though.
Here is a session of agentic coding, from 0K to 100K context:
One thing I noticed is how often the model just stops. I didn't have to think about this issue for two whole weeks with M2.7. I did build a way to have the agent auto-continue in openfox (https://www.npmjs.com/package/openfox): the planner creates completion criteria, and the builder loops while they're not met (see the sketch below).
I'm still evaluating its capability, but it feels strong.
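The loop itself is simple. Here's a minimal sketch of the idea, not the actual openfox code (the names are illustrative):

from dataclasses import dataclass
from typing import Callable

@dataclass
class Criterion:
    description: str                 # e.g. "all tests pass"
    check: Callable[[], bool]        # machine-checkable predicate

def run_with_auto_continue(build_step: Callable[[list[str]], None],
                           criteria: list[Criterion],
                           max_rounds: int = 10) -> bool:
    # Planner has already emitted explicit completion criteria; the builder
    # keeps looping until every criterion checks out or the budget runs out.
    for _ in range(max_rounds):
        unmet = [c.description for c in criteria if not c.check()]
        if not unmet:
            return True              # everything satisfied, stop looping
        build_step(unmet)            # re-prompt: "continue, still missing: ..."
    return False                     # budget exhausted, surface to the user

This catches the premature stops because the decision to continue rests on the checkable criteria, not on whether the model thinks it's done.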
Here is a real-world benchmark I'm running at the moment. It's a real task that reflects pretty much exactly what I do all day, runtime only: heavy code reading, verification, and vulnerability verification / bug finding. It's not running parallel requests; with those, the RTX gets way ahead, obviously, though the Spark improves a bit too.
Everything is measured after a fresh boot but after warm-up requests. The results had similar accuracy and all were correct.
Wondering how AutoRound compares in quality to FP8 and 5.5-bit Prismaquant. For me FP8 and Prismaquant are comparable; the question is how much worse int4 AutoRound is.
On the tool bench, int4 AutoRound got 88/100 points vs 93/100 on the FP8 quant. TG was nearly double on int4 AutoRound, but interestingly PP was almost double on FP8.
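For anyone wanting to reproduce an int4 AutoRound quant, the recipe is short. This is a sketch following the intel/auto-round README; the model name is a placeholder and the exact API may have shifted between releases:

# Int4 weight-only quantization with AutoRound (treat as a sketch; check the
# current auto-round docs, as the API has changed across releases).
from auto_round import AutoRound
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen3.6-27B"  # placeholder
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name)

autoround = AutoRound(model, tokenizer, bits=4, group_size=128, sym=True)
autoround.quantize()
autoround.save_quantized("./Qwen3.6-27B-int4-autoround", format="auto_round")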