I just tried switching over from vLLM. I've been running llama.cpp + spiritbuun’s DFlash fork on 35B-A3B and getting some wild numbers.
The key was tuning --spec-draft-p-min 0.3 (kills low-confidence drafts early) and --spec-draft-n-max 14. Block-diffusion + MoE is a
great combo: only ~3B params activate per verify, so cycles are fast.
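For anyone wondering what those two settings do in practice, here is a rough Python sketch of the cutoff logic as I understand it from the description above. It is not the fork's actual code: `truncate_draft` and the list of per-token draft confidences are hypothetical, and it assumes the draft is cut at the first token whose confidence drops below p-min and is capped at n-max tokens.

```python
def truncate_draft(draft_probs, p_min=0.3, n_max=14):
    """Return how many drafted tokens would be handed to the verifier.

    draft_probs: hypothetical per-token confidences from the draft model,
                 in generation order.
    p_min:       confidence floor (mirrors the --spec-draft-p-min value above).
    n_max:       hard cap on draft length (mirrors --spec-draft-n-max above).
    """
    kept = 0
    for p in draft_probs[:n_max]:
        if p < p_min:
            # First low-confidence token ends the draft; everything after it
            # is dropped instead of being sent for verification.
            break
        kept += 1
    return kept


if __name__ == "__main__":
    # Made-up confidences, purely for illustration.
    print(truncate_draft([0.9, 0.8, 0.2, 0.95]))
```

The idea is that a short, confident draft wastes less verify work than a long draft whose tail would be rejected anyway.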
35B-A3B results (after quality fix):
HTML/JS coding (~600 tok): 92-101 tok/s
HTML/JS sustained (~2000 tok): 85-92 tok/s
Short chat: ~44 tok/s (DFlash actually hurts here: verify overhead outweighs the gain, so stock at 60-66 tok/s is better for chat)
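A back-of-the-envelope way to see why short chat loses while coding wins is below. This is a simplified model, not something measured on this setup: it assumes each speculative cycle emits the accepted draft tokens plus one token from the verifier, and the function names and any timings you plug in are hypothetical.

```python
def spec_tokens_per_second(mean_accepted, draft_ms, verify_ms):
    """Throughput of speculative decoding under a simple per-cycle model.

    mean_accepted: average number of draft tokens accepted per cycle.
    draft_ms:      time to produce one draft block, in milliseconds.
    verify_ms:     time for one verify pass of the big model, in milliseconds.
    """
    tokens_per_cycle = mean_accepted + 1  # accepted drafts + verifier's own token
    return tokens_per_cycle * 1000.0 / (draft_ms + verify_ms)


def break_even_accepted(baseline_tok_s, draft_ms, verify_ms):
    """Mean accepted tokens per cycle needed just to match stock decoding."""
    return baseline_tok_s * (draft_ms + verify_ms) / 1000.0 - 1


if __name__ == "__main__":
    # Illustrative numbers only; replace with timings from your own runs.
    print(spec_tokens_per_second(mean_accepted=3, draft_ms=10, verify_ms=30))
    print(break_even_accepted(baseline_tok_s=60, draft_ms=10, verify_ms=30))
```

Under this model, predictable output like HTML/JS keeps the acceptance count well above break-even, while free-form chat can fall below it, at which point the extra draft and verify time is pure overhead.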
Yep, I'm also seeing similar speeds with DFlash on Qwen 3.6 27B with the sliding window attention PR and vLLM 19.2. Man, I hope so badly that they release a 122B 3.6 model. It would be the perfect mix of quality and speed. Right now I feel the 27B compromises on speed, the 35B compromises on quality, and the 3.5 122B is worse in quality than both according to nearly all benchmarks.
Yes, the fast loading time is a great advantage. I’ve set up my DGX as a local AI server for our online team, which helps save a significant amount of power. I created a script that automatically starts the system when triggered, making the AI ready in just 5 to 10 seconds. Then, if it remains idle for [X] minutes, it automatically shuts down.
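In case it helps anyone build something similar, the idle-shutdown part could look roughly like the sketch below. This is not my actual script: `IdleWatchdog`, `touch`, `check`, and the `stop_server` callback are hypothetical names, and the idle limit is whatever you choose. You would call `touch()` on every incoming request and `check()` from a timer or loop.

```python
import time


class IdleWatchdog:
    """Calls stop_server() once no request has arrived for idle_seconds."""

    def __init__(self, idle_seconds, stop_server, clock=time.monotonic):
        self.idle_seconds = idle_seconds
        self.stop_server = stop_server      # hypothetical callback, e.g. kills the server process
        self.clock = clock                  # injectable so the logic can be tested without waiting
        self.last_activity = clock()
        self.stopped = False

    def touch(self):
        """Record that a request just arrived."""
        self.last_activity = self.clock()

    def check(self):
        """Stop the server if it has been idle too long. Returns True if it stopped now."""
        if self.stopped:
            return False
        if self.clock() - self.last_activity >= self.idle_seconds:
            self.stop_server()
            self.stopped = True
            return True
        return False
```

Using a monotonic clock avoids false shutdowns when the system time is adjusted, and keeping the stop action as a callback means the same logic works whether you stop a process, a container, or a service.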