│ Tokens │ RedHat+VLLM_CUTLASS │ GB10+VLLM_CUTLASS+compile │ GB10+FLASHINFER+eager │ GB10+FLASHINFER+compile │
├────────┼─────────────────────┼───────────────────────────┼───────────────────────┼─────────────────────────┤
│ 256 │ 57.6 │ 41.4 │ 42.2 │ 66.1 │
├────────┼─────────────────────┼───────────────────────────┼───────────────────────┼─────────────────────────┤
│ 512 │ 64.4 │ 71.4 │ 40.7 │ 84.4 │
├────────┼─────────────────────┼───────────────────────────┼───────────────────────┼─────────────────────────┤
│ 1024 │ 66.1 │ 76.3 │ 69.8 │ 71.4 │
├────────┼─────────────────────┼───────────────────────────┼───────────────────────┼─────────────────────────┤
│ 2048 │ 64.5 │ 75.7 │ 48.0 │ 79.1 │
└────────┴─────────────────────┴───────────────────────────┴───────────────────────┴─────────────────────────┘
It is real ;)
You are right about this. I got a 6% improvement here.
I am also running this as recipe + mod, works nicely, thanks for the enforce eager hint! :)
For Qwen3-Coder-Next-NVFP4-GB10 SGLang DFlash is +38% faster for short code (150 vs 108 tok/s) but slower for long sequences (-4%). The overall average is close (97 vs 101).
Still figuring out the details, will share later.
Funny, I was just asking about SGLang in the "Why do so many people here prefer vLLM?" thread. It's been on my list forever, but I never tried it; I was too involved with all the vLLM stuff.
How short exactly is short code? Like the 307 tokens you're showing in that screenshot?
Can you specify which image you're using and how you're launching vLLM/SGLang?
I’m using the SGLang nightly with CUDA 13 support:
lmsysorg/sglang:nightly-dev-cu13-20260415-2c9e76d3
DFlash was merged into SGLang on April 7 (PR #22077), so any nightly after that date should work. The older NVIDIA container (nvcr.io/nvidia/sglang:26.02-py3) does not have it (that one ships SGLang 0.5.8 which is too old).
Two patches are needed for Qwen3-Coder-Next NVFP4 on this image:
- qwen3_next.py: the GDN layers crash with compressed-tensors quantization because MergedColumnParallelLinear doesn't have a .weight attribute. Fix: wrap the _override_weight_loader calls in hasattr checks.
- expert_location.py: the EPLB module fails to resolve Qwen3NextForCausalLM via get_model_architecture(). Fix: a try/except around that call.
Both are volume-mounted, no rebuild needed.
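For reference, the shape of the qwen3_next.py fix looks roughly like this (a minimal sketch; the function name and call sites are illustrative, not the exact SGLang code):

```python
# Sketch of the hasattr guard described above. In the unpatched file the
# _override_weight_loader calls run unconditionally; quantized layers such
# as MergedColumnParallelLinear under compressed-tensors may not expose a
# .weight attribute, so we check before touching it.

def safe_override_weight_loader(layer, loader):
    """Attach a custom weight loader only if the layer exposes .weight."""
    if hasattr(layer, "weight"):
        layer.weight.weight_loader = loader
        return True
    # Quantized layer stores packed weights under other attributes;
    # skip the override instead of crashing.
    return False
```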
Launch command:
docker run -d --name sglang_production \
--gpus all --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 \
--restart unless-stopped \
-v /data/huggingface:/root/.cache/huggingface \
-v ~/sglang-patches/qwen3_next.py:/sgl-workspace/sglang/python/sglang/srt/models/qwen3_next.py \
-v ~/sglang-patches/expert_location.py:/sgl-workspace/sglang/python/sglang/srt/eplb/expert_location.py \
-p 8000:30000 \
-e HF_HUB_DISABLE_XET=1 \
-e SGLANG_ENABLE_JIT_DEEPGEMM=0 \
-e SGLANG_ENABLE_DEEP_GEMM=0 \
lmsysorg/sglang:nightly-dev-cu13-20260415-2c9e76d3 \
python3 -m sglang.launch_server \
--model-path saricles/Qwen3-Coder-Next-NVFP4-GB10 \
--served-model-name Qwen3-Coder-Next-DFlash \
--speculative-algorithm DFLASH \
--speculative-draft-model-path z-lab/Qwen3-Coder-Next-DFlash \
--attention-backend flashinfer \
--mem-fraction-static 0.55 \
--max-running-requests 4 \
--disable-cuda-graph \
--mamba-scheduler-strategy extra_buffer \
--tool-call-parser qwen3_coder \
--trust-remote-code \
--host 0.0.0.0 --port 30000
Key things:
- --mamba-scheduler-strategy extra_buffer is required because Qwen3-Next has hybrid GDN (recurrent) layers
- DeepGEMM is disabled (SGLANG_ENABLE_JIT_DEEPGEMM=0) — the scale format doesn’t match and causes accuracy issues on Blackwell
- FP4 GEMM backend auto-selects flashinfer_cudnn which is the fastest on SM120
- CUDA graphs help long generations but hurt short ones. I run without them since my workload is mostly code generation (short-to-medium)
- Startup takes about 5 minutes
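Once the server is up, a quick smoke test is just an OpenAI-style chat request against port 8000. A minimal sketch of the request body (the fields follow the usual OpenAI-compatible API that SGLang serves; the prompt is just an example):

```python
import json

def build_chat_request(prompt, model="Qwen3-Coder-Next-DFlash", max_tokens=256):
    """Build a JSON body for the OpenAI-compatible chat endpoint."""
    # "model" must match --served-model-name from the launch command above.
    return json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    })

# POST this body to http://localhost:8000/v1/chat/completions
body = build_chat_request("Write a hello world in Python.")
```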
I put the patches and full instructions in a gist (AI generated content): SGLang + DFlash on DGX Spark (Qwen3-Coder-Next NVFP4) — 150 tok/s · GitHub
Nice, I think I agree with Opus; the responsiveness in Claude Code, combined with Minimax 2.7 or Qwen 397B, would be impressive. It's great that things like this are coming out, since AI is getting more expensive.
Metric                         Value
─────────────────────────────  ─────────────
Draft tokens per step          16
Avg accept rate                0.26 (26%)
Avg tokens accepted per step   ~4.19
Baseline (no DFlash)           1 token/step
Test                Tokens   Time     Throughput
─────────────────   ──────   ──────   ───────────
Short (512 tok)     512      4.10s    124.8 tok/s
Medium (1024 tok)   1024     17.10s   59.9 tok/s
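As a sanity check on the acceptance numbers above (my simplification, not how SGLang computes its stats): for a block drafter, accepted tokens per step comes out to roughly block size times per-token accept rate, and 16 × 0.26 = 4.16, which lines up with the reported ~4.19.

```python
def expected_accepted_per_step(draft_tokens, accept_rate):
    # Crude first-order model: treat each drafted position as accepted
    # independently at the average accept rate. Real acceptance is
    # position-dependent (later tokens in the block are less likely to
    # survive), so this is a back-of-envelope check only.
    return draft_tokens * accept_rate

estimate = expected_accepted_per_step(16, 0.26)  # ~4.16, vs. measured ~4.19
```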
I have good results with Goose cli with Qwen3-Coder-Next. I added a coder agent to Claude Code where Opus4.6 does the orchestration and review and uses Goose cli for coding.
I'm using something similar: I route Claude Code through LiteLLM; Opus 4.6 stays the same, Sonnet maps to Qwen 397B, and Haiku to DFlash Qwen3 Coder Next. Claude Code itself launches agents and orchestrates everything. I'll try Goose, thanks.
Anyone else here noticed the nerfing and extreme subscription token burn for Opus 4.6 lately? Insane; in some cases it's dumber than some of my local models …
After 4 months, I wonder if it's time to recap llama.cpp, SGLang, and vLLM for the current state of compatibility and performance on consumer Blackwell, but specifically the Spark.
I used llama.cpp for gemma4 just to quickly get an impression of the model. What a breeze to get running; however, performance is really bad IMHO. Then again, I didn't invest too much time, since vLLM and now SGLang eat up enough of my experimentation time …
Currently I'm switching between qwen3.5 A122 and A27 and qwen-coder-next. Specifically, I'm doing code audits for security reasoning, so I don't need it to perfectly code something, but I need it to perfectly understand code and its dependencies …
At the moment it's tough. A27 often runs into weird thinking loops, but A122-int4 has been the worst model in terms of actual results so far.
Same here. The 35b-a3b-fp8 is so nimble; as long as you keep the context length short (~70k) it screams, though it can then fall off a cliff. However, this fix has toned down the thought-loop problem, and I haven't really experienced one since. The 122b-hybrid-int4fp8 handles larger context (~140k) but is very opinionated at times and requires grill-me sessions at the start to guide its thinking. I still turn to Qwen Coder Next AutoRound 16kv for debugging when the other two get stuck. Finally, the qwen3.5-enhanced.jinja chat template has reduced tool-call failures, which makes the whole experience much more confidence-inspiring.
At the end of the day, I feel like there is a lot of subjectivity. One day I think one model is better, then it gets stuck, reward-hacks, or chooses its own adventure, and another comes to the rescue. Was it fresh context? Was the jar lid already loosened? Some days I feel certain; then, after a bad experience, I don't. My AI coding, prompting, and workflows are all improving at the same time, so it's never a fair test. The benchmarks never reflect the real-life experience of coding with these models. I have to use them exclusively because my clients need air-gapped solutions, so no frontier models allowed.
═══ Benchmark ═══
[✓] Model: Intel/Qwen3.5-35B-A3B-int4-AutoRound
╔══════════════════════════════════════════════════════╗
║ Benchmark: Qwen3.5-35B-A3B-int4-AutoRound — 2026-04-16 15:59
╚══════════════════════════════════════════════════════╝
Warm-up... done
── Sequential (1 request) ──────────────────────────────
Run 1/2:
[Q&A ] 256 tokens in 1.98s = 129.2 tok/s
[Code ] 427 tokens in 2.62s = 162.6 tok/s
[JSON ] 1024 tokens in 7.06s = 144.9 tok/s
[Math ] 32 tokens in .25s = 126.9 tok/s
[LongCode ] 2048 tokens in 11.27s = 181.7 tok/s
Run 2/2:
[Q&A ] 256 tokens in 1.97s = 129.7 tok/s
[Code ] 402 tokens in 2.48s = 161.7 tok/s
[JSON ] 1024 tokens in 7.85s = 130.3 tok/s
[Math ] 32 tokens in .25s = 124.5 tok/s
[LongCode ] 2048 tokens in 11.26s = 181.7 tok/s
── Concurrent (4 parallel requests) ───────────────────────────
Sending 4 requests simultaneously, measuring total throughput...
[req1 ] 1024 tokens = 82.7 tok/s (end-to-end)
[req2 ] 1024 tokens = 83.7 tok/s (end-to-end)
[req3 ] 1024 tokens = 84.5 tok/s (end-to-end)
[req4 ] 1024 tokens = 82.4 tok/s (end-to-end)
Total: 4096 tokens in 12.44s
Total throughput: 329.2 tok/s (4 requests completed)
with @eugr's spark-vllm-docker, so fast
Hello all,
Thank you all for your work. I just spent two days trying Qwen3-coder-next-nvfp4 DFLASH in my hermes-agents. My experience is good in terms of speed and inference performance; however, the output quality, tool execution, and reasoning are poorer than with Qwen-3-coder-next-int4-autoround. So, for the performance actually benchmarked in the Spark arena, I would say the trade-off of this optimization is too steep.
This is my opinion, and I may have made mistakes.
Cheers !
DFlash and other drafting methods are transparent to the underlying model. So functionally it does not matter which drafter you choose; pick whichever is fastest for you in your use case. The fidelity of the quant relative to the full weight model used to train the drafter will also play into the effectiveness of the drafter.
You can use DFlash with the int4-autoround quant.
Both DFlash and AutoRound INT4 depend on how the base LLM is prepared. With sufficient time and hardware, you can build versions that outperform what is publicly available.
- AutoRound INT4 is not a simple BF16-to-INT4 conversion — it includes a training process to ensure the quantized model’s output closely matches the original BF16 quality. If you have the resources, you can train a custom AutoRound INT4 model tailored to your specific use cases. Intel provides a default version and the necessary tooling, but the results can be improved with custom training.
- DFlash uses a diffusion-based approach, meaning it generates a block of tokens at once rather than one token at a time — similar to how Stable Diffusion generates an image. Like any diffusion model, the public default version may produce artifacts (think of the classic “six fingers” issue in image generation). Training DFlash on your own data eliminates these problems.
That said, I currently don’t have the resources to run custom training for my specific cases. For now, I’m satisfied with AutoRound INT4, which delivers around 50 tokens/second — a good balance of speed and quality for my needs.
The root issue is simple (why DFlash comes out worse than AutoRound INT4 for you): neither of us has the budget or hardware to do that training (we are just beggars).
And honestly, there’s a bit of irony here: if either of us had the resources to properly train a high-quality DFlash or AutoRound INT4 model, we wouldn’t even need these optimizations in the first place — we could just run a full-quality model like GLM-5 directly and not bother with any of this.
Hey everyone,
I’ve been following the progress on NVFP4 and DFlash for the Spark, but I think we’re leaving massive performance on the table by ignoring kernel launch overhead.
Even with a tiny 0.1B–0.8B drafter, we’re wasting 30–50% of the drafting cycle in CPU-to-GPU “bubbles” (1.5ms+ of launch noise). This effectively kills the “speculative profit” when running against a high-speed target like Nemotron-3 Super.
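To make the "bubble" claim concrete (illustrative numbers only, matching the rough figures above, not measurements from a Spark):

```python
def launch_bubble_fraction(draft_step_ms, launch_overhead_ms):
    # Fraction of each drafting step spent waiting on CPU-side kernel
    # launches rather than doing useful GPU work.
    return launch_overhead_ms / draft_step_ms

# If a small drafter's forward pass takes ~3-5 ms and ~1.5 ms of that is
# launch noise, 30-50% of the drafting cycle is dead time:
low = launch_bubble_fraction(5.0, 1.5)   # 0.30
high = launch_bubble_fraction(3.0, 1.5)  # 0.50
```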
The Proposal: Why don't we just take the Luce Megakernel work and drop it in as the DFlash drafter?
- The Bridge: Luce fuses all 24 layers of Qwen 3.5-0.8B (DeltaNet/Attention hybrid) into a single persistent dispatch. Drafting becomes near-instant (~0.5ms), making the speculative step essentially "free."
- Proof of Portability: It's already been vibe-coded to an RTX 3060 by just changing the num_blocks constant. It should map to the Grace Blackwell (GB10) SMs with minimal friction. (Proof: The CUDA Trick That Makes LLMs Faster)
- The "Trifecta" ROI: By fusing the drafter, we can sustain wider draft trees (DDTree) without a latency penalty. This is the path to hitting 200+ tps on the 100B+ models we're already working on.
Repo: Lucebox-hub / Megakernel
Has anyone already tried dropping a persistent kernel drafter into the current DFlash/vLLM branch for the Spark? Seems like the most logical way to bridge the gap between our current 67 tps and the hardware’s actual ceiling.
