DFlash + Qwen3-Coder-Next on eugr’s spark-vllm-docker — early test
Confirming that the gist’s 2-line SupportsEagle3 patch
(DFlash speculative decoding for Qwen3-Coder-Next on DGX Spark: 2-line vLLM patch, 88-108 tok/s, on GitHub)
composes cleanly with eugr’s spark-vllm-docker setup. It is wired in as a mods/ script
and applied at container start, so no image rebuild is needed.
Setup
- Hardware: DGX Spark, GB10 (1 GPU, 128 GB unified memory)
- Container: vllm-node-tf5 (eugr’s image), vLLM 0.19.1rc1.dev241+g4d042ed85.d20260413
- Storage: model weights on a USB-attached external SSD (ext4). This is relevant for the load times below; an internal NVMe would likely be faster.
- Target: saricles/Qwen3-Coder-Next-NVFP4-GB10
- Drafter: z-lab/Qwen3-Coder-Next-DFlash
- Launch flags from gist (verbatim): --enforce-eager, --attention-backend flash_attn, --max-num-batched-tokens 32768, --max-num-seqs 4, --gpu-memory-utilization 0.60, num_speculative_tokens 15
- --max-model-len: not set (vLLM auto-resolves to the model default, 262144), same as the gist’s command
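For anyone assembling the same launch by hand, here is a minimal sketch that builds the argument list from the flags listed above. It assumes the speculative settings are passed through vLLM's `--speculative-config` JSON option with only the `model` and `num_speculative_tokens` keys; the gist may set additional keys (for example a method name), and eugr's repo launches through its own scripts, neither of which is reproduced here. `build_vllm_args` is a hypothetical helper name.

```python
import json
import shlex

TARGET = "saricles/Qwen3-Coder-Next-NVFP4-GB10"
DRAFTER = "z-lab/Qwen3-Coder-Next-DFlash"


def build_vllm_args(target=TARGET, drafter=DRAFTER,
                    num_speculative_tokens=15, enforce_eager=True):
    """Return an argv list for `vllm serve` using the flags from the post."""
    # ASSUMPTION: only these two keys are shown; the gist's exact
    # speculative config may contain more (e.g. a "method" entry).
    spec_config = {
        "model": drafter,
        "num_speculative_tokens": num_speculative_tokens,
    }
    args = [
        "vllm", "serve", target,
        "--attention-backend", "flash_attn",
        "--max-num-batched-tokens", "32768",
        "--max-num-seqs", "4",
        "--gpu-memory-utilization", "0.60",
        "--speculative-config", json.dumps(spec_config),
    ]
    # --max-model-len is deliberately not set, matching the post.
    if enforce_eager:
        args.append("--enforce-eager")
    return args


# Print a shell-quoted command line for inspection; nothing is executed.
print(shlex.join(build_vllm_args()))
```

Calling `build_vllm_args(enforce_eager=False)` gives the compile-mode variant used in the A/B further down.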
NVFP4 path looks healthy (no fallback)
Using FlashInferCutlassNvFp4LinearKernel for NVFP4 GEMM
Using 'FLASHINFER_CUTLASS' NvFp4 MoE backend
[Autotuner]: Tuning fp4_gemm: 100% (16/16) — completed
So we’re not hitting the architecture-mismatch / FP4 fallback issue from
later in the thread.
Throughput observed
Single-stream, code-generation prompt (HTML/CSS/JS calculator, ~500 token output):
- Steady-state generation: ~36 tok/s
- Draft acceptance: ~20-35%, occasionally 14%, briefly touching 50%+
- First request takes on the order of minutes (cold KV cache + spec decode init); subsequent
requests land in the 27-36 t/s band
This is consistent with @eugr’s “31 t/s for complex tasks vs >70 t/s for
simple HTML generation” and @norman.2’s “10-25% initial acceptance, can
climb to 60-70%” — feels prompt-complexity bound, not config bound.
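To put the acceptance numbers in perspective, here is a rough sketch of how an acceptance percentage maps to tokens emitted per target-model step. It assumes the reported figure is accepted draft tokens divided by proposed draft tokens, and that each verification step also emits one token from the target model itself; `mean_tokens_per_step` is a hypothetical helper, not a vLLM API.

```python
def mean_tokens_per_step(acceptance_rate, num_speculative_tokens=15):
    """Average tokens emitted per target forward pass.

    ASSUMPTION: acceptance_rate = accepted drafts / proposed drafts, and
    every step yields one extra token from the target model.
    """
    if not 0.0 <= acceptance_rate <= 1.0:
        raise ValueError("acceptance_rate must be between 0 and 1")
    return 1.0 + acceptance_rate * num_speculative_tokens


# Acceptance levels mentioned in the observations above.
for rate in (0.14, 0.20, 0.35, 0.50):
    tokens = mean_tokens_per_step(rate)
    print(f"{rate:.0%} acceptance -> {tokens:.2f} tokens per target step")
```

Under these assumptions, 20-35% acceptance with 15 draft tokens works out to roughly 4 to 6 tokens per target step. That is not a direct speedup figure, since the drafter and the verification pass cost time of their own.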
Caveats
- Two prompts only — not an extensive sweep
- Default WebUI sampling settings (likely temperature ~0.7+); have not yet
retested with low temperature
Startup cost (something most benchmarks skip but matters in practice)
Wall-clock from container start → “Application startup complete”:
| Mode | Total | Weight load | Post-load (KV + warmup ± compile) |
|---|---|---|---|
| --enforce-eager | ~9.5 min | 6:55 | ~1 min |
| compile (no --enforce-eager) | ~18 min | 6:55 | 5:14 (“init engine” w/ Inductor compile + CUDA graph capture) + ~3 min routes |
Compile mode roughly doubles time-to-ready. Cache helps on subsequent
identical-arg launches, but any flag change invalidates it. Worth knowing
when iterating on recipes.
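One way to collect time-to-ready numbers like these is to start a timer alongside the container and poll the server until it answers. The sketch below assumes the vLLM OpenAI-compatible server is reachable at `http://localhost:8000/health` (the port depends on how the container is run); `http_ready` and `wait_until_ready` are hypothetical helper names. The timer starts when the function is called, so it should be invoked right as the container starts.

```python
import time
import urllib.error
import urllib.request


def http_ready(url, timeout=2.0):
    """Return True if a GET on `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or non-2xx: server is not ready yet.
        return False


def wait_until_ready(probe, poll_s=5.0, max_wait_s=3600.0,
                     clock=time.monotonic, sleep=time.sleep):
    """Poll `probe()` until it returns True; return elapsed seconds."""
    start = clock()
    while True:
        if probe():
            return clock() - start
        if clock() - start >= max_wait_s:
            raise TimeoutError(f"not ready after {max_wait_s:.0f} s")
        sleep(poll_s)
```

Usage, under the same assumptions: `wait_until_ready(lambda: http_ready("http://localhost:8000/health"))` returns the wall-clock seconds until the first successful probe, with a resolution of about one poll interval.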
Eager vs compile mode A/B (same prompts, same model, same flags otherwise)
Prompts: Q1 = “create http calculator w/ simple preview”; Q2 = “add exponential button”
(both sent through Open WebUI with its default sampling).
| Metric | --enforce-eager | compile mode | Δ |
|---|---|---|---|
| Time-to-ready | ~9.5 min | ~18 min | +9 min |
| Q1 cold (t/s) | 9.3 | 13.7 | +47 % |
| Q2 warm (t/s) | 35.9 | 36.2 | ~0 |
| Output tokens Q1 | 2152 | 2143 | — |
| Output tokens Q2 | 2025 | 2255 | — |
Compile mode buys ~4 t/s on the first cold request and nothing after. Doubling
the startup cost for that is a bad trade for normal use. Reverting to
--enforce-eager for this recipe.
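The trade can be put in seconds with a quick back-of-the-envelope calculation from the table above. This sketch assumes the reported t/s is the average rate over the whole response, so that output tokens divided by t/s approximates the request's generation time; the helper names are hypothetical.

```python
# Figures copied from the A/B table above.
EAGER = {"ready_min": 9.5, "q1_tokens": 2152, "q1_tps": 9.3}
COMPILE = {"ready_min": 18.0, "q1_tokens": 2143, "q1_tps": 13.7}


def generation_seconds(output_tokens, tokens_per_second):
    """Approximate generation time, assuming t/s is a whole-response average."""
    return output_tokens / tokens_per_second


def extra_startup_seconds(eager=EAGER, compiled=COMPILE):
    """Additional time-to-ready paid by compile mode."""
    return (compiled["ready_min"] - eager["ready_min"]) * 60.0


def cold_request_saving_seconds(eager=EAGER, compiled=COMPILE):
    """Time saved on the cold Q1 request by compile mode."""
    eager_s = generation_seconds(eager["q1_tokens"], eager["q1_tps"])
    compiled_s = generation_seconds(compiled["q1_tokens"], compiled["q1_tps"])
    return eager_s - compiled_s


extra = extra_startup_seconds()
saved = cold_request_saving_seconds()
print(f"extra startup: {extra:.0f} s, saved on cold Q1: {saved:.0f} s, "
      f"net cost: {extra - saved:.0f} s")
```

With these inputs the extra startup is about 510 s against roughly 75 s saved on the cold request, and the warm Q2 rates are effectively identical, so nothing later in the session recovers the difference.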