DFlash LLM for DGX Spark - too good to be true?

sggin1 · April 14, 2026, 3:41pm

DFlash + Qwen3-Coder-Next on eugr’s spark-vllm-docker — early test

Confirming that the gist’s 2-line SupportsEagle3 patch (GitHub gist:
“DFlash speculative decoding for Qwen3-Coder-Next on DGX Spark”, a 2-line
vLLM patch reporting 88-108 tok/s) composes cleanly with eugr’s
spark-vllm-docker setup. It is wired in as a mods/ script and applied at
container start, so no image rebuild is needed.

Setup

Hardware: DGX Spark, GB10 (1 GPU, 128 GB unified memory)
Container: vllm-node-tf5 (eugr’s image), vLLM 0.19.1rc1.dev241+g4d042ed85.d20260413
Storage: model weights on USB-attached external SSD (ext4) — relevant
for the load times below; an internal NVMe would likely be faster
Target: saricles/Qwen3-Coder-Next-NVFP4-GB10
Drafter: z-lab/Qwen3-Coder-Next-DFlash
Launch flags from gist (verbatim): --enforce-eager, --attention-backend flash_attn,
--max-num-batched-tokens 32768, --max-num-seqs 4,
--gpu-memory-utilization 0.60, num_speculative_tokens 15
--max-model-len: not set (vLLM auto-resolves to model default 262144), same as gist’s command
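
For anyone scripting the launch, here is a minimal sketch that assembles the flags listed above into a command line. It is not the gist's command verbatim: passing the drafter through vLLM's `--speculative-config` JSON, and the keys used in it, are my assumptions, and the gist may set further keys (for example a method name) that are not shown here. `build_command` is a hypothetical helper name.

```python
import json
import shlex

TARGET = "saricles/Qwen3-Coder-Next-NVFP4-GB10"
DRAFTER = "z-lab/Qwen3-Coder-Next-DFlash"


def build_command(target: str = TARGET, drafter: str = DRAFTER) -> list[str]:
    """Assemble a `vllm serve` argv from the settings listed in Setup."""
    # Assumption: the drafter and draft length go into --speculative-config;
    # the gist may include additional keys that this sketch leaves out.
    spec_config = {"model": drafter, "num_speculative_tokens": 15}
    return [
        "vllm", "serve", target,
        "--enforce-eager",
        "--attention-backend", "flash_attn",
        "--max-num-batched-tokens", "32768",
        "--max-num-seqs", "4",
        "--gpu-memory-utilization", "0.60",
        "--speculative-config", json.dumps(spec_config),
        # --max-model-len is deliberately left unset, as in the post.
    ]


# Print a copy-pasteable, shell-quoted version of the command.
print(shlex.join(build_command()))
```

Keeping the flags in one list makes the eager vs compile A/B a one-line change (drop `--enforce-eager`).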

NVFP4 path looks healthy (no fallback)

Using FlashInferCutlassNvFp4LinearKernel for NVFP4 GEMM
Using 'FLASHINFER_CUTLASS' NvFp4 MoE backend
[Autotuner]: Tuning fp4_gemm: 100% (16/16) — completed

So we’re not hitting the architecture-mismatch / FP4 fallback issue from
later in the thread.

Throughput observed

Single-stream, code-generation prompt (HTML/CSS/JS calculator, ~500 token output):

Steady-state generation: ~36 tok/s
Draft acceptance: ~20-35%, occasionally 14%, briefly touching 50%+
First request takes on the order of minutes (cold KV cache + spec decode
init); subsequent requests land in the 27-36 t/s band

This is consistent with @eugr’s “31 t/s for complex tasks vs >70 t/s for
simple HTML generation” and @norman.2’s “10-25% initial acceptance, can
climb to 60-70%”. Throughput looks bound by prompt complexity, not by
config.

Caveats

Two prompts only — not an extensive sweep
Default WebUI sampling settings (likely temperature ~0.7+); have not yet
retested with low temperature

Startup cost (something most benchmarks skip but matters in practice)

Wall-clock from container start → “Application startup complete”:

| Mode | Total | Weight load | Post-load (KV + warmup ± compile) |
|---|---|---|---|
| `--enforce-eager` | ~9.5 min | 6:55 | ~1 min |
| compile (no `--enforce-eager`) | ~18 min | 6:55 | 5:14 (“init engine” w/ Inductor compile + CUDA graph capture) + ~3 min routes |

Compile mode roughly doubles time-to-ready. Cache helps on subsequent
identical-arg launches, but any flag change invalidates it. Worth knowing
when iterating on recipes.
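
If you want to reproduce the time-to-ready numbers, a minimal sketch along these lines works: feed it the container's log lines and it reports how long it took for the ready marker to show up. The helper names (`time_to_ready`, `format_mmss`) are mine, and the timer starts when the function starts reading, so it only matches container start if both are launched together.

```python
import time
from typing import Callable, Iterable, Optional

READY_MARKER = "Application startup complete"


def time_to_ready(
    lines: Iterable[str],
    marker: str = READY_MARKER,
    clock: Callable[[], float] = time.monotonic,
) -> Optional[float]:
    """Return seconds from the first read until a line contains the marker.

    Returns None if the stream ends without the marker ever appearing.
    """
    start = clock()
    for line in lines:
        if marker in line:
            return clock() - start
    return None


def format_mmss(seconds: float) -> str:
    """Format a duration as m:ss, matching the style of the table above."""
    minutes, secs = divmod(int(round(seconds)), 60)
    return f"{minutes}:{secs:02d}"


# Demo on canned log lines; a real run would iterate over a live log stream.
demo_log = ["Loading weights...", "INFO:     Application startup complete."]
elapsed = time_to_ready(demo_log)
print("ready" if elapsed is not None else "marker never seen")
```

For a live measurement, pass `sys.stdin` as `lines` and pipe the output of `docker logs -f` for your container into the script.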

Eager vs compile mode A/B (same prompts, same model, same flags otherwise)

Prompts: Q1 = “create http calculator w/ simple preview”; Q2 = “add exponential button”
(both sent from Open WebUI with its default sampling settings).

| Metric | `--enforce-eager` | compile mode | Δ |
|---|---|---|---|
| Time-to-ready | ~9.5 min | ~18 min | +9 min |
| Q1 cold (t/s) | 9.3 | 13.7 | +47 % |
| Q2 warm (t/s) | 35.9 | 36.2 | ~0 |
| Output tokens Q1 | 2152 | 2143 | n/a |
| Output tokens Q2 | 2025 | 2255 | n/a |

Compile mode buys ~4 t/s on the first cold request and nothing after. Doubling
the startup cost for that is a bad trade for normal use. Reverting to
--enforce-eager for this recipe.
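
The trade can be put in seconds with the numbers from the table. This is a rough sketch that assumes the t/s figures are output tokens divided by generation time, and that the warm-request difference (35.9 vs 36.2 t/s) is noise. The extra startup is taken as 18 minus 9.5 minutes, which the table rounds to +9 min.

```python
# Figures copied from the A/B table above.
EAGER_READY_MIN = 9.5
COMPILE_READY_MIN = 18.0
Q1_EAGER = {"tokens": 2152, "tok_per_s": 9.3}
Q1_COMPILE = {"tokens": 2143, "tok_per_s": 13.7}


def generation_seconds(run: dict) -> float:
    """Generation time implied by an output token count and a t/s rate."""
    return run["tokens"] / run["tok_per_s"]


# Extra wall-clock paid once per launch in compile mode.
extra_startup_s = (COMPILE_READY_MIN - EAGER_READY_MIN) * 60

# Time saved on the single cold request, the only place compile mode helped.
saved_on_q1_s = generation_seconds(Q1_EAGER) - generation_seconds(Q1_COMPILE)

# Positive value means compile mode is slower overall for one launch.
net_cost_s = extra_startup_s - saved_on_q1_s

print(f"extra startup: {extra_startup_s:.0f} s")
print(f"saved on cold Q1: {saved_on_q1_s:.0f} s")
print(f"net cost of compile mode per launch: {net_cost_s:.0f} s")
```

On these numbers compile mode saves a bit over a minute on the cold request and costs about 8.5 minutes of startup, so it stays a net loss per launch unless the compile cache is hit.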
