DFlash LLM for DGX Spark - too good to be true?

DFlash + Qwen3-Coder-Next on eugr’s spark-vllm-docker — early test

Confirming the gist’s 2-line SupportsEagle3 patch
( DFlash speculative decoding for Qwen3-Coder-Next on DGX Spark — 2-line vLLM patch, 88-108 tok/s · GitHub ) composes
cleanly with eugr’s spark-vllm-docker setup — wired in as a mods/ script,
applied at container start, no image rebuild needed.

Setup

  • Hardware: DGX Spark, GB10 (1 GPU, 128 GB unified memory)
  • Container: vllm-node-tf5 (eugr’s image), vLLM 0.19.1rc1.dev241+g4d042ed85.d20260413
  • Storage: model weights on USB-attached external SSD (ext4) — relevant
    for the load times below; an internal NVMe would likely be faster
  • Target: saricles/Qwen3-Coder-Next-NVFP4-GB10
  • Drafter: z-lab/Qwen3-Coder-Next-DFlash
  • Launch flags from gist (verbatim): --enforce-eager, --attention-backend flash_attn,
    --max-num-batched-tokens 32768, --max-num-seqs 4,
    --gpu-memory-utilization 0.60, num_speculative_tokens 15
  • --max-model-len: not set (vLLM auto-resolves to model default 262144), same as gist’s command

NVFP4 path looks healthy (no fallback)

Using FlashInferCutlassNvFp4LinearKernel for NVFP4 GEMM
Using 'FLASHINFER_CUTLASS' NvFp4 MoE backend
[Autotuner]: Tuning fp4_gemm: 100% (16/16) — completed

So we’re not hitting the architecture-mismatch / FP4 fallback issue from
later in the thread.

Throughput observed

Single-stream, code-generation prompt (HTML/CSS/JS calculator, ~500 token output):

  • Steady-state generation: ~36 tok/s
  • Draft acceptance: ~20-35%, occasionally 14%, briefly touching 50%+
  • First request ~minutes (cold KV cache + spec decode init); subsequent in
    the 27-36 t/s band

This is consistent with @eugr’s “31 t/s for complex tasks vs >70 t/s for
simple HTML generation” and @norman.2’s “10-25% initial acceptance, can
climb to 60-70%” — feels prompt-complexity bound, not config bound.

Caveats

  • Two prompts only — not an extensive sweep
  • Default WebUI sampling settings (likely temperature ~0.7+); have not yet
    retested with low temperature

Startup cost (something most benchmarks skip but matters in practice)

Wall-clock from container start → “Application startup complete”:

Mode Total Weight load Post-load (KV + warmup ± compile)
--enforce-eager ~9.5 min 6:55 ~1 min
compile (no --enforce-eager) ~18 min 6:55 5:14 (“init engine” w/ Inductor compile + CUDA graph capture) + ~3 min routes

Compile mode roughly doubles time-to-ready. Cache helps on subsequent
identical-arg launches, but any flag change invalidates it. Worth knowing
when iterating on recipes.

Eager vs compile mode A/B (same prompts, same model, same flags otherwise)

Prompts: Q1 = “create http calculator w/ simple preview”; Q2 = “add exponential button”
(both targeting Open WebUI default sampling).

Metric --enforce-eager compile mode Δ
Time-to-ready ~9.5 min ~18 min +9 min
Q1 cold (t/s) 9.3 13.7 +47 %
Q2 warm (t/s) 35.9 36.2 ~0
Output tokens Q1 2152 2143
Output tokens Q2 2025 2255

Compile mode buys ~4 t/s on the first cold request and nothing after. Doubling
the startup cost for that is a bad trade for normal use. Reverting to
--enforce-eager for this recipe.