DFlash LLM for DGX Spark - too good to be true?

I logged on today expecting to see a discussion about this.

Is it for real? Almost too good to be true.

Been looking at DFlash and it seems real. It's diffusion-based speculative decoding from z-lab.

Seems like this guy got it working on the spark. Going to test this out.

BTW, DFlash is supported in our community spark-vllm-docker too.

AWESOME! Going to test it out. Do I run it via --speculative-config?

Basically just rebuild and use the recommended flags in those repos.

It works, but currently only on the flash attention backend. That holds it back. Also, depending on the domain, acceptance rates can vary substantially.

An example (it works with quantized models too, not just BF16 ones!):

If you haven’t rebuilt the container recently, rebuild it first:

git pull
./build-and-copy.sh --tf5

Then run:

./launch-cluster.sh -t vllm-node-tf5 \
--solo \
--apply-mod mods/fix-qwen3.5-chat-template \
-e VLLM_MARLIN_USE_ATOMIC_ADD=1 \
exec vllm serve Intel/Qwen3.5-35B-A3B-int4-AutoRound \
  --host 0.0.0.0 \
  --port 8888 \
  --max-model-len 262144 \
  --max-num-batched-tokens 32768 \
  --gpu-memory-utilization 0.7 \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_xml \
  --load-format fastsafetensors \
  --enable-prefix-caching \
  --chat-template unsloth.jinja \
  --speculative-config '{"method": "dflash", "model": "z-lab/Qwen3.5-35B-A3B-DFlash", "num_speculative_tokens": 15}' \
  --attention-backend flash_attn
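
The JSON passed to --speculative-config is easy to break with shell quoting. Here is a minimal Python sketch that builds the same argument from a dict. The values are copied from the command above; the helper name speculative_config_arg is made up for illustration and is not part of vLLM or the community repo.

```python
import json
import shlex


def speculative_config_arg(method: str, model: str, num_speculative_tokens: int) -> str:
    """Return a shell-safe '--speculative-config <json>' fragment (hypothetical helper)."""
    config = {
        "method": method,
        "model": model,
        "num_speculative_tokens": num_speculative_tokens,
    }
    # json.dumps produces valid JSON; shlex.quote wraps it so the shell passes it through unchanged.
    return "--speculative-config " + shlex.quote(json.dumps(config))


# Values taken from the launch command in this thread.
arg = speculative_config_arg("dflash", "z-lab/Qwen3.5-35B-A3B-DFlash", 15)
print(arg)
```

Paste the printed fragment into the vllm serve command in place of the hand-written one.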

The performance varies depending on the task.

With llama-benchy or anything with non-trivial context (summarize a book, etc), the performance will likely be worse than the model without spec decoding.

I’m getting 31 t/s on this model with llama-benchy.

For something like “write me a hello, world page in HTML” it will give you >70 t/s (>100 t/s without a system prompt with tool definitions).

I’m working on a llama-benchy extension to measure spec. decoding for different types of prompts.

I tried this on some real tasks I often do, with the repo above, and the Avg Draft acceptance rate is between 10% and 25% for me. After a while it sometimes goes up to 60-70%, though. So basically it is not that good most of the time, and not that much of a speedup. You probably have to measure your own workloads to see if it works for you :) Also, the repo container crashed with some CUDA stuff as well … so probably stick to the tested community stuff :D
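
To put those acceptance rates in perspective, here is a rough sketch. It assumes the reported rate is accepted draft tokens divided by proposed draft tokens, and that every step proposes 15 draft tokens (the num_speculative_tokens value from the command earlier in the thread) plus one token that the target model always produces itself. Under those assumptions, the average number of tokens emitted per target-model step is 1 + rate × 15. The function name is made up for illustration.

```python
def tokens_per_step(acceptance_rate: float, num_speculative_tokens: int = 15) -> float:
    """Average tokens emitted per target-model step.

    Assumes acceptance_rate = accepted draft tokens / proposed draft tokens,
    and one extra token from the target model on every step.
    """
    if not 0.0 <= acceptance_rate <= 1.0:
        raise ValueError("acceptance_rate must be between 0 and 1")
    return 1.0 + acceptance_rate * num_speculative_tokens


# Acceptance rates quoted in the post above.
for rate in (0.10, 0.25, 0.60, 0.70):
    print(f"acceptance {rate:.0%}: {tokens_per_step(rate):.2f} tokens per step")
```

So 10-25% acceptance works out to about 2.5 to 4.75 tokens per step. This is an upper bound on the gain, not a wall-clock speedup: each step also pays for running the drafter and for verifying all 15 proposed tokens, which is one plausible reason long-context workloads can end up slower than plain decoding.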

Edit: The crash may well have been caused by the speculative decoding stuff tho.


── Run 1/2 ──────────────────────────────────────
  [Q&A] 256 tokens in 2.14s = 119.6 tok/s (prompt: 23)
  [Code] 494 tokens in 3.11s = 158.8 tok/s (prompt: 30)
  [JSON] 1024 tokens in 7.88s = 129.9 tok/s (prompt: 48)
  [Math] 64 tokens in .43s = 148.8 tok/s (prompt: 29)
  [LongCode] 2048 tokens in 11.67s = 175.4 tok/s (prompt: 37)

── Run 2/2 ──────────────────────────────────────
  [Q&A] 256 tokens in 2.14s = 119.6 tok/s (prompt: 23)
  [Code] 494 tokens in 3.11s = 158.8 tok/s (prompt: 30)
  [JSON] 1024 tokens in 8.47s = 120.8 tok/s (prompt: 48)
  [Math] 64 tokens in .51s = 125.4 tok/s (prompt: 29)
  [LongCode] 2048 tokens in 13.28s = 154.2 tok/s (prompt: 37)
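
A small sketch for averaging these numbers across runs. The regex is written against the output format shown above; the script that produced that output isn't shown in the thread, so treat the parser as an assumption about the format rather than a companion tool.

```python
import re
from collections import defaultdict
from statistics import mean

# Output pasted from the two runs above.
LOG = """\
── Run 1/2 ──
  [Q&A] 256 tokens in 2.14s = 119.6 tok/s (prompt: 23)
  [Code] 494 tokens in 3.11s = 158.8 tok/s (prompt: 30)
  [JSON] 1024 tokens in 7.88s = 129.9 tok/s (prompt: 48)
  [Math] 64 tokens in .43s = 148.8 tok/s (prompt: 29)
  [LongCode] 2048 tokens in 11.67s = 175.4 tok/s (prompt: 37)
── Run 2/2 ──
  [Q&A] 256 tokens in 2.14s = 119.6 tok/s (prompt: 23)
  [Code] 494 tokens in 3.11s = 158.8 tok/s (prompt: 30)
  [JSON] 1024 tokens in 8.47s = 120.8 tok/s (prompt: 48)
  [Math] 64 tokens in .51s = 125.4 tok/s (prompt: 29)
  [LongCode] 2048 tokens in 13.28s = 154.2 tok/s (prompt: 37)
"""

# Seconds may be printed without a leading zero (".43s"), hence \d*\.?\d+.
LINE = re.compile(
    r"\[(?P<task>[^\]]+)\]\s+(?P<tokens>\d+) tokens in "
    r"(?P<secs>\d*\.?\d+)s = (?P<tps>[\d.]+) tok/s"
)


def average_tps(log: str) -> dict[str, float]:
    """Average the reported tok/s per task category across all runs in the log."""
    per_task: dict[str, list[float]] = defaultdict(list)
    for match in LINE.finditer(log):
        per_task[match["task"]].append(float(match["tps"]))
    return {task: mean(values) for task, values in per_task.items()}


averages = average_tps(LOG)
for task, tps in averages.items():
    print(f"{task:9s} {tps:6.1f} tok/s")
```

With only two runs, the spread on the JSON, Math and LongCode rows is already 7-20%, so more runs would be needed before comparing configurations.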


dflash qwen35b-int4 tested

We started to talk about DFlash here: Qwen3.5 27B optimisation thread starting at 30+ t/s TP=1 - #14 by dkopko

It works even with the “plain” vLLM docker nightly image.

It runs stable with eugr’s docker image, just FYI :)

Got DFlash working with Qwen3-Coder-Next on my Thinkstation PGX with 88-108 tok/s.

Turns out the EAGLE3 hidden state collection is already fully implemented in vLLM’s Qwen3NextModel (via EagleModelMixin), but the outer class is missing the SupportsEagle3 marker interface. Without it, vLLM refuses to activate DFlash.

The fix is 2 lines: add SupportsEagle3 to the imports and class inheritance in qwen3_next.py. No new files, no Docker rebuild (just volume-mount the patched file).
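
For anyone unfamiliar with the pattern, here is a toy sketch of why a missing marker interface blocks a feature that is otherwise fully implemented. Every class and function name below is hypothetical, and this is not vLLM's code; the real check inside vLLM may look different. It only illustrates the idea of a capability gate based on class inheritance.

```python
class SupportsDraftHiddenStates:
    """Hypothetical marker interface: no methods, it only advertises a capability."""


class HiddenStateMixin:
    """Hypothetical mixin that actually implements hidden state collection."""

    def collect_hidden_states(self) -> list[str]:
        return ["layer_a", "layer_b"]


class InnerModel(HiddenStateMixin):
    """Inner model: the functionality is already here."""


class OuterUnpatched:
    """Outer wrapper class that forgot to declare the marker."""

    def __init__(self) -> None:
        self.model = InnerModel()


class OuterPatched(OuterUnpatched, SupportsDraftHiddenStates):
    """Same wrapper with the marker added to its bases (the whole 'patch')."""


def can_enable_drafter(model: object) -> bool:
    # The gate inspects the outer class only, not what the inner model can do.
    return isinstance(model, SupportsDraftHiddenStates)


print(can_enable_drafter(OuterUnpatched()))
print(can_enable_drafter(OuterPatched()))
```

Both wrappers hold an inner model that can collect hidden states, but only the one that inherits the marker passes the gate. That is why adding an import and a base class can be enough.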

  • Target model: saricles/Qwen3-Coder-Next-NVFP4-GB10 (43 GB)
  • Drafter: z-lab/Qwen3-Coder-Next-DFlash (~900 MB)
  • Image: vllm-node-tf5 (vLLM v0.19.1)

Full writeup with patch, docker run command, and benchmarks: DFlash speculative decoding for Qwen3-Coder-Next on DGX Spark — 2-line vLLM patch, 88-108 tok/s · GitHub

INT4-AutoRound does not work with DFlash. Use NVFP4 or FP8. Also keep gpu-memory-utilization at 0.60 or below (I had two OOM crashes at 0.85 before learning that lesson).

My first testing results are quite surprising. It seems that the quality is higher than the INT4 I used before. I created a test where several snippets are shown and the model has to decide which fact is true, based on context. Overall better grounded models score higher. This is one of the few times a model has a perfect score.

Do you have an image that actually can handle NVFP4 without:

(EngineCore pid=108) WARNING 04-14 12:41:13 [marlin.py:34] Your GPU does not have native support for FP4 computation but FP4 quantization is being used. Weight-only FP4 compression will be used leveraging the Marlin kernel. This may degrade performance for compute-heavy workloads.
(EngineCore pid=108) WARNING 04-14 12:41:13 [marlin_utils_fp4.py:300] Your GPU does not have native support for FP4 computation but FP4 quantization is being used. Weight-only FP4 compression will be used leveraging the Marlin kernel. This may degrade performance for compute-heavy workloads.

As that’s the usual problem. But currently giving it a try :)

That Marlin fallback warning is strange on Blackwell since FP4 should be native. On mine the logs show the correct kernels:

Using FlashInferCutlassNvFp4LinearKernel for NVFP4 GEMM
Using ‘FLASHINFER_CUTLASS’ NvFp4 MoE backend

No Marlin warnings at all. My image is vllm-node-tf5, built from the NVIDIA DGX Spark playbooks Dockerfile with TORCH_CUDA_ARCH_LIST=12.1a and FLASHINFER_CUDA_ARCH_LIST=12.1a. Which image are you using? If it wasn’t compiled for arch 12.1 it might not pick up the native FP4 paths.

I thought that was the usual thing as the consumer GB10 does not support this natively.

Running the eugr TF5 image:

0.19.1rc1.dev241+g4d042ed85.d20260413

Ah my bad I used the ENV vars from the HF page :D ooops. Removed them, now it uses cutlass as well. As usual the performance does seem a bit better with the marlin backend though. But not much of a difference so far.

But at least that should mean the cutlass patches and fixes for NVFP4 have finally been integrated and it is as fast as marlin apparently?

There is also DFlash for gpt-oss-120b, which is already fast on Spark and might get a real boost from it.
Nevertheless, it currently does not work with our community build (spark-vllm-docker).
Does anyone have an idea what needs to be tweaked to make MXFP4 and DFlash work together?

Oh look, another new speculative decoding thing just dropped, yay!

DFlash + Qwen3-Coder-Next on eugr’s spark-vllm-docker — early test

Confirming the gist’s 2-line SupportsEagle3 patch
( DFlash speculative decoding for Qwen3-Coder-Next on DGX Spark — 2-line vLLM patch, 88-108 tok/s · GitHub ) composes
cleanly with eugr’s spark-vllm-docker setup — wired in as a mods/ script,
applied at container start, no image rebuild needed.

Setup

  • Hardware: DGX Spark, GB10 (1 GPU, 128 GB unified memory)
  • Container: vllm-node-tf5 (eugr’s image), vLLM 0.19.1rc1.dev241+g4d042ed85.d20260413
  • Storage: model weights on USB-attached external SSD (ext4) — relevant
    for the load times below; an internal NVMe would likely be faster
  • Target: saricles/Qwen3-Coder-Next-NVFP4-GB10
  • Drafter: z-lab/Qwen3-Coder-Next-DFlash
  • Launch flags from gist (verbatim): --enforce-eager, --attention-backend flash_attn,
    --max-num-batched-tokens 32768, --max-num-seqs 4,
    --gpu-memory-utilization 0.60, num_speculative_tokens 15
  • --max-model-len: not set (vLLM auto-resolves to model default 262144), same as gist’s command

NVFP4 path looks healthy (no fallback)

Using FlashInferCutlassNvFp4LinearKernel for NVFP4 GEMM
Using 'FLASHINFER_CUTLASS' NvFp4 MoE backend
[Autotuner]: Tuning fp4_gemm: 100% (16/16) — completed

So we’re not hitting the architecture-mismatch / FP4 fallback issue from
later in the thread.

Throughput observed

Single-stream, code-generation prompt (HTML/CSS/JS calculator, ~500 token output):

  • Steady-state generation: ~36 tok/s
  • Draft acceptance: ~20-35%, occasionally 14%, briefly touching 50%+
  • First request ~minutes (cold KV cache + spec decode init); subsequent in
    the 27-36 t/s band

This is consistent with @eugr’s “31 t/s for complex tasks vs >70 t/s for
simple HTML generation” and @norman.2’s “10-25% initial acceptance, can
climb to 60-70%” — feels prompt-complexity bound, not config bound.

Caveats

  • Two prompts only — not an extensive sweep
  • Default WebUI sampling settings (likely temperature ~0.7+); have not yet
    retested with low temperature

Startup cost (something most benchmarks skip but matters in practice)

Wall-clock from container start → “Application startup complete”:

Mode                         | Total    | Weight load | Post-load (KV + warmup ± compile)
--enforce-eager              | ~9.5 min | 6:55        | ~1 min
compile (no --enforce-eager) | ~18 min  | 6:55        | 5:14 (“init engine” w/ Inductor compile + CUDA graph capture) + ~3 min routes

Compile mode roughly doubles time-to-ready. Cache helps on subsequent
identical-arg launches, but any flag change invalidates it. Worth knowing
when iterating on recipes.

Eager vs compile mode A/B (same prompts, same model, same flags otherwise)

Prompts: Q1 = “create http calculator w/ simple preview”; Q2 = “add exponential button”
(both targeting Open WebUI default sampling).

Metric           | --enforce-eager | compile mode | Δ
Time-to-ready    | ~9.5 min        | ~18 min      | +9 min
Q1 cold (t/s)    | 9.3             | 13.7         | +47 %
Q2 warm (t/s)    | 35.9            | 36.2         | ~0
Output tokens Q1 | 2152            | 2143         |
Output tokens Q2 | 2025            | 2255         |

Compile mode buys ~4 t/s on the first cold request and nothing after. Doubling
the startup cost for that is a bad trade for normal use. Reverting to
--enforce-eager for this recipe.
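
A back-of-the-envelope break-even for that trade, using only the numbers from the A/B table. It treats each single measurement as exact, which it is not: the 35.9 vs 36.2 t/s gap is well within run-to-run noise, so read the result as an order of magnitude at best.

```python
# Figures from the A/B table above (single measurements, treated as exact).
EAGER_READY_MIN, COMPILE_READY_MIN = 9.5, 18.0
EAGER_COLD_TPS, COMPILE_COLD_TPS = 9.3, 13.7
EAGER_WARM_TPS, COMPILE_WARM_TPS = 35.9, 36.2
EAGER_Q1_TOKENS, COMPILE_Q1_TOKENS = 2152, 2143

# Extra wall-clock time spent waiting for compile mode to become ready.
extra_startup_s = (COMPILE_READY_MIN - EAGER_READY_MIN) * 60

# One-off saving on the first (cold) request.
cold_saving_s = EAGER_Q1_TOKENS / EAGER_COLD_TPS - COMPILE_Q1_TOKENS / COMPILE_COLD_TPS

# Saving per generated token once both modes are warm.
per_token_saving_s = 1 / EAGER_WARM_TPS - 1 / COMPILE_WARM_TPS

# Warm tokens needed before compile mode has paid back its startup cost.
break_even_tokens = (extra_startup_s - cold_saving_s) / per_token_saving_s

print(f"extra startup:     {extra_startup_s:.0f} s")
print(f"cold-request gain: {cold_saving_s:.0f} s")
print(f"break-even after:  {break_even_tokens:,.0f} warm tokens")
```

On these figures the cold request recovers about 75 s of the 510 s extra startup, and the rest takes on the order of 1.9 million warm tokens to recoup, far more than a single-stream session between restarts is likely to generate.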

Marlin Backend seems to still be around 10%+ faster by the way :)

I just did a test and got it to work about 23% faster. For now it only works with transformers and lacks tool calling. I will try forking vLLM and proposing some changes.