│ Tokens │ RedHat+VLLM_CUTLASS │ GB10+VLLM_CUTLASS+compile │ GB10+FLASHINFER+eager │ GB10+FLASHINFER+compile │
├────────┼─────────────────────┼───────────────────────────┼───────────────────────┼─────────────────────────┤
│ 256 │ 57.6 │ 41.4 │ 42.2 │ 66.1 │
├────────┼─────────────────────┼───────────────────────────┼───────────────────────┼─────────────────────────┤
│ 512 │ 64.4 │ 71.4 │ 40.7 │ 84.4 │
├────────┼─────────────────────┼───────────────────────────┼───────────────────────┼─────────────────────────┤
│ 1024 │ 66.1 │ 76.3 │ 69.8 │ 71.4 │
├────────┼─────────────────────┼───────────────────────────┼───────────────────────┼─────────────────────────┤
│ 2048 │ 64.5 │ 75.7 │ 48.0 │ 79.1 │
└────────┴─────────────────────┴───────────────────────────┴───────────────────────┴─────────────────────────┘
It is real ;)
You are right about this. I got a 6% improvement here.
I am also running this as recipe + mod, works nicely, thanks for the enforce eager hint! :)
For Qwen3-Coder-Next-NVFP4-GB10 SGLang DFlash is +38% faster for short code (150 vs 108 tok/s) but slower for long sequences (-4%). The overall average is close (97 vs 101).
Still figuring out the details, will share later.
Funny, I was just asking about SGLang in the "Why do so many people here prefer vLLM?" thread. It's been on my list forever, but I never tried it; I was too involved with all the vLLM stuff.
How short exactly is short code? Like the 307 tokens you're showing in that screenshot?
Can you specify which image you're using and how you're launching vLLM/SGLang?
I’m using the SGLang nightly with CUDA 13 support:
lmsysorg/sglang:nightly-dev-cu13-20260415-2c9e76d3
DFlash was merged into SGLang on April 7 (PR #22077), so any nightly after that date should work. The older NVIDIA container (nvcr.io/nvidia/sglang:26.02-py3) does not have it (that one ships SGLang 0.5.8 which is too old).
Two patches are needed for Qwen3-Coder-Next NVFP4 on this image:
- qwen3_next.py: the GDN layers crash with compressed-tensors quantization because MergedColumnParallelLinear doesn't have a .weight attribute. Fix: wrap the _override_weight_loader calls in hasattr checks.
- expert_location.py: the EPLB module fails to resolve Qwen3NextForCausalLM via get_model_architecture(). Fix: a try/except around that call.
Both are volume-mounted, no rebuild needed.
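For reference, the shape of the qwen3_next.py fix looks roughly like this (a minimal sketch; the function name and call sites are illustrative, not the exact SGLang code):

```python
# Sketch of the hasattr guard described above. In the unpatched file the
# _override_weight_loader calls run unconditionally; quantized layers such
# as MergedColumnParallelLinear under compressed-tensors may not expose a
# .weight attribute, so we check before touching it.

def safe_override_weight_loader(layer, loader):
    """Attach a custom weight loader only if the layer exposes .weight."""
    if hasattr(layer, "weight"):
        layer.weight.weight_loader = loader
        return True
    # Quantized layer stores packed weights under other attributes;
    # skip the override instead of crashing.
    return False
```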
Launch command:
docker run -d --name sglang_production \
--gpus all --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 \
--restart unless-stopped \
-v /data/huggingface:/root/.cache/huggingface \
-v ~/sglang-patches/qwen3_next.py:/sgl-workspace/sglang/python/sglang/srt/models/qwen3_next.py \
-v ~/sglang-patches/expert_location.py:/sgl-workspace/sglang/python/sglang/srt/eplb/expert_location.py \
-p 8000:30000 \
-e HF_HUB_DISABLE_XET=1 \
-e SGLANG_ENABLE_JIT_DEEPGEMM=0 \
-e SGLANG_ENABLE_DEEP_GEMM=0 \
lmsysorg/sglang:nightly-dev-cu13-20260415-2c9e76d3 \
python3 -m sglang.launch_server \
--model-path saricles/Qwen3-Coder-Next-NVFP4-GB10 \
--served-model-name Qwen3-Coder-Next-DFlash \
--speculative-algorithm DFLASH \
--speculative-draft-model-path z-lab/Qwen3-Coder-Next-DFlash \
--attention-backend flashinfer \
--mem-fraction-static 0.55 \
--max-running-requests 4 \
--disable-cuda-graph \
--mamba-scheduler-strategy extra_buffer \
--tool-call-parser qwen3_coder \
--trust-remote-code \
--host 0.0.0.0 --port 30000
Key things:
- --mamba-scheduler-strategy extra_buffer is required because Qwen3-Next has hybrid GDN (recurrent) layers
- DeepGEMM is disabled (SGLANG_ENABLE_JIT_DEEPGEMM=0) — the scale format doesn’t match and causes accuracy issues on Blackwell
- FP4 GEMM backend auto-selects flashinfer_cudnn which is the fastest on SM120
- CUDA graphs help long generations but hurt short ones. I run without them since my workload is mostly code generation (short-to-medium)
- Startup takes about 5 minutes
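Once the server is up, a quick smoke test is just an OpenAI-style chat request against port 8000. A minimal sketch of the request body (the fields follow the usual OpenAI-compatible API that SGLang serves; the prompt is just an example):

```python
import json

def build_chat_request(prompt, model="Qwen3-Coder-Next-DFlash", max_tokens=256):
    """Build a JSON body for the OpenAI-compatible chat endpoint."""
    # "model" must match --served-model-name from the launch command above.
    return json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    })

# POST this body to http://localhost:8000/v1/chat/completions
body = build_chat_request("Write a hello world in Python.")
```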
I put the patches and full instructions in a gist (AI generated content): SGLang + DFlash on DGX Spark (Qwen3-Coder-Next NVFP4) — 150 tok/s · GitHub
Nice, I think I agree with Opus; the responsiveness in Claude Code, combined with Minimax 2.7 or Qwen 397B, would be impressive. It's great that things like this are coming out, since AI is getting more expensive.
Metric                         Value
─────────────────────────────  ─────────────
Draft tokens per step          16
Avg accept rate                0.26 (26%)
Avg tokens accepted per step   ~4.19
Baseline (no DFlash)           1 token/step
Test                Tokens   Time     Throughput
─────────────────   ──────   ──────   ───────────
Short (512 tok)     512      4.10s    124.8 tok/s
Medium (1024 tok)   1024     17.10s   59.9 tok/s
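As a sanity check on the acceptance numbers above (my simplification, not how SGLang computes its stats): for a block drafter, accepted tokens per step comes out to roughly block size times per-token accept rate, and 16 × 0.26 = 4.16, which lines up with the reported ~4.19.

```python
def expected_accepted_per_step(draft_tokens, accept_rate):
    # Crude first-order model: treat each drafted position as accepted
    # independently at the average accept rate. Real acceptance is
    # position-dependent (later tokens in the block are less likely to
    # survive), so this is a back-of-envelope check only.
    return draft_tokens * accept_rate

estimate = expected_accepted_per_step(16, 0.26)  # ~4.16, vs. measured ~4.19
```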
I have good results with Goose cli with Qwen3-Coder-Next. I added a coder agent to Claude Code where Opus4.6 does the orchestration and review and uses Goose cli for coding.
I'm using something similar: I route Claude Code through LiteLLM; Opus 4.6 stays the same, Sonnet maps to Qwen 397B, and Haiku to DFlash Qwen3 Coder Next. Claude Code itself launches agents and orchestrates everything. I'll try Goose, thanks.
Anyone else here noticed the nerfing and extreme subscription token burn for Opus 4.6 lately? Insane; in some cases it's dumber than some of my local models …
After 4 months, I wonder if it's time to recap llama.cpp, SGLang, and vLLM for the current state of compatibility and performance on consumer Blackwell, but specifically the Spark.
I used llama.cpp for gemma4 just to quickly get an impression of the model. What a breeze to get running; however, performance is really bad IMHO. Then again, I didn't invest too much time, since vLLM and now SGLang eat up enough of my experimentation time …
Currently I'm switching between qwen3.5 A122 and A27 and qwen-coder-next. Specifically, I'm doing code audits for security reasoning, so I don't need it to perfectly code something, but I need it to perfectly understand code and its dependencies …
At the moment it's tough. A27 often runs into weird thinking loops, but A122-int4 has been the worst model in terms of actual results so far.
Same here. The 35b-a3b-fp8 is so nimble; as long as you keep the context length short (~70k) it screams, though it can then fall off a cliff. However, this fix has toned down the thought-loop problem, and I haven't really experienced one since. The 122b-hybrid-int4fp8 handles larger context (~140k) but is very opinionated at times and requires grill-me sessions at the start to guide its thinking. I still turn to Qwen Coder Next AutoRound 16kv for debugging when the other two get stuck. Finally, the qwen3.5-enhanced.jinja chat template has reduced tool-call failures, which makes the whole experience much more confidence-inspiring.
At the end of the day, I feel like there is a lot of subjectivity. One day I think one model is better, then it gets stuck, reward-hacks, or chooses its own adventure, and another comes to the rescue. Was it fresh context? Was the jar lid already loosened? Some days I feel certain; then, after a bad experience, I don't. My AI coding, prompting, and workflows are all improving at the same time, so it's never a fair test. The benchmarks never reflect the real-life experience of coding with these models. I have to use them exclusively because my clients need air-gapped solutions, so no frontier models allowed.
═══ Benchmark ═══
[✓] Model: Intel/Qwen3.5-35B-A3B-int4-AutoRound
╔══════════════════════════════════════════════════════╗
║ Benchmark: Qwen3.5-35B-A3B-int4-AutoRound — 2026-04-16 15:59
╚══════════════════════════════════════════════════════╝
Warm-up... done
── Sequential (1 request) ──────────────────────────────
Run 1/2:
[Q&A ] 256 tokens in 1.98s = 129.2 tok/s
[Code ] 427 tokens in 2.62s = 162.6 tok/s
[JSON ] 1024 tokens in 7.06s = 144.9 tok/s
[Math ] 32 tokens in .25s = 126.9 tok/s
[LongCode ] 2048 tokens in 11.27s = 181.7 tok/s
Run 2/2:
[Q&A ] 256 tokens in 1.97s = 129.7 tok/s
[Code ] 402 tokens in 2.48s = 161.7 tok/s
[JSON ] 1024 tokens in 7.85s = 130.3 tok/s
[Math ] 32 tokens in .25s = 124.5 tok/s
[LongCode ] 2048 tokens in 11.26s = 181.7 tok/s
── Concurrent (4 parallel requests) ───────────────────────────
Sending 4 requests simultaneously, measuring total throughput...
[req1 ] 1024 tokens = 82.7 tok/s (end-to-end)
[req2 ] 1024 tokens = 83.7 tok/s (end-to-end)
[req3 ] 1024 tokens = 84.5 tok/s (end-to-end)
[req4 ] 1024 tokens = 82.4 tok/s (end-to-end)
Total: 4096 tokens in 12.44s
Total throughput: 329.2 tok/s (4 requests completed)
with @eugr's spark-vllm-docker, so fast
Hello all,
Thank you all for your work. I just spent two days trying Qwen3-coder-next-nvfp4 DFLASH in my hermes-agents. My experience is good in terms of speed and inference performance; however, the output quality, tool execution, and reasoning are poorer than with Qwen-3-coder-next-int4-autoround. So, for the performance actually benchmarked in the Spark arena, I would say the trade-off of this optimization is too steep.
This is my opinion, and I may have made mistakes.
Cheers !
DFlash and other drafting methods are transparent to the underlying model. So functionally it does not matter which drafter you choose; pick whichever is fastest for you in your use case. The fidelity of the quant relative to the full weight model used to train the drafter will also play into the effectiveness of the drafter.
You can use DFlash with the int4-autoround quant.
Both DFlash and AutoRound INT4 depend on how the base LLM is prepared. With sufficient time and hardware, you can build versions that outperform what is publicly available.
- AutoRound INT4 is not a simple BF16-to-INT4 conversion — it includes a training process to ensure the quantized model’s output closely matches the original BF16 quality. If you have the resources, you can train a custom AutoRound INT4 model tailored to your specific use cases. Intel provides a default version and the necessary tooling, but the results can be improved with custom training.
- DFlash uses a diffusion-based approach, meaning it generates a block of tokens at once rather than one token at a time — similar to how Stable Diffusion generates an image. Like any diffusion model, the public default version may produce artifacts (think of the classic “six fingers” issue in image generation). Training DFlash on your own data eliminates these problems.
That said, I currently don’t have the resources to run custom training for my specific cases. For now, I’m satisfied with AutoRound INT4, which delivers around 50 tokens/second — a good balance of speed and quality for my needs.
The root issue is simple (why DFlash comes out worse than AutoRound INT4 for you): neither of us has the budget or hardware to do that training (we are just beggars).
And honestly, there’s a bit of irony here: if either of us had the resources to properly train a high-quality DFlash or AutoRound INT4 model, we wouldn’t even need these optimizations in the first place — we could just run a full-quality model like GLM-5 directly and not bother with any of this.
Hey everyone,
I’ve been following the progress on NVFP4 and DFlash for the Spark, but I think we’re leaving massive performance on the table by ignoring kernel launch overhead.
Even with a tiny 0.1B–0.8B drafter, we’re wasting 30–50% of the drafting cycle in CPU-to-GPU “bubbles” (1.5ms+ of launch noise). This effectively kills the “speculative profit” when running against a high-speed target like Nemotron-3 Super.
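To make the "bubble" claim concrete (illustrative numbers only, matching the rough figures above, not measurements from a Spark):

```python
def launch_bubble_fraction(draft_step_ms, launch_overhead_ms):
    # Fraction of each drafting step spent waiting on CPU-side kernel
    # launches rather than doing useful GPU work.
    return launch_overhead_ms / draft_step_ms

# If a small drafter's forward pass takes ~3-5 ms and ~1.5 ms of that is
# launch noise, 30-50% of the drafting cycle is dead time:
low = launch_bubble_fraction(5.0, 1.5)   # 0.30
high = launch_bubble_fraction(3.0, 1.5)  # 0.50
```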
The Proposal: Why don't we just take the Luce Megakernel work and drop it in as the DFlash drafter?
- The Bridge: Luce fuses all 24 layers of Qwen 3.5-0.8B (DeltaNet/Attention hybrid) into a single persistent dispatch. Drafting becomes near-instant (~0.5ms), making the speculative step essentially "free."
- Proof of Portability: It's already been vibe-coded to an RTX 3060 by just changing the num_blocks constant. It should map to the Grace Blackwell (GB10) SMs with minimal friction. (Proof: The CUDA Trick That Makes LLMs Faster)
- The "Trifecta" ROI: By fusing the drafter, we can sustain wider draft trees (DDTree) without a latency penalty. This is the path to hitting 200+ tps on the 100B+ models we're already working on.
Repo: Lucebox-hub / Megakernel
Has anyone already tried dropping a persistent kernel drafter into the current DFlash/vLLM branch for the Spark? Seems like the most logical way to bridge the gap between our current 67 tps and the hardware’s actual ceiling.
