DFlash Gemma4-31B-it on Spark is here. ~2.5x speedup. It will get better with more training

Posting a working DGX Spark / GB10 path for RedHatAI/gemma-4-31B-it-speculator.dflash, since the successful setup here was not the stock path and took a few runtime patches and about 10 hours of whipping Codex.

The sanitized repro bundle is here:

There is also a thin published container path in that repo so DGX users can run the validated setup directly with docker run instead of cloning and wiring the scripts manually. If it isn’t there yet, it should show up within 10 to 20 minutes; still pushing.

That bundle and its documentation were assembled by OpenAI Codex from the working environment.

Test rig

  • Single NVIDIA DGX Spark / GB10

  • vllm/vllm-openai:nightly

  • vLLM 0.19.2rc1.dev21+g893611813

  • Verifier: google/gemma-4-31B-it

  • Draft: RedHatAI/gemma-4-31B-it-speculator.dflash

  • Runtime quantization: fp8

  • Text-only mode

  • max_model_len=16384

  • max_num_batched_tokens=16384

  • tensor_parallel_size=1

The non-obvious part

This did not work for us as a simple “force FlashAttention everywhere” launch.

What actually worked was a split-backend path:

  • Gemma 4 verifier stays on Triton

  • DFlash draft attention in qwen3_dflash is forced onto FlashAttention only for the draft path

The three runtime patches we needed on top of vLLM nightly were:

  1. Disable the prebuilt CUTLASS FP8 capability checks for GB10 so vLLM falls back to supported kernels.

  2. Advertise non-causal support in the Triton backend selector so DFlash can initialize.

  3. Force FlashAttentionBackend only inside qwen3_dflash.
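The shape of patch 3 can be illustrated with a stand-in monkeypatch. The classes and function names below (`TritonBackend`, `DraftAttention`, `resolve_backend`, `default_backend`) are illustration-only stand-ins, not real vLLM internals; the actual patch targets the backend selection inside qwen3_dflash and is in the repo.

```python
# Hypothetical sketch of the split-backend idea. All names here are
# stand-ins for illustration, NOT real vLLM APIs.

class TritonBackend:
    name = "TRITON"

class FlashAttentionBackend:
    name = "FLASH_ATTN"

def default_backend():
    # Global selection: on GB10 the Gemma verifier stays on Triton.
    return TritonBackend

class DraftAttention:
    # The draft path normally inherits the global selection...
    resolve_backend = staticmethod(default_backend)

# ...so the runtime patch overrides it for the draft path only,
# leaving the verifier's backend untouched.
DraftAttention.resolve_backend = staticmethod(lambda: FlashAttentionBackend)

print(default_backend().name)                 # verifier path unchanged: TRITON
print(DraftAttention.resolve_backend().name)  # draft path forced: FLASH_ATTN
```

The point is that nothing global changes: the verifier keeps resolving to Triton, and only the draft module's resolver is replaced at runtime.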

Launch shape

We used:

  • VLLM_DISABLE_COMPILE_CACHE=1

  • --quantization fp8

  • --max-model-len 16384

  • --max-num-batched-tokens 16384

  • --gpu-memory-utilization 0.80

  • --enforce-eager

  • --limit-mm-per-prompt '{"image":0,"video":0}'

  • --speculative-config '{"method":"dflash","model":"RedHatAI/gemma-4-31B-it-speculator.dflash","num_speculative_tokens":8}'
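Assembled into one command, the flags above look roughly like this. Treat it as a sketch of the launch shape, not a guaranteed invocation: exact flag spellings follow the vLLM nightly we used, and the full launchers (including the runtime patcher) are in the repo.

```shell
# Sketch of the launch shape; flags mirror the list above.
VLLM_DISABLE_COMPILE_CACHE=1 vllm serve google/gemma-4-31B-it \
  --quantization fp8 \
  --max-model-len 16384 \
  --max-num-batched-tokens 16384 \
  --gpu-memory-utilization 0.80 \
  --enforce-eager \
  --limit-mm-per-prompt '{"image":0,"video":0}' \
  --speculative-config '{"method":"dflash","model":"RedHatAI/gemma-4-31B-it-speculator.dflash","num_speculative_tokens":8}'
```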

We set num_speculative_tokens=8 explicitly.

Benchmark shape

For an apples-to-apples serve comparison, we used:

  • vllm bench serve

  • dataset: philschmid/mt-bench

  • num-prompts=80

  • max-concurrency=1

  • hf-output-len=2048

  • temperature=0
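The corresponding bench invocation was roughly the following. Flag names mirror the list above, but exact spellings can shift between vLLM versions, so check vllm bench serve --help against your build.

```shell
# Sketch of the benchmark invocation; parameters mirror the list above.
vllm bench serve \
  --model google/gemma-4-31B-it \
  --dataset-name hf \
  --dataset-path philschmid/mt-bench \
  --num-prompts 80 \
  --max-concurrency 1 \
  --hf-output-len 2048 \
  --temperature 0
```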

Observed results

Against a plain non-DFlash baseline on the same verifier and same harness:

  • baseline generation throughput: 5.53 tok/s average

  • DFlash generation throughput: 15.44 tok/s average

  • uplift: about 2.79x

  • DFlash average draft acceptance rate: 28.82%

  • observed DFlash generation range: 9.9 to 28.1 tok/s

  • observed DFlash acceptance range: 15.1% to 62.2%
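The uplift figure follows directly from the two averages:

```python
# Speedup implied by the two average throughputs reported above.
baseline_tok_s = 5.53
dflash_tok_s = 15.44
uplift = dflash_tok_s / baseline_tok_s
print(f"about {uplift:.2f}x")  # -> about 2.79x
```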

So the short version is: this draft does run on DGX Spark / GB10 with the stock Google verifier, but today it is not a zero-patch path in vLLM nightly. The key was splitting verifier and draft attention backends instead of trying to drive both through one backend.

The exact launchers, runtime patcher, and benchmark helper are in the repo above so others can reproduce the same path without any cluster-specific details.


I’ve been working on DFlash for LilaRest’s Gemma4 Turbo model for about a week, so it was nice to see RedHat drop this today. FYI: if you just want an easy 10 tok/s without DFlash, that’s the way to go.

2 Likes

Very nice! Does DFlash work with dual Sparks at TP=2?

Giving 2x a go now; if that works I’ll try 4x as well and post benchmarks when they’re ready.

Edit: I can confirm it works. I had to add some memory settings because Ray wanted to claim too much without them. It’s benching now; if I don’t post back for a while, I fell asleep. I’ll update the container and instructions as soon as they’re done and I’m conscious.

1 Like

This is not yet stable at TP=2. I’m getting a steady 18.5+ tok/s, sometimes much better, but Ray only stays up for ~10 requests before the dreaded ray.exceptions.RayChannelTimeoutError, and it took memory tuning just to get this far.

vLLM has quite a few problems that throw this error generically, and many related issues are still open. Trying the uncompiled DAG path got to about ~20 requests.

I’m still trying different things on this nightly build as a fix. I’m also trying TP=4 now, in the hope that it’s a generic memory-tuning issue that TP=4 might, ironically, resolve by letting Ray consume more memory rather than less, which should be possible at TP=4 without running out of memory.

The issue itself is not an OOM, but it may be related to the memory-tuning flags rather than memory itself. It’s a wild goose chase at this point until I can go through vLLM GitHub discussions, the nightly build code, and tracing.


TP=4 is more stable now, but still not stable: it failed at prompt 46 with TimeoutError: RPC call to sample_tokens timed out, followed by EngineDeadError.

200GbE but NCCL_IB_DISABLE=1. Unfortunately, performance appears capped at an overall average of 18.5-20 tok/s. This points to the earlier TP=2 timeouts being memory related despite not being OOMs: running TP=2 on this setup without memory tuning overcommits and fails, while the tuning needed to get past that stage introduces the timeout issues. So I’m troubleshooting the TP=4 issues and working backwards.

Looks like this might be highly relevant: [Bug]: RPC call to sample_tokens timed out. Qwen3.5-397B-A17B · Issue #35496 · vllm-project/vllm · GitHub – Trying fix now

Latest update: I have a bad DAC, so I’m going to pause this work until I get a replacement, as I have some other things I need to work on. I’ll come back to it, though.

Link error counters over ~15 seconds:

  • lane 1: 15201 → 20379

  • lane 2: 976 → 1270

  • lane 3: 15299 → 20485

For now, the cheapest, fastest, most stable way to run text-only 31B is still LilaRest’s Turbo model, which I’ve processed over 50K prompts on at 10 Tok/s (n=8) in TP=4 without a single issue.

Intel’s int4-AutoRound quant of Gemma4-31B-it gets 10-11 tokens/sec all day long and scales well with concurrency, at least up to 4. It would be a great starting point for DFlash.

I tried to map your patches to use int4-AutoRound as the base model with this drafter, but so far without success. int4 uses the MarlinLinearKernel for GPTQMarlinLinearMethod, and that path is not patched (your repo basically handles only the dynamic fp8 path with Triton).

It fails to start with ValueError: Selected backend AttentionBackendEnum.FLASH_ATTN is not valid for this configuration. Reason: ['head_size not supported']

I would really like to get drafting to work with Gemma4-31B. Would you consider making a patch for the int4-AutoRound path, @meanaverage?

Hi @joshua.dale.warner,

To be honest, I’m sort of wrapped up in another project, but I wanted to get back to you because this model (the Intel one) looks insane. Now I have something new to learn. If it really works at similar accuracy (and that’s always the if) while still providing the image abilities, that is really something.

I see Intel’s AutoRound models like this one are listed at 6B params vs. 33B for Google’s/LilaRest’s, which is confusing, since parameter count shouldn’t change from quantization alone. I’m not familiar with AutoRound or I32, though, and the Hugging Face model page doesn’t show any missing files, so this looks really impressive. That said, they don’t include any benchmarks or accuracy claims directly on the page.

As for FLASH_ATTN, the whole AI tinkerer world is basically self-patching both FlashInfer and FlashAttention to work with more models, in more quants, on more platforms. The error you’re seeing isn’t unusual; it’s the standard way you discover a configuration hasn’t been enabled officially. It’s usually the first thing you run into when patching, before you have to patch another two or three dozen things.

I’ll give it a try tomorrow. How important is the vision part to you? :) It’s much easier to turn it off than to try to work around it, surely.

HF is just wrong about int4-AutoRound quant sizes. It’s the full model at W4A16, with select tensors skipped and kept in BF16. We use these extensively. Right now, int4 with the Marlin backend is generally the optimal mix of speed and performance.

1 Like