DFlash Gemma4-31B-it on Spark is here. ~2.5x speedup. It will get better with more training

Posting a working DGX Spark / GB10 path for RedHatAI/gemma-4-31B-it-speculator.dflash, since the successful setup here was not the stock path and took a few runtime patches and about 10 hours of whipping Codex.

The sanitized repro bundle is here:

There is also a thin published container path in that repo so DGX users can run the validated setup directly with docker run instead of cloning and wiring the scripts manually. If it isn’t there yet, it should show up within 10 to 20 minutes; still pushing.

That bundle and its documentation were assembled by OpenAI Codex from the working environment.

Test rig

  • Single NVIDIA DGX Spark / GB10

  • vllm/vllm-openai:nightly

  • vLLM 0.19.2rc1.dev21+g893611813

  • Verifier: google/gemma-4-31B-it

  • Draft: RedHatAI/gemma-4-31B-it-speculator.dflash

  • Runtime quantization: fp8

  • Text-only mode

  • max_model_len=16384

  • max_num_batched_tokens=16384

  • tensor_parallel_size=1

The non-obvious part

This did not work for us as a simple “force FlashAttention everywhere” launch.

What actually worked was a split-backend path:

  • Gemma 4 verifier stays on Triton

  • DFlash draft attention in qwen3_dflash is forced onto FlashAttention only for the draft path

The three runtime patches we needed on top of vLLM nightly were:

  1. Disable the prebuilt CUTLASS FP8 capability checks for GB10 so vLLM falls back to supported kernels.

  2. Advertise non-causal support in the Triton backend selector so DFlash can initialize.

  3. Force FlashAttentionBackend only inside qwen3_dflash.
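The shape of patch 3 can be illustrated with a stand-in monkeypatch. The classes and function names below (`TritonBackend`, `DraftAttention`, `resolve_backend`, `default_backend`) are illustration-only stand-ins, not real vLLM internals; the actual patch targets the backend selection inside qwen3_dflash and is in the repo.

```python
# Hypothetical sketch of the split-backend idea. All names here are
# stand-ins for illustration, NOT real vLLM APIs.

class TritonBackend:
    name = "TRITON"

class FlashAttentionBackend:
    name = "FLASH_ATTN"

def default_backend():
    # Global selection: on GB10 the Gemma verifier stays on Triton.
    return TritonBackend

class DraftAttention:
    # The draft path normally inherits the global selection...
    resolve_backend = staticmethod(default_backend)

# ...so the runtime patch overrides it for the draft path only,
# leaving the verifier's backend untouched.
DraftAttention.resolve_backend = staticmethod(lambda: FlashAttentionBackend)

print(default_backend().name)                 # verifier path unchanged: TRITON
print(DraftAttention.resolve_backend().name)  # draft path forced: FLASH_ATTN
```

The point is that nothing global changes: the verifier keeps resolving to Triton, and only the draft module's resolver is replaced at runtime.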

Launch shape

We used:

  • VLLM_DISABLE_COMPILE_CACHE=1

  • --quantization fp8

  • --max-model-len 16384

  • --max-num-batched-tokens 16384

  • --gpu-memory-utilization 0.80

  • --enforce-eager

  • --limit-mm-per-prompt '{"image":0,"video":0}'

  • --speculative-config '{"method":"dflash","model":"RedHatAI/gemma-4-31B-it-speculator.dflash","num_speculative_tokens":8}'
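Assembled into one command, the flags above look roughly like this. Treat it as a sketch of the launch shape, not a guaranteed invocation: exact flag spellings follow the vLLM nightly we used, and the full launchers (including the runtime patcher) are in the repo.

```shell
# Sketch of the launch shape; flags mirror the list above.
VLLM_DISABLE_COMPILE_CACHE=1 vllm serve google/gemma-4-31B-it \
  --quantization fp8 \
  --max-model-len 16384 \
  --max-num-batched-tokens 16384 \
  --gpu-memory-utilization 0.80 \
  --enforce-eager \
  --limit-mm-per-prompt '{"image":0,"video":0}' \
  --speculative-config '{"method":"dflash","model":"RedHatAI/gemma-4-31B-it-speculator.dflash","num_speculative_tokens":8}'
```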

We set num_speculative_tokens=8 explicitly.

Benchmark shape

For an apples-to-apples serve comparison, we used:

  • vllm bench serve

  • dataset: philschmid/mt-bench

  • num-prompts=80

  • max-concurrency=1

  • hf-output-len=2048

  • temperature=0
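The corresponding bench invocation was roughly the following. Flag names mirror the list above, but exact spellings can shift between vLLM versions, so check vllm bench serve --help against your build.

```shell
# Sketch of the benchmark invocation; parameters mirror the list above.
vllm bench serve \
  --model google/gemma-4-31B-it \
  --dataset-name hf \
  --dataset-path philschmid/mt-bench \
  --num-prompts 80 \
  --max-concurrency 1 \
  --hf-output-len 2048 \
  --temperature 0
```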

Observed results

Against a plain non-DFlash baseline on the same verifier and same harness:

  • baseline generation throughput: 5.53 tok/s average

  • DFlash generation throughput: 15.44 tok/s average

  • uplift: about 2.79x

  • DFlash average draft acceptance rate: 28.82%

  • observed DFlash generation range: 9.9 to 28.1 tok/s

  • observed DFlash acceptance range: 15.1% to 62.2%
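The uplift figure follows directly from the two averages:

```python
# Speedup implied by the two average throughputs reported above.
baseline_tok_s = 5.53
dflash_tok_s = 15.44
uplift = dflash_tok_s / baseline_tok_s
print(f"about {uplift:.2f}x")  # -> about 2.79x
```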

So the short version is: this draft does run on DGX Spark / GB10 with the stock Google verifier, but today it is not a zero-patch path in vLLM nightly. The key was splitting verifier and draft attention backends instead of trying to drive both through one backend.

The exact launchers, runtime patcher, and benchmark helper are in the repo above so others can reproduce the same path without any cluster-specific details.


I’ve been working on DFlash for LilaRest’s Gemma4 Turbo model for about a week, so it was nice to see RedHat drop this today. FYI: if you just want an easy 10 tok/s without DFlash, that’s the way to go.

2 Likes

Very nice! Does DFlash work with dual Sparks at TP=2?

Giving 2x a go now; if that works I’ll try 4x as well and post benchmarks when they’re ready.

Edit: I can confirm it works. I had to add some memory settings because Ray wanted to claim too much without them. It’s benching now; if I don’t post back for a while, I fell asleep. I’ll update the container and instructions as soon as they’re done and I’m conscious.

1 Like

This is not yet stable at TP=2. I’m getting a steady 18.5+ tok/s, sometimes much better, but Ray only stays up for ~10 requests before the dreaded ray.exceptions.RayChannelTimeoutError, and it took memory tuning just to get this far.

vLLM has quite a few problems that throw this error generically, and many related issues are still open. Trying the uncompiled DAG path got to about ~20 requests.

I’m still trying different things on this nightly build as a fix. I’m also trying TP=4 now, in the hope that it’s a generic memory-tuning issue that TP=4 might, ironically, resolve by letting Ray consume more memory rather than less, which should be possible at TP=4 without running out of memory.

The issue itself is not an OOM, but it may be related to the memory-tuning flags rather than memory itself. It’s a wild goose chase at this point until I can go through vLLM GitHub discussions, the nightly build code, and tracing.


TP=4 is more stable now, but still not stable: it failed at prompt 46 with TimeoutError: RPC call to sample_tokens timed out, followed by EngineDeadError.

200GbE but NCCL_IB_DISABLE=1. Unfortunately, performance appears capped at an overall average of 18.5-20 tok/s. This points to the earlier TP=2 timeouts being memory related despite not being OOMs: running TP=2 on this setup without memory tuning overcommits and fails, while the tuning needed to get past that stage introduces the timeout issues. So I’m troubleshooting the TP=4 issues and working backwards.

Looks like this might be highly relevant: [Bug]: RPC call to sample_tokens timed out. Qwen3.5-397B-A17B · Issue #35496 · vllm-project/vllm · GitHub – Trying fix now

Latest update: I have a bad DAC, so I’m going to pause this work until I get a replacement, as I have some other things I need to work on. I’ll come back to it, though.

Link error counters over ~15 seconds:

  • lane 1: 15201 → 20379

  • lane 2: 976 → 1270

  • lane 3: 15299 → 20485

For now, the cheapest, fastest, most stable way to run text-only 31B is still LilaRest’s Turbo model, which I’ve processed over 50K prompts on at 10 Tok/s (n=8) in TP=4 without a single issue.

Intel’s int4-AutoRound quant of Gemma4-31B-it gets 10-11 tokens/sec all day long and scales well with concurrency, at least up to 4. It would be a great starting point for DFlash.

I tried to map your patches to use int4-AutoRound as the base model with this drafter, but so far without success. int4 uses the MarlinLinearKernel for GPTQMarlinLinearMethod, and that path is not patched (your repo basically handles only the dynamic fp8 path with Triton).

It fails to start with ValueError: Selected backend AttentionBackendEnum.FLASH_ATTN is not valid for this configuration. Reason: ['head_size not supported']

I would really like to get drafting to work with Gemma4-31B. Would you consider making a patch for the int4-AutoRound path, @meanaverage?

Hi @joshua.dale.warner,

To be honest, I’m sort of wrapped up in another project, but I wanted to get back to you because this model (the Intel one) looks insane. Now I have something new to learn. If it really works at similar accuracy (and that’s always the if) while still providing the image abilities, that is really something.

I see Intel’s AutoRound models like this one are listed at 6B params vs. 33B for Google’s/LilaRest’s, which is confusing, since parameter count shouldn’t change from quantization alone. I’m not familiar with AutoRound or I32, though, and the Hugging Face model page doesn’t show any missing files, so this looks really impressive. That said, they don’t include any benchmarks or accuracy claims directly on the page.

As for FLASH_ATTN, the whole AI tinkerer world is basically self-patching both FlashInfer and FlashAttention to work with more models, in more quants, on more platforms. The error you’re seeing isn’t unusual; it’s the standard way you discover a configuration hasn’t been enabled officially. It’s usually the first thing you run into when patching, before you have to patch another two or three dozen things.

I’ll give it a try tomorrow. How important is the vision part to you? :) It’s much easier to turn it off than to try to work around it, surely.

HF is just wrong about int4-AutoRound quant sizes. It’s the full model at W4A16, with select tensors skipped and kept in BF16. We use these extensively. Right now, int4 with the Marlin backend is generally the optimal mix of speed and performance.

1 Like