Gemma4 draft models are now available

Google has updated the Gemma4 releases with “-assistant” models, which are their official take on MTP drafters! These work differently enough from existing drafters that they require a slightly new framework.

They are smaller in parameter count than DFlash block diffusion models - the 31B drafter is under 1GB at full weight - and are multimodal. At release the only interface supported is HuggingFace Transformers but a Google engineer has a PR in at vLLM as of an hour ago with support there. We should be able to pull this PR into spark-vllm-docker to play with them: [Spec Decode] Add Gemma4 MTP speculative decoding support by lucianommartins · Pull Request #41745 · vllm-project/vllm

The vLLM PR states that on an H100, the drafter more than tripled throughput for Gemma4-31B. Gains are more modest on the smaller models. Also of note: drafting at TP>1 should be supported.

One of the most interesting features of this release is a heuristic that controls the number of drafted tokens dynamically. This is supported in HuggingFace Transformers. In theory, it should let the drafter’s host scale the number of drafted tokens per cycle up or down based on the acceptance rate of the prior cycle. I think that should probably become standard everywhere. I had a very brief look, but it is not yet clear to me whether the vLLM PR above includes the heuristic draft length.
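
To make the idea concrete, here is a minimal sketch of such a controller. It is not taken from Google's MTP docs or from the vLLM PR. The grow-by-2 / shrink-by-1 rule is loosely modeled on how I understand the "heuristic" schedule for assisted generation in Transformers to behave, and the class name, starting length, and bounds are all made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class DraftLengthController:
    """Hypothetical controller for the number of tokens drafted per cycle."""

    num_draft_tokens: int = 4  # starting draft length (assumed value)
    min_tokens: int = 1        # never draft fewer than this
    max_tokens: int = 16       # never draft more than this (assumed cap)
    grow_by: int = 2           # added when every drafted token was accepted
    shrink_by: int = 1         # removed when at least one draft was rejected

    def update(self, num_accepted: int) -> int:
        """Adjust the draft length using the previous cycle's accepted count."""
        if num_accepted >= self.num_draft_tokens:
            # The drafter was fully correct, so try drafting further ahead.
            self.num_draft_tokens = min(
                self.max_tokens, self.num_draft_tokens + self.grow_by
            )
        else:
            # Some drafts were wasted, so back off to save drafter compute.
            self.num_draft_tokens = max(
                self.min_tokens, self.num_draft_tokens - self.shrink_by
            )
        return self.num_draft_tokens


controller = DraftLengthController()
# Accepted-token counts for five consecutive cycles (invented numbers).
history = [controller.update(n) for n in (4, 6, 3, 2, 6)]
print(history)
```

With those invented counts the draft length goes 6, 8, 7, 6, 8: it climbs while everything is accepted and backs off one token at a time after a rejection.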

Google’s official MTP documentation page is here, linking to the section where that heuristic is discussed: https://ai.google.dev/gemma/docs/mtp/mtp#draft_tokens

I thought DFlash also supported multimodality? And the amount of memory usage is not a big deal when we have 128GB for such a small model. DFlash seems like it will still be the better solution here.

I’m excited about the possibility of Gemma 4’s MTP support in more memory-constrained situations, like running on my smartphone.

Single Spark Inference - vLLM needs this PR: [Spec Decode] Add Gemma4 MTP speculative decoding support by lucianommartins · Pull Request #41745 · vllm-project/vllm · GitHub

Without MTP:

| model                               |           test |            t/s |     peak t/s |        ttfr (ms) |     est_ppt (ms) |    e2e_ttft (ms) |
| :---------------------------------- | -------------: | -------------: | -----------: | ---------------: | ---------------: | ---------------: |
| Intel/gemma-4-31B-it-int4-AutoRound |         pp2048 | 869.53 ± 11.65 |              |  2360.64 ± 31.35 |  2357.63 ± 31.35 |  2360.64 ± 31.35 |
| Intel/gemma-4-31B-it-int4-AutoRound |          tg128 |   11.43 ± 0.00 | 12.00 ± 0.00 |                  |                  |                  |
| Intel/gemma-4-31B-it-int4-AutoRound | pp2048 @ d4096 |  835.68 ± 5.08 |              |  7355.80 ± 44.70 |  7352.79 ± 44.70 |  7355.80 ± 44.70 |
| Intel/gemma-4-31B-it-int4-AutoRound |  tg128 @ d4096 |   11.30 ± 0.00 | 12.00 ± 0.00 |                  |                  |                  |
| Intel/gemma-4-31B-it-int4-AutoRound | pp2048 @ d8192 |  791.49 ± 2.07 |              | 12941.54 ± 34.43 | 12938.53 ± 34.43 | 12941.54 ± 34.43 |
| Intel/gemma-4-31B-it-int4-AutoRound |  tg128 @ d8192 |   11.13 ± 0.00 | 12.00 ± 0.00 |                  |                  |                  |

llama-benchy (0.3.7)
date: 2026-05-05 18:55:40 | latency mode: api

With MTP:


| model                               |           test |            t/s |     peak t/s |        ttfr (ms) |     est_ppt (ms) |    e2e_ttft (ms) |
| :---------------------------------- | -------------: | -------------: | -----------: | ---------------: | ---------------: | ---------------: |
| Intel/gemma-4-31B-it-int4-AutoRound |         pp2048 | 828.11 ± 13.47 |              |  2478.03 ± 40.71 |  2474.96 ± 40.71 |  2478.03 ± 40.71 |
| Intel/gemma-4-31B-it-int4-AutoRound |          tg128 |   22.05 ± 2.26 | 28.00 ± 2.16 |                  |                  |                  |
| Intel/gemma-4-31B-it-int4-AutoRound | pp2048 @ d4096 |  794.26 ± 0.33 |              |   7740.21 ± 2.58 |   7737.14 ± 2.58 |   7740.21 ± 2.58 |
| Intel/gemma-4-31B-it-int4-AutoRound |  tg128 @ d4096 |   18.40 ± 0.34 | 23.33 ± 0.47 |                  |                  |                  |
| Intel/gemma-4-31B-it-int4-AutoRound | pp2048 @ d8192 |  757.44 ± 4.19 |              | 13524.48 ± 74.56 | 13521.41 ± 74.56 | 13524.48 ± 74.56 |
| Intel/gemma-4-31B-it-int4-AutoRound |  tg128 @ d8192 |   17.39 ± 1.02 | 22.33 ± 0.94 |                  |                  |                  |

llama-benchy (0.3.7)
date: 2026-05-05 19:14:59 | latency mode: api
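
For anyone who wants the ratios rather than the raw numbers, this small script divides the "with MTP" mean t/s by the "without MTP" mean for each test. The values are copied from the two tables above; the ± spread is ignored, so treat the results as approximate.

```python
# Mean throughput (t/s) per test, copied from the tables above.
baseline = {
    "pp2048": 869.53,
    "tg128": 11.43,
    "pp2048 @ d4096": 835.68,
    "tg128 @ d4096": 11.30,
    "pp2048 @ d8192": 791.49,
    "tg128 @ d8192": 11.13,
}
with_mtp = {
    "pp2048": 828.11,
    "tg128": 22.05,
    "pp2048 @ d4096": 794.26,
    "tg128 @ d4096": 18.40,
    "pp2048 @ d8192": 757.44,
    "tg128 @ d8192": 17.39,
}

# Ratio above 1.0 means MTP was faster for that test.
ratios = {test: with_mtp[test] / baseline[test] for test in baseline}

for test, ratio in ratios.items():
    print(f"{test:>15}: {ratio:.2f}x")
```

Decode (tg128) comes out at roughly 1.9x with no prior context and about 1.6x at depths 4096 and 8192, while prefill (pp2048) is about 4 to 5% slower with the drafter loaded.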

Did you have to set the number of drafted tokens? I’m curious if the heuristic mode is included in that PR, either automatically or by a different name.

Wrote a brief post here.

TL;DR:

vllm serve Intel/gemma-4-31B-it-int4-AutoRound \
    --host 0.0.0.0 \
    --port 8080 \
    --max-model-len 262144 \
    --max-num-seqs 4 \
    --max-num-batched-tokens 16384 \
    --gpu-memory-utilization 0.9 \
    --enable-prefix-caching \
    --enable-chunked-prefill \
    --enable-auto-tool-choice \
    --trust-remote-code \
    --tool-call-parser gemma4 \
    --reasoning-parser gemma4 \
    --load-format fastsafetensors \
    --dtype auto \
    --kv-cache-dtype fp8 \
    --speculative-config '{"model": "google/gemma-4-31B-it-assistant", "num_speculative_tokens": 2, "method": "gemma4_mtp"}' \
    --tensor-parallel-size 1
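
If you launch from a script and want to vary the draft length, hand-editing the quoted JSON gets error-prone. A small sketch of generating that argument in Python follows; the field names and values are the ones from the command above, and only the helper name `speculative_config_arg` is invented here.

```python
import json
import shlex


def speculative_config_arg(model: str, num_speculative_tokens: int, method: str) -> str:
    """Return a shell-safe `--speculative-config ...` fragment for vllm serve."""
    if num_speculative_tokens < 1:
        raise ValueError("num_speculative_tokens must be at least 1")
    config = {
        "model": model,
        "num_speculative_tokens": num_speculative_tokens,
        "method": method,
    }
    # json.dumps guarantees valid JSON; shlex.quote protects it from the shell.
    return "--speculative-config " + shlex.quote(json.dumps(config))


arg = speculative_config_arg("google/gemma-4-31B-it-assistant", 2, "gemma4_mtp")
print(arg)
```

The printed fragment can be pasted into the command in place of the hand-written `--speculative-config` line.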

You will need this PR and a fresh build of the transformers wheel.

I hope that PR is followed shortly with a heuristic mode PR where the number of tokens can be dynamic in vLLM.

Meanwhile, literally yesterday Google posted a blog regarding the use of DFlash drafting on their cloud hardware versus EAGLE: Supercharging LLM inference on Google TPUs: Achieving 3X speedups with diffusion-style speculative decoding - Google Developers Blog

That has head-to-head comparisons, but it feels like we need a third competitor. Z-lab’s DFlash drafters for 31B and 26B-A4B are both on HF, but still gated, so we can’t play with them yet. They must be pretty much done training, though, for that blog post to have gone live.

Official DFlash drafters are now publicly available! Head-to-head comparisons on Spark are now possible.

I am quite impressed by the Gemma4 MTP draft models. There is some clever stuff going on in how they are wired in with sliding and full attention, keeping acceptance high even with long context and difficult text - where most drafters have trouble.

My baseline is ebircak/gemma-4-31B-it-4bit-NVFP4A16-GPTQ, which is a third smaller than Nvidia’s NVFP4 quant and in my testing basically equivalent (and better than several NVFP4 alternatives, including Red Hat’s) while retaining multimodality. This one is calibrated with 4M tokens. I see average acceptance rates above 30% with "num_speculative_tokens":7, even for my highly complex domain text on prompts over 30k tokens. That takes the model from about 11 tok/s to an effective 19-21 tok/s, and suggests this quant is quite well done.
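
As a rough sanity check on those numbers, here is a back-of-the-envelope sketch. It assumes "acceptance rate" means accepted draft tokens divided by drafted tokens (my understanding of how vLLM reports it), and that each verification step also emits one token from the target model. The function name is made up, and the result is an upper bound that ignores the drafter's own cost.

```python
def mean_tokens_per_step(acceptance_rate: float, num_speculative_tokens: int) -> float:
    """Average tokens emitted per target-model verification step.

    Assumes acceptance_rate = accepted draft tokens / drafted tokens, plus one
    token per step that the target model always produces itself.
    """
    if not 0.0 <= acceptance_rate <= 1.0:
        raise ValueError("acceptance_rate must be between 0 and 1")
    if num_speculative_tokens < 0:
        raise ValueError("num_speculative_tokens must be non-negative")
    return 1.0 + acceptance_rate * num_speculative_tokens


# Figures from the post above: 30% acceptance with 7 drafted tokens.
ideal = mean_tokens_per_step(0.30, 7)

# Observed speedup range from the post: about 11 tok/s up to 19-21 tok/s.
observed_low, observed_high = 19 / 11, 21 / 11

print(f"ideal tokens per step: {ideal:.1f}")
print(f"observed speedup: {observed_low:.2f}x to {observed_high:.2f}x")
```

Under those assumptions each step yields about 3.1 tokens, while the observed speedup is roughly 1.7x to 1.9x. The gap would be the time spent running the drafter and verifying seven positions per step.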

For reference, on these documents DFlash falls apart at about the 2nd token position and Qwen3.6 is optimal at 3. With this rate of acceptance, the actual throughput is better than fully optimized Qwen3.6-27B with MTP=3 (further magnified with about 40% less thinking).

On benchmarks, code, and especially structured output it is even better, as you would expect. About 40 tok/s is roughly 3.6x the ~11 tok/s baseline:

╔══════════════════════════════════════════════════════╗
║  Benchmark: Gemma4-31b-it  —  2026-05-07 00:16
╚══════════════════════════════════════════════════════╝

  Warm-up... done

── Sequential (1 request) ──────────────────────────────
  Run 1/2:
  [Q&A       ]   256 tokens in   6.54s = 39.1 tok/s
  [Code      ]   512 tokens in  12.08s = 42.3 tok/s
  [JSON      ]  1024 tokens in  19.69s = 52.0 tok/s
  [Math      ]    32 tokens in   1.03s = 30.9 tok/s
  [LongCode  ]  2048 tokens in  49.74s = 41.1 tok/s

  Run 2/2:
  [Q&A       ]   256 tokens in   6.55s = 39.0 tok/s
  [Code      ]   512 tokens in  12.12s = 42.2 tok/s
  [JSON      ]  1024 tokens in  19.69s = 51.9 tok/s
  [Math      ]    32 tokens in   1.04s = 30.7 tok/s
  [LongCode  ]  2048 tokens in  49.84s = 41.0 tok/s

── Concurrent (4 parallel requests) ───────────────────────────
  Sending 4 requests simultaneously, measuring total throughput...

  [req1 ]  1024 tokens = 30.3 tok/s (end-to-end)
  [req2 ]  1024 tokens = 31.3 tok/s (end-to-end)
  [req3 ]  1024 tokens = 29.6 tok/s (end-to-end)
  [req4 ]  1024 tokens = 31.5 tok/s (end-to-end)

  Total: 4096 tokens in 34.59s
  Total throughput: 118.3 tok/s (4 requests completed)
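
A quick check of the concurrent section, using only the figures printed above: total tokens over wall time gives the aggregate rate, and dividing that by the mean per-request rate shows how fully the four requests overlapped.

```python
# Figures copied from the concurrent section of the benchmark output above.
per_request_tok_s = [30.3, 31.3, 29.6, 31.5]
total_tokens = 4 * 1024
wall_time_s = 34.59

# Aggregate throughput across all four requests.
aggregate_tok_s = total_tokens / wall_time_s

# Average end-to-end rate seen by a single request.
mean_request_tok_s = sum(per_request_tok_s) / len(per_request_tok_s)

# A value close to 4 means the four requests ran almost fully in parallel.
overlap = aggregate_tok_s / mean_request_tok_s

print(f"aggregate: {aggregate_tok_s:.1f} tok/s")
print(f"mean per request: {mean_request_tok_s:.1f} tok/s")
print(f"aggregate / per-request: {overlap:.2f}")
```

The aggregate lands around 118 tok/s, matching the reported total to within rounding, and is about 3.9 times the mean per-request rate of roughly 30.7 tok/s.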

Rebuild spark-vllm-docker with the --rebuild-vllm flag to pick up the recently merged Gemma4 MTP code, then here is my startup (1.2 million tokens available in KV cache; I run the container detached in case of disconnect):

#!/bin/bash
~/containers/spark-vllm-docker/launch-cluster.sh -t vllm-node-tf5 --solo -d \
  exec vllm serve ebircak/gemma-4-31B-it-4bit-NVFP4A16-GPTQ \
  --served-model-name Gemma4-31b-it \
  --max-model-len 262144 \
  --gpu-memory-utilization 0.85 \
  --port 8000 \
  --host 0.0.0.0 \
  --max-num-seqs 4 \
  --quantization compressed-tensors \
  --kv-cache-dtype fp8_e4m3 \
  --max-num-batched-tokens 16384 \
  --enable-prefix-caching \
  --enable-chunked-prefill \
  --enable-auto-tool-choice \
  --tool-call-parser gemma4 \
  --reasoning-parser gemma4 \
  --load-format instanttensor \
  --default-chat-template-kwargs '{"enable_thinking": true}' \
  --speculative-config '{"method":"mtp","model":"google/gemma-4-31B-it-assistant","num_speculative_tokens":7}'