I am quite impressed by the Gemma4 MTP draft models. There is some clever stuff going on in how they are wired in with sliding and full attention, keeping acceptance high even with long context and difficult text - where most drafters have trouble.
My baseline is ebircak/gemma-4-31B-it-4bit-NVFP4A16-GPTQ which is a third smaller than Nvidia’s NVFP4 quant and in my testing basically equivalent (better than multiple NVFP4 alternatives, including Red Hat’s) while retaining multimodality. This one is calibrated with 4M tokens. I see average acceptance rates above 30% for "num_speculative_tokens":7 even for my highly complex domain text on prompts over 30k. This leapfrogs the model from about 11 tok/s to 19-21 effective, and suggests this quant is quite well done.
For reference, on these documents DFlash falls apart at about the 2nd token position and Qwen3.6 is optimal at 3. With this rate of acceptance, the actual throughput is better than fully optimized Qwen3.6-27B with MTP=3 (further magnified with about 40% less thinking).
On benchmarks, code, and especially structured output it is even better, as you would expect. About 40 tok/s is nearly a 300% increase from baseline:
╔══════════════════════════════════════════════════════╗
║ Benchmark: Gemma4-31b-it — 2026-05-07 00:16
╚══════════════════════════════════════════════════════╝
Warm-up... done
── Sequential (1 request) ──────────────────────────────
Run 1/2:
[Q&A ] 256 tokens in 6.54s = 39.1 tok/s
[Code ] 512 tokens in 12.08s = 42.3 tok/s
[JSON ] 1024 tokens in 19.69s = 52.0 tok/s
[Math ] 32 tokens in 1.03s = 30.9 tok/s
[LongCode ] 2048 tokens in 49.74s = 41.1 tok/s
Run 2/2:
[Q&A ] 256 tokens in 6.55s = 39.0 tok/s
[Code ] 512 tokens in 12.12s = 42.2 tok/s
[JSON ] 1024 tokens in 19.69s = 51.9 tok/s
[Math ] 32 tokens in 1.04s = 30.7 tok/s
[LongCode ] 2048 tokens in 49.84s = 41.0 tok/s
── Concurrent (4 parallel requests) ───────────────────────────
Sending 4 requests simultaneously, measuring total throughput...
[req1 ] 1024 tokens = 30.3 tok/s (end-to-end)
[req2 ] 1024 tokens = 31.3 tok/s (end-to-end)
[req3 ] 1024 tokens = 29.6 tok/s (end-to-end)
[req4 ] 1024 tokens = 31.5 tok/s (end-to-end)
Total: 4096 tokens in 34.59s
Total throughput: 118.3 tok/s (4 requests completed)
Rebuild spark-vllm-docker with the --rebuild-vllm flag to pick up the recently merged Gemma4 MTP code, then here is my startup (1.2 million tokens available in KV cache; I run the container detached in case of disconnect):
#!/bin/bash
~/containers/spark-vllm-docker/launch-cluster.sh -t vllm-node-tf5 --solo -d \
exec vllm serve ebircak/gemma-4-31B-it-4bit-NVFP4A16-GPTQ \
--served-model-name Gemma4-31b-it \
--max-model-len 262144 \
--gpu-memory-utilization 0.85 \
--port 8000 \
--host 0.0.0.0 \
--max-num-seqs 4 \
--quantization compressed-tensors \
--kv-cache-dtype fp8_e4m3 \
--max-num-batched-tokens 16384 \
--enable-prefix-caching \
--enable-chunked-prefill \
--enable-auto-tool-choice \
--tool-call-parser gemma4 \
--reasoning-parser gemma4 \
--load-format instanttensor \
--default-chat-template-kwargs '{"enable_thinking": true}' \
--speculative-config '{"method":"mtp","model":"google/gemma4-31b-it-assistant","num_speculative_tokens":7}'