@takashii, got Gemma 4 + MTP running earlier today against nvidia/Gemma-4-26B-A4B-NVFP4 paired with google/gemma-4-26B-A4B-it-assistant. Posting the working recipe in case it saves anyone else the round trip.
The PR jwarner mentioned is vllm-project/vllm#41745 (Gemma4 MTP speculative decoding). It got its first non-bot APPROVED review about 30 min ago, so it should land soon. In the meantime, three things need to be in place before MTP loads on the NVFP4 build:
1. Transformers needs to be on main, not the released 5.7.0
The gemma4_assistant model_type isn’t in any tagged transformers release yet. That’s the exact error you hit. Upgrade inside your vLLM image:
docker run --rm -d --name vllm-tx-fix --entrypoint /bin/bash <your-vllm-image> -c 'sleep 600'
docker exec vllm-tx-fix pip install -U \
"https://github.com/huggingface/transformers/archive/main.tar.gz"
docker commit \
--change 'ENTRYPOINT ["vllm","serve"]' \
--change 'WORKDIR /vllm-workspace' \
vllm-tx-fix <your-vllm-image>
docker rm -f vllm-tx-fix
That gives you transformers 5.8.0.dev0 with gemma4_assistant registered. The tarball install avoids needing git inside the container.
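To sanity-check that the commit stuck, print the version from inside the rebuilt image (overriding the vllm serve entrypoint):
docker run --rm --entrypoint python <your-vllm-image> \
-c 'import transformers; print(transformers.__version__)'
It should print 5.8.0.dev0.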
2. Pull the latest PR head, not the original
Two bugs in the initial PR commit blocked NVFP4 + assistant pairing. Both got fixed today after testing surfaced them:
- intermediate_size was being read from the top-level config instead of text_config, so the MLP got constructed at half size (4096 vs 8192). Fixed in commit c43152713e.
- quant_config from the NVFP4 target was propagating into the draft's BF16 Linears, so vLLM allocated the draft's MLP/Q/O params with NVFP4 packing applied while the assistant ships unpacked BF16 weights. Fixed in commit 5119058403.
Latest head as of this post is 5119058403. To fetch the PR head and build it locally:
cd ~/vllm
git fetch origin pull/41745/head:pr-41745-gemma4-mtp
git checkout pr-41745-gemma4-mtp
~/vllm-server/build/build-local-image.sh # or your usual image build
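Before building, it's worth confirming the checkout actually carries both fixes (the two hashes are the fix commits above):
cd ~/vllm
git log -1 --oneline # tip should be 5119058403
git merge-base --is-ancestor c43152713e HEAD && echo "intermediate_size fix present"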
When the PR officially merges (probably tomorrow), drop the PR branch; a fresh git pull on main will get you the same code.
3. MoE backend will auto-pick Marlin
Every FlashInfer variant and VLLM_CUTLASS reject Gemma 4's MoEActivation.GELU_TANH, so don't pass --moe-backend cutlass explicitly or you'll hit a hard ValueError; just omit --moe-backend and let auto-detect walk the list, which lands on Marlin. NVIDIA's own model card lists Marlin or VLLM_CUTLASS (with Flashinfer-TRTLLM gated behind PR #41050, which merged but doesn't appear to clear the activation rejections yet), but in practice CUTLASS hits the same rejection. Marlin runs the FP4 weights via decompress-to-FP16, so it carries some overhead, but it loads.
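If you'd rather pin the backend than trust the auto-detect walk, passing Marlin by name should select it directly; that value is an assumption on my part (I only ran the auto-detect path), so fall back to omitting the flag if it's rejected:
--moe-backend marlin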
Working launch (26B, TP=1)
docker run -d --name vllm-gemma4-26b-server --gpus all --ipc=host --network host \
-v ~/.cache/huggingface:/root/.cache/huggingface \
-v ~/.cache/vllm:/root/.cache/vllm \
-v ~/.cache/flashinfer:/root/.cache/flashinfer \
-e TORCH_CUDA_ARCH_LIST=12.1a \
-e VLLM_SKIP_P2P_CHECK=1 \
-e FLASHINFER_JIT_LOG_LEVEL=ERROR \
<your-vllm-image-with-fixes> \
nvidia/Gemma-4-26B-A4B-NVFP4 \
--port 8000 --host 0.0.0.0 \
--gpu-memory-utilization 0.55 \
--kv-cache-dtype fp8 \
--max-model-len 262144 \
--max-num-seqs 4 \
--max-num-batched-tokens 16384 \
--tensor-parallel-size 1 \
--enforce-eager \
--trust-remote-code \
--enable-chunked-prefill \
--enable-prefix-caching \
--no-enable-flashinfer-autotune \
--tool-call-parser gemma4 \
--reasoning-parser gemma4 \
--enable-auto-tool-choice \
--speculative-config '{"method":"mtp","model":"google/gemma-4-26B-A4B-it-assistant","num_speculative_tokens":4}'
--gpu-memory-utilization 0.55 gives ~26 GiB of KV cache (~1.7M tokens at fp8, i.e. ~16 KiB/token), which covers 4 concurrent max-context slots (4 × 262,144 ≈ 1.05M tokens) with ~0.6x extra headroom on the unified-memory budget. Drop it a bit if you need more host memory free; the working set is comfortable down to ~0.50 too.
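Once it's up, a quick smoke test against the standard OpenAI-compatible endpoint (the prompt is just a placeholder), plus a log grep to confirm the Marlin pick and watch the SpecDecoding metrics lines:
curl -s http://localhost:8000/v1/chat/completions \
-H 'Content-Type: application/json' \
-d '{"model":"nvidia/Gemma-4-26B-A4B-NVFP4","messages":[{"role":"user","content":"Say hello in five words."}],"max_tokens":32}'
docker logs vllm-gemma4-26b-server 2>&1 | grep -iE 'marlin|SpecDecoding'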
Numbers from my run
SpecDecoding metrics lines from the engine, averaged across two bench runs:
- Mean acceptance length: 3.68 / 4 tokens
- Per-position acceptance: 0.85 / 0.72 / 0.62 / 0.51
- Avg draft acceptance rate: 67-69%
- Peak aggregate generation throughput: ~175 tok/s (vs ~104 tok/s without MTP)
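Sanity check on those numbers: the per-position rates sum to 0.85 + 0.72 + 0.62 + 0.51 = 2.70 expected accepted drafts per step, and adding the bonus token the target emits each step gives 3.70, which lines up with the reported 3.68 mean acceptance length (assuming the engine's metric counts the bonus token, which these numbers suggest it does).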
End-to-end speedup over no-MTP baseline:
- Sequential ×3 wall: 23.2 s → 9.9 s (2.34x)
- Concurrent ×8 wall: 25.9 s → 13.6 s (1.91x)
- Avg sequential tok/s: 23.2 → 42.6 (+84%)
On the longer-context question: yes, the BF16 assistant transfers cleanly to the NVFP4 target despite the quantization noise, and ~67% acceptance is well above the threshold where MTP is net-positive. benchislett posted comparable results on a B300 with the 31B model + γ=7 in the PR thread (350 vs 305 TPS).
Hope this saves you the bug-hunt I did. Happy to dig into specific failures if you hit something different.