Gemma 4 MTP

I'm trying to run Google Gemma 4 MTP.

The blog mentions vLLM support, so I tried a recent vLLM build (and eugr’s fork as well), but hit this:

Value error, The checkpoint you are trying to load has model type `gemma4_assistant` but Transformers does not recognize this architecture. This could be because of an issue with the checkpoint, or because your version of Transformers is out of date.

Looks like Transformers doesn’t support gemma4_assistant yet.
Has anyone managed to launch this with vLLM?

See here, ongoing discussion: Gemma4 draft models are now available

You need a specific PR if you want to play with them right now.

@takashii, I got Gemma 4 + MTP running earlier today against nvidia/Gemma-4-26B-A4B-NVFP4 paired with google/gemma-4-26B-A4B-it-assistant. Posting the working recipe in case it saves anyone else the round trip.

The PR jwarner mentioned is vllm-project/vllm#41745 (Gemma4 MTP speculative decoding). It got its first non-bot APPROVED review about 30 min ago, so it should land soon. In the meantime, three things need to be in place before MTP loads on the NVFP4 build:

1. Transformers needs to be on main, not the released 5.7.0

The gemma4_assistant model_type isn’t in any tagged transformers release yet. That’s the exact error you hit. Upgrade inside your vLLM image:

docker run --rm -d --name vllm-tx-fix --entrypoint /bin/bash <your-vllm-image> -c 'sleep 600'
docker exec vllm-tx-fix pip install -U \
    "https://github.com/huggingface/transformers/archive/main.tar.gz"
docker commit \
    --change 'ENTRYPOINT ["vllm","serve"]' \
    --change 'WORKDIR /vllm-workspace' \
    vllm-tx-fix <your-vllm-image>
docker rm -f vllm-tx-fix

That gives you transformers 5.8.0.dev0 with gemma4_assistant registered. The tarball install avoids needing git inside the container.
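
If you want a quick sanity check before rebuilding anything, you can ask the committed image whether the new model type is registered. The bare python invocation and the CONFIG_MAPPING_NAMES lookup are assumptions about the image layout and transformers internals; adjust if yours differ:

# Optional: prints the transformers version plus True/False for gemma4_assistant
# registration (assumes `python` is on PATH inside the image).
docker run --rm --entrypoint python <your-vllm-image> -c \
    'import transformers; from transformers.models.auto.configuration_auto import CONFIG_MAPPING_NAMES; print(transformers.__version__, "gemma4_assistant" in CONFIG_MAPPING_NAMES)'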

2. Pull the latest PR head, not the original

Two bugs in the initial PR commit blocked NVFP4 + assistant pairing. Both got fixed today after testing surfaced them:

  • intermediate_size was being read from the top-level config instead of text_config, so the MLP got constructed at half size (4096 vs 8192). Fixed in commit c43152713e.
  • quant_config from the NVFP4 target was propagating into the draft’s BF16 Linears, so vLLM allocated the draft’s MLP/Q/O params with NVFP4 packing applied while the assistant ships unpacked BF16 weights. Fixed in commit 5119058403.

Latest head as of this post is 5119058403. To check out the PR head locally:

cd ~/vllm
git fetch origin pull/41745/head:pr-41745-gemma4-mtp
git checkout pr-41745-gemma4-mtp
~/vllm-server/build/build-local-image.sh   # or your usual image build

When the PR officially merges (probably tomorrow), drop the PR branch; a fresh pull of main will get you the same code.
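
A quick way to confirm your checkout actually contains both fixes (the short hashes are the ones quoted above; each command exits 0 only if that commit is an ancestor of HEAD):

git merge-base --is-ancestor c43152713e HEAD && \
git merge-base --is-ancestor 5119058403 HEAD && \
echo "both MTP fixes present"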

3. MoE backend will auto-pick MARLIN

Every FlashInfer variant and VLLM_CUTLASS reject Gemma 4’s MoEActivation.GELU_TANH. Don’t pass --moe-backend cutlass explicitly or you’ll hit a hard ValueError; just omit --moe-backend and auto-detect walks the list and lands on Marlin. NVIDIA’s own model card calls this out (Marlin or VLLM_CUTLASS, with Flashinfer-TRTLLM behind PR #41050, which merged but doesn’t appear to clear the activation rejections yet). Marlin runs the FP4 weights via decompress-to-FP16, so it carries some overhead, but it loads.
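
Once the server from the next section is up, you can confirm what auto-detect landed on by grepping the startup log. The exact log wording varies between vLLM builds, so treat the pattern below as a rough filter rather than the canonical message:

# Look for the MoE backend the loader picked (expect a Marlin mention).
docker logs vllm-gemma4-26b-server 2>&1 | grep -iE 'marlin|moe.*backend'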

Working launch (26B, TP=1)

docker run -d --name vllm-gemma4-26b-server --gpus all --ipc=host --network host \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    -v ~/.cache/vllm:/root/.cache/vllm \
    -v ~/.cache/flashinfer:/root/.cache/flashinfer \
    -e TORCH_CUDA_ARCH_LIST=12.1a \
    -e VLLM_SKIP_P2P_CHECK=1 \
    -e FLASHINFER_JIT_LOG_LEVEL=ERROR \
    <your-vllm-image-with-fixes> \
    nvidia/Gemma-4-26B-A4B-NVFP4 \
        --port 8000 --host 0.0.0.0 \
        --gpu-memory-utilization 0.55 \
        --kv-cache-dtype fp8 \
        --max-model-len 262144 \
        --max-num-seqs 4 \
        --max-num-batched-tokens 16384 \
        --tensor-parallel-size 1 \
        --enforce-eager \
        --trust-remote-code \
        --enable-chunked-prefill \
        --enable-prefix-caching \
        --no-enable-flashinfer-autotune \
        --tool-call-parser gemma4 \
        --reasoning-parser gemma4 \
        --enable-auto-tool-choice \
        --speculative-config '{"method":"mtp","model":"google/gemma-4-26B-A4B-it-assistant","num_speculative_tokens":4}'

--gpu-memory-utilization 0.55 gives ~26 GiB of KV cache (1.7M tokens), which covers 4 concurrent max-context slots (4 × 262,144 ≈ 1.05M tokens) with ~0.6x extra headroom on the unified-memory budget. Drop it a bit if you need more host memory free; the working set is comfortable down to ~0.50 too.
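
A minimal smoke test once the log shows the model is loaded. The served model name defaults to the HF id passed to vllm serve, so adjust the "model" field if you set --served-model-name:

# Simple chat completion against the OpenAI-compatible endpoint.
curl -s http://localhost:8000/v1/chat/completions \
    -H 'Content-Type: application/json' \
    -d '{
          "model": "nvidia/Gemma-4-26B-A4B-NVFP4",
          "messages": [{"role": "user", "content": "Say hello in five words."}],
          "max_tokens": 64
        }'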

Numbers from my run

SpecDecoding metrics lines from the engine, averaged across two bench runs:

  • Mean acceptance length: 3.68 / 4 tokens
  • Per-position acceptance: 0.85 / 0.72 / 0.62 / 0.51
  • Avg draft acceptance rate: 67-69%
  • Peak aggregate generation throughput: ~175 tok/s (vs ~104 tok/s without MTP)

End-to-end speedup over no-MTP baseline:

  • Sequential ×3 wall: 23.2 s → 9.9 s (2.34x)
  • Concurrent ×8 wall: 25.9 s → 13.6 s (1.91x)
  • Avg sequential tok/s: 23.2 → 42.6 (+84%)
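
If you want to pull the same SpecDecoding lines from your own run, the engine prints them periodically and they end up in the container log; the exact line prefix may differ across vLLM versions, so the grep pattern is a guess:

# Show the most recent speculative-decoding stats from the running container.
docker logs vllm-gemma4-26b-server 2>&1 | grep -i 'SpecDecoding' | tail -n 5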

For the longer context: yes, the BF16 assistant transfers cleanly to the NVFP4 target despite quantization noise. ~67% acceptance is well above the threshold where MTP is net-positive. benchislett posted comparable results on a B300 with the 31B model + γ=7 in the PR thread (350 vs 305 TPS).

Hope this saves you the bug-hunt I did. Happy to dig into specific failures if you hit something different.

There was a new Gemma4 vLLM docker image pushed a few hours ago; I presume it includes the PR: