Gemma 4 MTP

I'm trying to run Google Gemma 4 MTP.

The blog mentions vLLM support, so I tried a recent vLLM build (and eugr’s fork as well), but hit this:

Value error, The checkpoint you are trying to load has model type `gemma4_assistant` but Transformers does not recognize this architecture. This could be because of an issue with the checkpoint, or because your version of Transformers is out of date.

Looks like Transformers doesn’t support gemma4_assistant yet.
Has anyone managed to launch this with vLLM?

See here, ongoing discussion: Gemma4 draft models are now available

You need a specific PR if you want to play with them right now.

@takashii, I got Gemma 4 + MTP running earlier today against nvidia/Gemma-4-26B-A4B-NVFP4 paired with google/gemma-4-26B-A4B-it-assistant. Posting the working recipe in case it saves anyone else the round trip.

The PR jwarner mentioned is vllm-project/vllm#41745 (Gemma4 MTP speculative decoding). It got its first non-bot APPROVED review about 30 min ago, so it should land soon. In the meantime, three things need to be in place before MTP loads on the NVFP4 build:

1. Transformers needs to be on main, not the released 5.7.0

The gemma4_assistant model_type isn’t in any tagged transformers release yet. That’s the exact error you hit. Upgrade inside your vLLM image:

docker run --rm -d --name vllm-tx-fix --entrypoint /bin/bash <your-vllm-image> -c 'sleep 600'
docker exec vllm-tx-fix pip install -U \
    "https://github.com/huggingface/transformers/archive/main.tar.gz"
docker commit \
    --change 'ENTRYPOINT ["vllm","serve"]' \
    --change 'WORKDIR /vllm-workspace' \
    vllm-tx-fix <your-vllm-image>
docker rm -f vllm-tx-fix

That gives you transformers 5.8.0.dev0 with gemma4_assistant registered. The tarball install avoids needing git inside the container.
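
If you want a quick sanity check before rebuilding anything, you can ask the committed image whether the new model type is registered. The bare python invocation and the CONFIG_MAPPING_NAMES lookup are assumptions about the image layout and transformers internals; adjust if yours differ:

# Optional: prints the transformers version plus True/False for gemma4_assistant
# registration (assumes `python` is on PATH inside the image).
docker run --rm --entrypoint python <your-vllm-image> -c \
    'import transformers; from transformers.models.auto.configuration_auto import CONFIG_MAPPING_NAMES; print(transformers.__version__, "gemma4_assistant" in CONFIG_MAPPING_NAMES)'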

2. Pull the latest PR head, not the original

Two bugs in the initial PR commit blocked NVFP4 + assistant pairing. Both got fixed today after testing surfaced them:

  • intermediate_size was being read from the top-level config instead of text_config, so the MLP got constructed at half size (4096 vs 8192). Fixed in commit c43152713e.
  • quant_config from the NVFP4 target was propagating into the draft’s BF16 Linears, so vLLM allocated the draft’s MLP/Q/O params with NVFP4 packing applied while the assistant ships unpacked BF16 weights. Fixed in commit 5119058403.

Latest head as of this post is 5119058403. To check out the PR head locally:

cd ~/vllm
git fetch origin pull/41745/head:pr-41745-gemma4-mtp
git checkout pr-41745-gemma4-mtp
~/vllm-server/build/build-local-image.sh   # or your usual image build

When the PR officially merges (probably tomorrow), drop the PR branch; a fresh pull of main will get you the same code.
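
A quick way to confirm your checkout actually contains both fixes (the short hashes are the ones quoted above; each command exits 0 only if that commit is an ancestor of HEAD):

git merge-base --is-ancestor c43152713e HEAD && \
git merge-base --is-ancestor 5119058403 HEAD && \
echo "both MTP fixes present"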

3. MoE backend will auto-pick MARLIN

Every FlashInfer variant and VLLM_CUTLASS reject Gemma 4’s MoEActivation.GELU_TANH. Don’t pass --moe-backend cutlass explicitly or you’ll hit a hard ValueError; just omit --moe-backend and auto-detect walks the list and lands on Marlin. NVIDIA’s own model card calls this out (Marlin or VLLM_CUTLASS, with Flashinfer-TRTLLM behind PR #41050, which merged but doesn’t appear to clear the activation rejections yet). Marlin runs the FP4 weights via decompress-to-FP16, so it carries some overhead, but it loads.
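
Once the server from the next section is up, you can confirm what auto-detect landed on by grepping the startup log. The exact log wording varies between vLLM builds, so treat the pattern below as a rough filter rather than the canonical message:

# Look for the MoE backend the loader picked (expect a Marlin mention).
docker logs vllm-gemma4-26b-server 2>&1 | grep -iE 'marlin|moe.*backend'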

Working launch (26B, TP=1)

docker run -d --name vllm-gemma4-26b-server --gpus all --ipc=host --network host \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    -v ~/.cache/vllm:/root/.cache/vllm \
    -v ~/.cache/flashinfer:/root/.cache/flashinfer \
    -e TORCH_CUDA_ARCH_LIST=12.1a \
    -e VLLM_SKIP_P2P_CHECK=1 \
    -e FLASHINFER_JIT_LOG_LEVEL=ERROR \
    <your-vllm-image-with-fixes> \
    nvidia/Gemma-4-26B-A4B-NVFP4 \
        --port 8000 --host 0.0.0.0 \
        --gpu-memory-utilization 0.55 \
        --kv-cache-dtype fp8 \
        --max-model-len 262144 \
        --max-num-seqs 4 \
        --max-num-batched-tokens 16384 \
        --tensor-parallel-size 1 \
        --enforce-eager \
        --trust-remote-code \
        --enable-chunked-prefill \
        --enable-prefix-caching \
        --no-enable-flashinfer-autotune \
        --tool-call-parser gemma4 \
        --reasoning-parser gemma4 \
        --enable-auto-tool-choice \
        --speculative-config '{"method":"mtp","model":"google/gemma-4-26B-A4B-it-assistant","num_speculative_tokens":4}'

--gpu-memory-utilization 0.55 gives ~26 GiB of KV cache (1.7M tokens), which covers 4 concurrent max-context slots (4 × 262,144 ≈ 1.05M tokens) with ~0.6x extra headroom on the unified-memory budget. Drop it a bit if you need more host memory free; the working set is comfortable down to ~0.50 too.
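
A minimal smoke test once the log shows the model is loaded. The served model name defaults to the HF id passed to vllm serve, so adjust the "model" field if you set --served-model-name:

# Simple chat completion against the OpenAI-compatible endpoint.
curl -s http://localhost:8000/v1/chat/completions \
    -H 'Content-Type: application/json' \
    -d '{
          "model": "nvidia/Gemma-4-26B-A4B-NVFP4",
          "messages": [{"role": "user", "content": "Say hello in five words."}],
          "max_tokens": 64
        }'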

Numbers from my run

SpecDecoding metrics lines from the engine, averaged across two bench runs:

  • Mean acceptance length: 3.68 / 4 tokens
  • Per-position acceptance: 0.85 / 0.72 / 0.62 / 0.51
  • Avg draft acceptance rate: 67-69%
  • Peak aggregate generation throughput: ~175 tok/s (vs ~104 tok/s without MTP)

End-to-end speedup over no-MTP baseline:

  • Sequential ×3 wall: 23.2 s → 9.9 s (2.34x)
  • Concurrent ×8 wall: 25.9 s → 13.6 s (1.91x)
  • Avg sequential tok/s: 23.2 → 42.6 (+84%)
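
If you want to pull the same SpecDecoding lines from your own run, the engine prints them periodically and they end up in the container log; the exact line prefix may differ across vLLM versions, so the grep pattern is a guess:

# Show the most recent speculative-decoding stats from the running container.
docker logs vllm-gemma4-26b-server 2>&1 | grep -i 'SpecDecoding' | tail -n 5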

For the longer context: yes, the BF16 assistant transfers cleanly to the NVFP4 target despite quantization noise. ~67% acceptance is well above the threshold where MTP is net-positive. benchislett posted comparable results on a B300 with the 31B model + γ=7 in the PR thread (350 vs 305 TPS).

Hope this saves you the bug-hunt I did. Happy to dig into specific failures if you hit something different.

There was a new Gemma4 vLLM docker image pushed a few hours ago; I presume it includes the PR: