Gemma 4 Models - which vLLM version? Any PRs spotted?

Maybe it’s just too early to test, but the official announcement also mentions vLLM in its list of inference servers.

The blog post refers to the regular vLLM playbook, which points to vLLM 26.02; that version doesn’t ship Transformers v5.5.0, which is needed for Gemma 4.
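In principle you can bump Transformers inside that image yourself; a minimal sketch, assuming the playbook container uses uv against the system Python (as eugr’s image does):

  # hypothetical: force-install the Transformers release Gemma 4 needs
  uv pip install --system "transformers>=5.5.0"
  # verify what actually got installed
  uv pip list | grep -i transformers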

But Transformers v5.5.0 doesn’t seem to be sufficient. I did a fresh rebuild of eugr’s edition:

 ./build-and-copy.sh --tf5 -t eugr/vllm-node:20260402-tf5

Which pulls in Transformers v5.5.0:

root@bb143232f90c:/workspace/vllm# uv pip list |grep transf
Using Python 3.12.3 environment at: /usr
transformers                             5.5.0

But…

(EngineCore pid=110) WARNING 04-02 17:05:46 [utils.py:188] TransformersMultiModalMoEForCausalLM has no vLLM implementation, falling back to Transformers implementation. Some features may not be supported and performance may not be optimal.

It seems a vLLM-side piece is missing; the fallback to Transformers ends with:

(EngineCore pid=110) ERROR 04-02 17:06:05 [core.py:1108]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/transformers/base.py", line 218, in _patch_config
(EngineCore pid=110) ERROR 04-02 17:06:05 [core.py:1108]     if sub_config.dtype != (dtype := self.config.dtype):
(EngineCore pid=110) ERROR 04-02 17:06:05 [core.py:1108]        ^^^^^^^^^^^^^^^^
(EngineCore pid=110) ERROR 04-02 17:06:05 [core.py:1108] AttributeError: 'NoneType' object has no attribute 'dtype'
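Since the crash is vLLM failing to read a dtype off one of the nested HF sub-configs, one thing that might be worth a try (purely a guess on my part, it may well not help) is pinning the dtype explicitly instead of letting it resolve automatically:

  # speculative workaround: set an explicit dtype (model id is a placeholder)
  vllm serve <gemma-4-model-id> --dtype bfloat16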

I tried the 26B version:

Ok. NVIDIA also recommends the NIM:

Which is amd64 only for now.
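For anyone on an amd64 box who wants to try it anyway, the standard NIM run pattern should apply; a sketch, using the image tag mentioned further down in this thread and the usual NIM cache path and port:

  export NGC_API_KEY=<your key>
  docker run --rm --gpus all \
    -e NGC_API_KEY \
    -v ~/.cache/nim:/opt/nim/.cache \
    -p 8000:8000 \
    nvcr.io/nim/google/gemma-4-31b-it:latest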

So. Has anyone seen an open PR for vLLM? :-D

llama.cpp has already gotten its support. That would be my fallback.
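If it comes to that, the fallback itself is simple enough; a sketch, with the GGUF filename being a guess since I haven’t checked what’s actually published:

  # llama.cpp fallback (GGUF filename is hypothetical)
  llama-server -m gemma-4-31b-it-Q4_0.gguf -c 16384 -ngl 99 --port 8080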

First comparisons are in:

EDIT: the post has been removed from Reddit by the mods.

And the full-blown announcement:

This is from Luciano on the DeepMind DevRel team: feat(models): implement Google Gemma 4 architecture support (MoE, Multimodal, Reasoning, Tool-Use) by lucianommartins · Pull Request #38826 · vllm-project/vllm · GitHub – I am building eugr’s container with that and TF5 support right now and hope to get it up and running.
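For anyone who wants to do the same by hand rather than through the build script, pulling that PR into a local vLLM checkout is the usual GitHub dance (PR number from the link above):

  git clone https://github.com/vllm-project/vllm.git && cd vllm
  git fetch origin pull/38826/head:gemma4-pr
  git checkout gemma4-pr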

Ok let’s wait for Luciano then…

I built @eugr’s spark-vllm-docker using --apply-vllm-pr but was not able to get it to work yet. In the meantime the PR was merged, so I’ll try to spend some more time on this tomorrow once I’ve had some sleep.

The container complains about not knowing the gemma4 tool parser via Transformers. As far as I understand, we need Transformers 5.5.0, which should be installed given that I used the --pre-tf option.
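A quick sanity check inside the container should at least confirm the versions line up; roughly what I’m checking:

  # confirm the Transformers and vLLM versions the container actually uses
  python3 -c "import transformers; print(transformers.__version__)"
  python3 -c "import vllm; print(vllm.__version__)"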

vLLM has a gemma4 docker build.

also for arm64

I’ll run the build pipeline now so the Gemma 4 PRs are included.

Awesome !!!

They seem to be great models.

Looks like things are merged upstream, so @eugr’s ./build-and-copy.sh --rebuild-vllm --tf5 -c seems to work great. I quickly hacked up a recipe containing container: vllm-node-tf5 and vllm serve nvidia/Gemma-4-31B-IT-NVFP4 --tool-call-parser gemma4 --reasoning-parser gemma4 \ ... and it booted up. Inference was dreadfully slow, so I’m sure I’m missing something; I’ll wait for the experts to jump in, ha! (I’m still learning the ropes, apologies in advance.)

Edit: Switched to NV quant.
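For the record, the full command looks roughly like this; everything beyond the two parser flags is just a guess I’m playing with, nothing tuned:

  vllm serve nvidia/Gemma-4-31B-IT-NVFP4 \
    --tool-call-parser gemma4 \
    --reasoning-parser gemma4 \
    --max-model-len 16384 \
    --gpu-memory-utilization 0.90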

The official gemma4 build loads the 26B model, but complains about the chat template when using the example args:

(APIServer pid=10) vllm.entrypoints.chat_utils.ChatTemplateResolutionError: As of transformers v4.44, default chat template is no longer allowed, so you must provide a chat template if the tokenizer does not define one.

When I add the one included in the repo, I get:

33c51c2dbf7e7494b2c00505050505050505050505050505050505050505050505050505050505050505050505050505050505050505050505050505050505050505050505050505050505050505050505050505050505050505050505050505050505050505050505050505050505050505050505050505050505050505050505050505050505050505050505050505050505050505050505050505050505050505050505050505050505050...
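For context, “adding the one included in the repo” means pointing vLLM at the template file explicitly; a sketch (model id and template filename guessed, adjust to what the repo actually ships):

  vllm serve google/gemma-4-26B-it --chat-template ./chat_template.jinja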

Interesting. I’ll wait for eugr’s build.

Anyone running the NIM gemma4 (nvcr.io/nim/google/gemma-4-31b-it:latest)?
It says it should run on one H100, but it fails for me (not enough VRAM for the KV cache).

How do I pass arguments to vLLM?

I tried this, but it has no effect:

  - "--tensor-parallel-size"

  - "1"

  - "--max-model-len"

  - "16384"

  - "--gpu-memory-utilization"

  - "0.95"

My guess is the official NIM is serving the official NVIDIA NVFP4 quant, which was just released:

nvidia/Gemma-4-31B-IT-NVFP4

Worth trying with eugr’s repo (Marlin or CUTLASS backends, whichever works)

Hello,

llm-benchy results for the FP16 IT version of the dense model gemma-4-31B-it; not optimized of course, just a reference:
| Parameter | Value |
|---|---|
| Model | google/gemma-4-31B-it |
| Size | 31B params, bf16 (no quantization) |
| Context window | 262144 (256K) |
| KV cache dtype | fp8 |
| Concurrency | 1 (single request) |
| pp tested | 128, 512, 2048 tokens |
| tg tested | 128 tokens |
| depth | 0 (no prior context) |
| Runs | 3 per config |
| Prefix caching | enabled |
| Chunked prefill | enabled, max 4096 batched tokens |
| max_num_seqs | 4 |
| gpu_memory_utilization | 0.70 |
| Backend | TRITON_ATTN (forced by official image for Gemma4 heterogeneous head dims) |

| model | test | t/s | peak t/s | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |
|---|---|---|---|---|---|---|
| google/gemma-4-31B-it | pp128 | 244.52 ± 46.34 | | 546.74 ± 90.59 | 543.35 ± 90.59 | 546.84 ± 90.60 |
| google/gemma-4-31B-it | tg128 | 3.66 ± 0.06 | 4.00 ± 0.00 | | | |
| google/gemma-4-31B-it | pp512 | 757.46 ± 67.04 | | 686.35 ± 64.44 | 682.96 ± 64.44 | 686.43 ± 64.44 |
| google/gemma-4-31B-it | tg128 | 3.70 ± 0.00 | 4.00 ± 0.00 | | | |
| google/gemma-4-31B-it | pp2048 | 1066.35 ± 47.69 | | 1928.86 ± 88.87 | 1925.48 ± 88.87 | 1928.96 ± 88.87 |
| google/gemma-4-31B-it | tg128 | 3.67 ± 0.00 | 4.00 ± 0.00 | | | |

Will try this…

Cannot wait to see the benchmarks for nvidia/Gemma-4-31B-IT-NVFP4 vs google/gemma-4-31B-it vs google/gemma-4-31B on a Spark.

Benchmark results look very similar, percentage-wise, to Qwen/Qwen3.5-35B-A3B on the surface, with 0-day features :-). So maybe we can expect ~50ish tokens/sec as well on the Spark Arena LLM Leaderboard?

Not with the dense 31B.

The 26B-A4B definitely has that kind of potential, though.

We have just updated our vLLM and llama.cpp playbooks to use Gemma4 on DGX Spark. Check them out here: VLLM, Llama.cpp

OK, the new build is complete. If anyone wants to test it with Gemma - please do. If not, I’ll do it later today when I have free time.

I tried both @eugr’s vllm and the recipe published by NVIDIA. Both of them failed with the same error: (APIServer pid=41) TypeError: Gemma4ToolParser.__init__() takes 2 positional arguments but 3 were given. I am using --tool-call-parser gemma4 \

--reasoning-parser gemma4

A PR to fix that was recently merged - [Bug]: Gemma4ToolParser.__init__() missing `tools` parameter — 400 error on tool calls · Issue #38837 · vllm-project/vllm · GitHub
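Once that fix is in a published build, rebuilding the container the same way as before should pick it up:

  ./build-and-copy.sh --rebuild-vllm --tf5 -c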

Looks like this will come in BF16 (16-bit), SFP8 (8-bit), and Q4_0 (4-bit) varieties…

https://ai.google.dev/gemma/docs/core