Gemma 4 Models - which vLLM version? Any PRs spotted?

Maybe it’s just too early to test, but the official announcement also mentions vLLM in its list of inference servers.

The blog post refers to the regular vLLM playbook, which points to vLLM 26.02; that version doesn’t ship Transformers v5.5.0, which is needed for Gemma 4.
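In principle you can bump Transformers inside that image yourself; a minimal sketch, assuming the playbook container uses uv against the system Python (as eugr’s image does):

  # hypothetical: force-install the Transformers release Gemma 4 needs
  uv pip install --system "transformers>=5.5.0"
  # verify what actually got installed
  uv pip list | grep -i transformers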

But Transformers v5.5.0 doesn’t seem to be sufficient. I did a fresh rebuild of eugr’s edition:

 ./build-and-copy.sh --tf5 -t eugr/vllm-node:20260402-tf5

Which pulls in Transformers v5.5.0:

root@bb143232f90c:/workspace/vllm# uv pip list |grep transf
Using Python 3.12.3 environment at: /usr
transformers                             5.5.0

But…

(EngineCore pid=110) WARNING 04-02 17:05:46 [utils.py:188] TransformersMultiModalMoEForCausalLM has no vLLM implementation, falling back to Transformers implementation. Some features may not be supported and performance may not be optimal.

It seems a vLLM-side piece is missing; the fallback to Transformers ends with:

(EngineCore pid=110) ERROR 04-02 17:06:05 [core.py:1108]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/transformers/base.py", line 218, in _patch_config
(EngineCore pid=110) ERROR 04-02 17:06:05 [core.py:1108]     if sub_config.dtype != (dtype := self.config.dtype):
(EngineCore pid=110) ERROR 04-02 17:06:05 [core.py:1108]        ^^^^^^^^^^^^^^^^
(EngineCore pid=110) ERROR 04-02 17:06:05 [core.py:1108] AttributeError: 'NoneType' object has no attribute 'dtype'
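Since the crash is vLLM failing to read a dtype off one of the nested HF sub-configs, one thing that might be worth a try (purely a guess on my part, it may well not help) is pinning the dtype explicitly instead of letting it resolve automatically:

  # speculative workaround: set an explicit dtype (model id is a placeholder)
  vllm serve <gemma-4-model-id> --dtype bfloat16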

I tried the 26B version:

Ok. NVIDIA also recommends the NIM:

Which is amd64 only for now.
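For anyone on an amd64 box who wants to try it anyway, the standard NIM run pattern should apply; a sketch, using the image tag mentioned further down in this thread and the usual NIM cache path and port:

  export NGC_API_KEY=<your key>
  docker run --rm --gpus all \
    -e NGC_API_KEY \
    -v ~/.cache/nim:/opt/nim/.cache \
    -p 8000:8000 \
    nvcr.io/nim/google/gemma-4-31b-it:latest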

So. Has anyone seen an open PR for vLLM? :-D

llama.cpp has already gotten its support. That would be my fallback.
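If it comes to that, the fallback itself is simple enough; a sketch, with the GGUF filename being a guess since I haven’t checked what’s actually published:

  # llama.cpp fallback (GGUF filename is hypothetical)
  llama-server -m gemma-4-31b-it-Q4_0.gguf -c 16384 -ngl 99 --port 8080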

First comparisons are in:

EDIT: the post has been removed from Reddit by the mods.

And the full-blown announcement:

This is from Luciano on the DeepMind DevRel team: feat(models): implement Google Gemma 4 architecture support (MoE, Multimodal, Reasoning, Tool-Use) by lucianommartins · Pull Request #38826 · vllm-project/vllm · GitHub – I am building eugr’s container with that and TF5 support right now and hope to get it up and running.
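For anyone who wants to do the same by hand rather than through the build script, pulling that PR into a local vLLM checkout is the usual GitHub dance (PR number from the link above):

  git clone https://github.com/vllm-project/vllm.git && cd vllm
  git fetch origin pull/38826/head:gemma4-pr
  git checkout gemma4-pr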

Ok let’s wait for Luciano then…

I built @eugr’s spark-vllm-docker using --apply-vllm-pr but was not able to get it to work yet. In the meantime the PR was merged, so I’ll try to spend some more time on this tomorrow once I’ve had some sleep.

The container complains about not knowing the gemma4 tool parser via Transformers. As far as I understand, we need Transformers 5.5.0, which should be installed given that I used the --pre-tf option.
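A quick sanity check inside the container should at least confirm the versions line up; roughly what I’m checking:

  # confirm the Transformers and vLLM versions the container actually uses
  python3 -c "import transformers; print(transformers.__version__)"
  python3 -c "import vllm; print(vllm.__version__)"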

vLLM has a gemma4 docker build.

also for arm64

I’ll run the build pipeline now so the Gemma 4 PRs are included.

Awesome !!!

They seem to be great models.

Looks like things are merged upstream, so @eugr’s ./build-and-copy.sh --rebuild-vllm --tf5 -c seems to work great. I quickly hacked up a recipe containing container: vllm-node-tf5 and vllm serve nvidia/Gemma-4-31B-IT-NVFP4 --tool-call-parser gemma4 --reasoning-parser gemma4 \ ... and it booted up. Inference was dreadfully slow, so I’m sure I’m missing something; I’ll wait for the experts to jump in, ha! (I’m still learning the ropes, apologies in advance.)

Edit: Switched to NV quant.
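For the record, the full command looks roughly like this; everything beyond the two parser flags is just a guess I’m playing with, nothing tuned:

  vllm serve nvidia/Gemma-4-31B-IT-NVFP4 \
    --tool-call-parser gemma4 \
    --reasoning-parser gemma4 \
    --max-model-len 16384 \
    --gpu-memory-utilization 0.90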

The official gemma4 build loads the 26B model, but complains about the chat template when using the example args:

(APIServer pid=10) vllm.entrypoints.chat_utils.ChatTemplateResolutionError: As of transformers v4.44, default chat template is no longer allowed, so you must provide a chat template if the tokenizer does not define one.

When I add the one included in the repo, I get:

33c51c2dbf7e7494b2c00505050505050505050505050505050505050505050505050505050505050505050505050505050505050505050505050505050505050505050505050505050505050505050505050505050505050505050505050505050505050505050505050505050505050505050505050505050505050505050505050505050505050505050505050505050505050505050505050505050505050505050505050505050505050...
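For context, “adding the one included in the repo” means pointing vLLM at the template file explicitly; a sketch (model id and template filename guessed, adjust to what the repo actually ships):

  vllm serve google/gemma-4-26B-it --chat-template ./chat_template.jinja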

Interesting. I’ll wait for eugr’s build.

Anyone running the NIM gemma4 (nvcr.io/nim/google/gemma-4-31b-it:latest)?
It says it should run on one H100, but it fails for me (not enough VRAM for the KV cache).

How do I pass arguments to vLLM?

I tried this, but it has no effect:

  - "--tensor-parallel-size"

  - "1"

  - "--max-model-len"

  - "16384"

  - "--gpu-memory-utilization"

  - "0.95"

My guess is the official NIM is serving the official NVIDIA NVFP4 quant, which was just released:

nvidia/Gemma-4-31B-IT-NVFP4

Worth trying with eugr’s repo (Marlin or CUTLASS backends, whichever works)

Hello,

llm-benchy results for the FP16 IT version of the dense model gemma-4-31B-it; not optimized of course, just a reference:
| Parameter | Value |
|---|---|
| Model | google/gemma-4-31B-it |
| Size | 31B params, bf16 (no quantization) |
| Context window | 262144 (256K) |
| KV cache dtype | fp8 |
| Concurrency | 1 (single request) |
| pp tested | 128, 512, 2048 tokens |
| tg tested | 128 tokens |
| depth | 0 (no prior context) |
| Runs | 3 per config |
| Prefix caching | enabled |
| Chunked prefill | enabled, max 4096 batched tokens |
| max_num_seqs | 4 |
| gpu_memory_utilization | 0.70 |
| Backend | TRITON_ATTN (forced by official image for Gemma4 heterogeneous head dims) |

| model | test | t/s | peak t/s | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |
|---|---|---|---|---|---|---|
| google/gemma-4-31B-it | pp128 | 244.52 ± 46.34 | | 546.74 ± 90.59 | 543.35 ± 90.59 | 546.84 ± 90.60 |
| google/gemma-4-31B-it | tg128 | 3.66 ± 0.06 | 4.00 ± 0.00 | | | |
| google/gemma-4-31B-it | pp512 | 757.46 ± 67.04 | | 686.35 ± 64.44 | 682.96 ± 64.44 | 686.43 ± 64.44 |
| google/gemma-4-31B-it | tg128 | 3.70 ± 0.00 | 4.00 ± 0.00 | | | |
| google/gemma-4-31B-it | pp2048 | 1066.35 ± 47.69 | | 1928.86 ± 88.87 | 1925.48 ± 88.87 | 1928.96 ± 88.87 |
| google/gemma-4-31B-it | tg128 | 3.67 ± 0.00 | 4.00 ± 0.00 | | | |

Will try this…

Cannot wait to see the benchmarks for nvidia/Gemma-4-31B-IT-NVFP4 vs google/gemma-4-31B-it vs google/gemma-4-31B on a Spark.

Benchmark results look very similar, percentage-wise, to Qwen/Qwen3.5-35B-A3B on the surface, with 0-day features :-). So maybe we can expect ~50ish tokens/sec as well on the Spark Arena LLM Leaderboard?

Not with the dense 31B.

The 26B-A4B definitely has that kind of potential, though.

We have just updated our vLLM and llama.cpp playbooks to use Gemma4 on DGX Spark. Check them out here: VLLM, Llama.cpp

OK, the new build is complete. If anyone wants to test it with Gemma - please do. If not, I’ll do it later today when I have free time.

I tried both @eugr’s vllm and the recipe published by NVIDIA. Both of them failed with the same error: (APIServer pid=41) TypeError: Gemma4ToolParser.__init__() takes 2 positional arguments but 3 were given. I am using --tool-call-parser gemma4 \

--reasoning-parser gemma4

A PR to fix that was recently merged - [Bug]: Gemma4ToolParser.__init__() missing `tools` parameter — 400 error on tool calls · Issue #38837 · vllm-project/vllm · GitHub
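Once that fix is in a published build, rebuilding the container the same way as before should pick it up:

  ./build-and-copy.sh --rebuild-vllm --tf5 -c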

Looks like this will come in BF16 (16-bit), SFP8 (8-bit), and Q4_0 (4-bit) varieties…

https://ai.google.dev/gemma/docs/core