Maybe it’s just too early to test, but the official announcement does list vLLM among its inference servers.
The blog post refers to the regular vLLM playbook, which in turn refers to vLLM 26.02, and that version won’t include Transformers v5.5.0, which Gemma4 needs.
But Transformers v5.5.0 doesn’t seem to be sufficient on its own. I did a fresh rebuild of eugr’s edition:
./build-and-copy.sh --tf5 -t eugr/vllm-node:20260402-tf5
This pulls in v5.5.0:
root@bb143232f90c:/workspace/vllm# uv pip list |grep transf
Using Python 3.12.3 environment at: /usr
transformers 5.5.0
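And the same check from inside Python, just to be sure the interpreter resolves the same install that uv reported (nothing vLLM-specific here):

import transformers

# Confirm the version and the install location the interpreter actually uses.
print(transformers.__version__)   # expect 5.5.0
print(transformers.__file__)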
But…
(EngineCore pid=110) WARNING 04-02 17:05:46 [utils.py:188] TransformersMultiModalMoEForCausalLM has no vLLM implementation, falling back to Transformers implementation. Some features may not be supported and performance may not be optimal.
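If you want to double-check what a build implements natively, something like the following should work; ModelRegistry.get_supported_archs() exists in current vLLM, though whether that API is unchanged in this image is an assumption on my part:

from vllm import ModelRegistry

# Print every natively supported architecture that looks Gemma-related;
# if Gemma4's architecture isn't among them, the Transformers fallback kicks in.
for arch in sorted(ModelRegistry.get_supported_archs()):
    if "Gemma" in arch:
        print(arch)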
It seems a piece is still missing on the vLLM side; the fallback to Transformers ends with:
(EngineCore pid=110) ERROR 04-02 17:06:05 [core.py:1108] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/transformers/base.py", line 218, in _patch_config
(EngineCore pid=110) ERROR 04-02 17:06:05 [core.py:1108] if sub_config.dtype != (dtype := self.config.dtype):
(EngineCore pid=110) ERROR 04-02 17:06:05 [core.py:1108] ^^^^^^^^^^^^^^^^
(EngineCore pid=110) ERROR 04-02 17:06:05 [core.py:1108] AttributeError: 'NoneType' object has no attribute 'dtype'
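So something in the config chain comes through as None while vLLM patches dtypes. A minimal sketch of the failure mode, plus the obvious guard; all class names below are made-up stand-ins, not vLLM’s actual internals:

from dataclasses import dataclass
from typing import Optional

# Made-up stand-ins for the config objects in _patch_config; the real
# vLLM/Transformers classes are more involved, this only mirrors the
# shape of the bug.
@dataclass
class FakeSubConfig:
    dtype: str = "bfloat16"

@dataclass
class FakeConfig:
    dtype: str = "bfloat16"
    # A sub-config that was never populated shows up as None.
    text_config: Optional[FakeSubConfig] = None

cfg = FakeConfig()
sub_config = cfg.text_config  # None, as apparently happens here for Gemma4

try:
    if sub_config.dtype != (dtype := cfg.dtype):  # same expression as line 218
        pass
except AttributeError as e:
    print(e)  # 'NoneType' object has no attribute 'dtype'

# The obvious guard, if None is a legitimate state for a sub-config:
if sub_config is not None and sub_config.dtype != cfg.dtype:
    sub_config.dtype = cfg.dtype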
I tried the 26B version:
Ok. NVIDIA also recommends the NIM:
That one is amd64-only for now.
So. Has anyone seen an open PR for vLLM? :-D
llama.cpp already has support, though. That would be my fallback.
First comparisons are in:
EDIT: the post has been removed from Reddit by the mods.
And the full-blown announcement: