Of course, you can install it like any normal Python program. You can find very extensive documentation on the official docs site:
# Install vLLM with a specific CUDA version (e.g., 13.0).
export VLLM_VERSION=$(curl -s https://api.github.com/repos/vllm-project/vllm/releases/latest | jq -r .tag_name | sed 's/^v//')
export CUDA_VERSION=130 # or other
export CPU_ARCH=$(uname -m) # x86_64 or aarch64
uv pip install https://github.com/vllm-project/vllm/releases/download/v${VLLM_VERSION}/vllm-${VLLM_VERSION}+cu${CUDA_VERSION}-cp38-abi3-manylinux_2_35_${CPU_ARCH}.whl --extra-index-url https://download.pytorch.org/whl/cu${CUDA_VERSION}
This will install the latest release, but without any useful patches that may have come up since.
BTW: there is no CUDA 12.1a. 12.1a refers to the architecture (GB10), or "compute capability" as it is called in the official documentation.
But be aware: when not using the community eugr edition of vLLM, you will miss some of the patches that improve overall performance and fix annoying bugs that spoil the fun, especially with Gemma4. vLLM still has open issues for Gemma4.
EDIT: You might need to update the transformers version manually. Last time I checked, the official build was still using an older transformers version than Gemma4 needs (>=5.5.0).
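If you want to check whether your environment is affected before upgrading, a quick sketch like the following works; the >=5.5.0 requirement is the figure quoted above, and `needs_upgrade` is just a hypothetical helper (compare the dotted versions numerically, not as strings):

```python
# Check whether the installed transformers is new enough for Gemma4.
# The 5.5.0 threshold comes from the post above, not from any official spec.
from importlib.metadata import version  # stdlib; reads installed package metadata

def needs_upgrade(installed: str, required: str = "5.5.0") -> bool:
    """Compare dotted version strings numerically, not lexically."""
    to_tuple = lambda v: tuple(int(x) for x in v.split("."))
    return to_tuple(installed) < to_tuple(required)

# e.g., inside the vLLM environment:
# if needs_upgrade(version("transformers")):
#     print("run: uv pip install --upgrade 'transformers>=5.5.0'")
print(needs_upgrade("4.57.1"))  # True -- older than 5.5.0
```

Note that a naive string comparison would get this wrong ("5.10.0" < "5.5.0" lexically), which is why the sketch converts to integer tuples first.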
Total newbie here. Every one of you knows way more than me, but I wanted to share my experience. Yesterday I decided to give @eugr's implementation a try using the sparkrun software package. It loaded in vLLM with about 116 GB of RAM usage. I opened up a basic Docker container with open-webui to "chat" with it. Oh my goodness!! I have NEVER seen a model respond so quickly; it was instantaneous, no matter what questions I threw at it. Advanced physics, engineering, electrical design, random puzzles: it handled them all like a champ. I chatted with it for several HOURS in the same session; it never slowed down, never went sideways. My cache hit usage hit the high 90s, and honestly I forgot to look at tokens per second. But man, I have seen nothing like it so far on the Spark, at least for a semi-large(ish) model.

I only have 1 Spark. Sadly it uses too much memory: when I tried to fire up my other models to run in Agent Zero (I was going to replace my Qwen3 model, also from Eugr, with Gemma to test it out), I ran out of memory and it shut itself down. I know I could probably turn the memory usage down a bit and make it work, but I don't want to lose any of the "brain" of Gemma4.
It's one of only two models that answered the "car wash riddle" correctly, and I have tested a lot of models! If you are unfamiliar with the riddle, give it a try in your favorite LLM to see if it can figure it out. It goes as follows:
I live next to a car wash. My car is very dirty. It needs a wash. Should I walk or drive to the car wash? That's it. Many models will say to walk, which, well, you know, misses the point. Some go on and rant about how it's safer and better for the environment, blah blah, you name it. Usually a model will self-correct if you reply with "well, if I walk, should I carry my car on my back?" Most figure it out then; some throw weird answers like "trying to carry it on your back is dangerous", lol. Anyways, TL/DR: Gemma4 on vLLM with Eugr's not-so-secret sauce is amazing. If we could just reduce the RAM some more without losing the speed and accuracy, that would be awesome. Come on Eugr, I know you can do it! :) Cheers all!
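For anyone who wants to run the same riddle test against their own setup, here is a minimal sketch using vLLM's OpenAI-compatible chat completions endpoint. The base URL and model name are assumptions (adjust to whatever your recipe actually serves); `build_request` and `ask` are just illustrative helpers, not part of any of the tools discussed here:

```python
# Send the "car wash riddle" to a locally served model via the
# OpenAI-compatible /v1/chat/completions API that vLLM exposes.
import json
import urllib.request

RIDDLE = ("I live next to a car wash. My car is very dirty. It needs a wash. "
          "Should I walk or drive to the car wash?")

def build_request(model: str, prompt: str) -> dict:
    """Assemble an OpenAI-style chat completion payload."""
    return {"model": model, "messages": [{"role": "user", "content": prompt}]}

def ask(base_url: str, model: str, prompt: str) -> str:
    """POST the payload and return the assistant's reply text."""
    payload = json.dumps(build_request(model, prompt)).encode()
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions", data=payload,
        headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

# Usage (assumes a server on port 8000 and that the model name matches
# whatever --served-model-name was set to):
# print(ask("http://localhost:8000", "gemma4", RIDDLE))
```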
I regret having to make a reality check, but Gemma 4 31B is a dense model. With 31B active parameters and 273 GB/s of memory bandwidth on the Spark, we are not going very far. Granted, there is still more tok/s to be squeezed out with more aggressive quants, but as long as we retain at least 4-bit quants, inference will remain sluggish at best. 🤷🏻♂️ At least this is the expected behaviour on a single Spark; dense models will benefit from parallelism and, there, YMMV indeed.
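The back-of-envelope math behind that claim: a memory-bandwidth-bound dense model has to read all of its weights once per generated token, so bandwidth divided by weight size gives a hard ceiling on decode speed. The 31B and 273 GB/s figures come from the post above; real throughput will be lower still once KV cache reads and compute overheads are counted.

```python
# Theoretical upper bound on decode tok/s for a bandwidth-bound dense model:
# every generated token must stream all weights from memory once.

def max_tokens_per_s(params_b: float, bits_per_weight: float, bw_gb_s: float) -> float:
    """Bandwidth (GB/s) divided by bytes of weights read per token (GB)."""
    weight_gb = params_b * bits_per_weight / 8  # 31B at 4-bit -> 15.5 GB
    return bw_gb_s / weight_gb

print(f"{max_tokens_per_s(31, 4, 273):.1f} tok/s")  # ~17.6 at 4-bit
print(f"{max_tokens_per_s(31, 8, 273):.1f} tok/s")  # ~8.8 at 8-bit
```

So even in the ideal case, a 4-bit quant of a 31B dense model tops out under ~18 tok/s on 273 GB/s, which is why MoE models with few active parameters feel so much faster on this hardware.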
Understood; to be clear, I'm not running sparkrun, but the newest spark-vllm-docker, and I followed the instructions on the Spark Arena - LLM Leaderboard, which do not seem to work when following the How to Use instructions :-(
Do you have particular logs/issues you could share?
I’ll try it out directly with the run-recipe in spark-vllm-docker.
Note that the working version of the plan is for sparkrun to be the authoritative means of running recipes, so I always recommend that first. sparkrun uses spark-vllm-docker to provide vllm, but supports more complexity and growth in what we can do with recipes.
The --setup works and the Docker container is created, but when the recipe starts to load in --solo mode, it fails with "Error: Recipe missing required field: name", and if I manually add a name field, another half dozen items come up missing or not found. Most of them seem to be items listed in defaults that are not loading into the command.
~/spark-vllm-docker$ ./run-recipe.sh gemma4-26b-a4b-AWQ --solo --setup --served-model-name cyankiwi/gemma4-26b-a4b-AWQ
Warning: Recipe uses schema version '2', but this run-recipe.py supports: ['1']
Some features may not work correctly. Consider updating run-recipe.py.
Recipe: Gemma4-26B-A4B-AWQ
Thanks for reporting that; there's a bug right now where Version 2 recipes (sparkrun only) are showing instructions to run them on spark-vllm-docker. I'll fix it.
I’ll save you a really long explanation on this (which you can find in abundance elsewhere on this forum):
- NVFP4 is not performant on DGX Spark (still/yet)
- AWQ/AutoRound quants almost always offer faster speed and equal or better quality
- The boost NVFP4 is supposed to provide on Blackwell doesn't work on our Blackwells.
On the vLLM side, we're tracking a few reasoning-parser and tool-call-parser fixes for streaming responses with Gemma4, as well as some general chat-template issues with the model that impact all inference servers in multi-turn agentic workflows; we're sorting those out. These should all make their way into vLLM main and releases shortly, but feel free to ping me (bbrowning) on issues opened in vLLM's GitHub if you hit specific problems that need triage and fixing there.
As a fellow DGX Spark daily driver, thanks for being so on top of these things!