You should be getting better performance, even with a dense model.
I suggest removing all the variables you set and dropping --quantization compressed_tensors - not sure why you would put it there.
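Something like this should be enough as a starting point (just a sketch; vLLM normally picks up the quantization scheme from the checkpoint config, so no --quantization override should be needed):

```bash
# Sketch: minimal serve command, letting vLLM detect the NVFP4 quantization
# from the checkpoint's own config instead of forcing it on the CLI
vllm serve nvidia/Gemma-4-31B-IT-NVFP4 \
  --gpu-memory-utilization 0.7
```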
| model | test | t/s | peak t/s | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |
|:----------------------------|-------:|-----------------:|------------:|----------------:|----------------:|----------------:|
| nvidia/Gemma-4-31B-IT-NVFP4 | pp128 | 589.66 ± 110.11 | | 231.35 ± 49.04 | 227.93 ± 49.04 | 231.40 ± 49.05 |
| nvidia/Gemma-4-31B-IT-NVFP4 | tg32 | 6.91 ± 0.00 | 7.00 ± 0.00 | | | |
| nvidia/Gemma-4-31B-IT-NVFP4 | pp128 | 674.10 ± 2.91 | | 194.79 ± 0.83 | 191.37 ± 0.83 | 194.84 ± 0.83 |
| nvidia/Gemma-4-31B-IT-NVFP4 | tg128 | 6.88 ± 0.04 | 7.00 ± 0.00 | | | |
| nvidia/Gemma-4-31B-IT-NVFP4 | pp512 | 1415.36 ± 216.58 | | 375.64 ± 63.87 | 372.23 ± 63.87 | 375.69 ± 63.87 |
| nvidia/Gemma-4-31B-IT-NVFP4 | tg32 | 6.87 ± 0.00 | 7.00 ± 0.00 | | | |
| nvidia/Gemma-4-31B-IT-NVFP4 | pp512 | 1412.71 ± 213.50 | | 376.08 ± 63.06 | 372.66 ± 63.06 | 376.13 ± 63.06 |
| nvidia/Gemma-4-31B-IT-NVFP4 | tg128 | 6.82 ± 0.04 | 7.00 ± 0.00 | | | |
| nvidia/Gemma-4-31B-IT-NVFP4 | pp2048 | 1717.18 ± 83.13 | | 1199.56 ± 59.96 | 1196.14 ± 59.96 | 1199.62 ± 59.96 |
| nvidia/Gemma-4-31B-IT-NVFP4 | tg32 | 6.80 ± 0.01 | 7.00 ± 0.00 | | | |
| nvidia/Gemma-4-31B-IT-NVFP4 | pp2048 | 1771.15 ± 2.44 | | 1160.30 ± 1.59 | 1156.88 ± 1.59 | 1160.37 ± 1.60 |
| nvidia/Gemma-4-31B-IT-NVFP4 | tg128 | 6.81 ± 0.01 | 7.00 ± 0.00 | | | |
llama-benchy (0.3.3)
date: 2026-04-03 07:00:52 | latency mode: api
captonn / cyankiwi also published their AWQ quants in 4-bit:
and even an 8-bit version for the dense one:
Will test them later this morning.
Here are my llama-benchy results (with AI-assisted explanations for those who aren’t experts (yet!)):
vLLM run as follows:
./launch-cluster.sh -t vllm-node-20260402-tf5-pr35568 --solo
exec vllm serve google/gemma-4-26B-A4B-it \
  --max-model-len auto \
  --gpu-memory-utilization 0.7 \
  --enable-auto-tool-choice \
  --tool-call-parser gemma4 \
  --reasoning-parser gemma4 \
  --load-format fastsafetensors \
  --quantization fp8 \
  --kv-cache-dtype fp8
Gemma-4-26B (A4B) – llama-benchy Results Summary
Ran benchmarks against an OpenAI-compatible endpoint (vLLM), single user (concurrency=1), testing different prompt sizes and context lengths up to ~229k tokens.
🔑 Key Findings (with plain-English explanations)
1. Output speed (“decode”) is strong and stable
👉 This is how fast the model generates tokens after it starts responding (i.e., the actual answer speed).
- ~38 tokens/sec at small context
- Drops gradually to ~25 tokens/sec at ~229k context
Takeaway: Once the model starts answering, it’s consistently fast, even with large inputs.
2. Input processing (“prefill”) slows down a lot as context grows
👉 This is the time spent reading and understanding your input before it starts generating an answer.
- ~4600 tokens/sec at small context
- ~1100 t/s at 32k
- ~160 t/s at 229k
Takeaway: The bigger your prompt/history, the slower the model is at getting ready to answer.
3. Time-to-first-token (TTFT) becomes the real problem
👉 This is how long you wait before seeing the first word of the response.
- ~1.8s @ 8k
- ~14.5s @ 32k
- ~49s @ 65k
- ~180s @ 131k
- ~556s (~9 minutes) @ 229k
Takeaway: Even though output speed is fine, startup delay becomes massive with large inputs → kills interactivity.
4. “Long context” = very large inputs (tens to hundreds of thousands of tokens)
👉 Think: huge chat histories, full documents, or large RAG payloads.
- Performance drops are dominated by input processing + memory overhead, not output speed
Takeaway: Long-context workloads are expensive and slow to start, even if generation itself is okay.
5. Prefix caching didn’t noticeably help
👉 This feature is supposed to reuse previously processed input to speed things up.
- Context load times still very high
Takeaway: Don’t assume caching will fix long input delays (at least in this setup). Some of this may be a test artifact and needs more analysis.
🧠 Bottom Line
- Fast answers once generation starts (~30–40 t/s)
- Big delays before answers at larger input sizes
- >100k token inputs are not interactive (minutes of wait time)
- The main bottleneck is reading the input, not generating the output
💡 Practical Implications
- Works well for: chat, agents, typical prompts (<32k tokens)
- Gets slow for: large RAG contexts, long documents
- Not practical for: interactive use with very large inputs (100k+ tokens)
I’m going to run OpenClaw with 32k-48k and see how that goes. Will share findings when I have them.
Can you post raw llama-benchy results?
Also, prefix caching may not be enabled by default for this model - you need to specify --enable-prefix-caching in vLLM parameters.
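For example (a sketch based on the serve command above; whether prefix caching is already on by default depends on the vLLM version):

```bash
# Sketch: same launch as above, with prefix caching explicitly enabled
exec vllm serve google/gemma-4-26B-A4B-it \
  --max-model-len auto \
  --gpu-memory-utilization 0.7 \
  --enable-prefix-caching \
  --quantization fp8 \
  --kv-cache-dtype fp8
```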
| model | test | t/s | peak t/s | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |
|:-----------|----------------:|----------------:|-------------:|-----------------:|-----------------:|-----------------:|
| gemma4:26b | pp2048 | 1931.65 ± 24.60 | | 1353.72 ± 8.83 | 961.42 ± 8.83 | 1353.72 ± 8.83 |
| gemma4:26b | tg32 | 56.38 ± 1.07 | 58.47 ± 1.07 | | | |
| gemma4:26b | pp2048 @ d4096 | 1918.86 ± 24.95 | | 3345.81 ± 14.52 | 2953.51 ± 14.52 | 3345.81 ± 14.52 |
| gemma4:26b | tg32 @ d4096 | 52.70 ± 4.80 | 54.63 ± 4.99 | | | |
| gemma4:26b | pp2048 @ d8192 | 1941.17 ± 16.35 | | 5219.77 ± 69.05 | 4827.47 ± 69.05 | 5219.77 ± 69.05 |
| gemma4:26b | tg32 @ d8192 | 45.44 ± 0.85 | 47.09 ± 0.86 | | | |
| gemma4:26b | pp2048 @ d16384 | 1774.99 ± 6.09 | | 9985.83 ± 46.53 | 9593.53 ± 46.53 | 9985.83 ± 46.53 |
| gemma4:26b | tg32 @ d16384 | 46.22 ± 2.27 | 48.00 ± 2.30 | | | |
| gemma4:26b | pp2048 @ d32768 | 1542.44 ± 3.68 | | 21115.29 ± 87.25 | 20722.98 ± 87.25 | 21115.29 ± 87.25 |
| gemma4:26b | tg32 @ d32768 | 24.67 ± 0.37 | 25.33 ± 0.47 | | | |
llama-benchy (0.3.5)
date: 2026-04-03 14:15:18 | latency mode: generation
Benchmark on Ollama gemma4:26b
What quant?
gemma4:26b - this one is Q4_K_M.
Performance drops very rapidly with context, to the point where vLLM at FP8 beats it on inference speed:
| model | test | t/s | peak t/s | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |
|---|---|---|---|---|---|---|
| google/gemma-4-26B-A4B-it | pp2048 | 5907.78 ± 245.40 | | 353.25 ± 14.87 | 347.45 ± 14.87 | 353.42 ± 14.93 |
| google/gemma-4-26B-A4B-it | tg32 | 38.92 ± 0.77 | 40.19 ± 0.80 | | | |
| google/gemma-4-26B-A4B-it | ctx_pp @ d8192 | 5053.17 ± 95.71 | | 1765.15 ± 29.46 | 1759.34 ± 29.46 | 1765.37 ± 29.52 |
| google/gemma-4-26B-A4B-it | ctx_tg @ d8192 | 37.62 ± 0.05 | 38.84 ± 0.05 | | | |
| google/gemma-4-26B-A4B-it | pp2048 @ d8192 | 3411.45 ± 212.01 | | 608.56 ± 39.05 | 602.76 ± 39.05 | 608.70 ± 39.05 |
| google/gemma-4-26B-A4B-it | tg32 @ d8192 | 38.67 ± 1.72 | 39.94 ± 1.79 | | | |
| google/gemma-4-26B-A4B-it | ctx_pp @ d32768 | 2596.82 ± 1.32 | | 13745.31 ± 5.66 | 13739.51 ± 5.66 | 13746.22 ± 6.60 |
| google/gemma-4-26B-A4B-it | ctx_tg @ d32768 | 36.36 ± 0.06 | 37.55 ± 0.05 | | | |
| google/gemma-4-26B-A4B-it | pp2048 @ d32768 | 1293.68 ± 42.37 | | 1590.62 ± 52.77 | 1584.81 ± 52.77 | 1590.79 ± 52.72 |
| google/gemma-4-26B-A4B-it | tg32 @ d32768 | 37.82 ± 2.31 | 39.05 ± 2.39 | | | |
llama-benchy (0.3.6.dev12+g5e7b509cb)
date: 2026-04-02 23:37:49 | latency mode: api
Have you tried with llama.cpp directly?
Not yet. Last night Ollama was the only way to run it.
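Once a GGUF is available, a direct llama.cpp run could be as simple as this (a sketch; the model path below is a placeholder, not a published file):

```bash
# Sketch: serve a local GGUF with llama.cpp's OpenAI-compatible server
# gemma4-26b-a4b-Q4_K_M.gguf is a hypothetical local path
llama-server \
  -m ./gemma4-26b-a4b-Q4_K_M.gguf \
  -c 32768 \
  -ngl 99 \
  --port 8080
```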
Running on two Sparks:
| model | test | t/s | peak t/s | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |
|---|---|---|---|---|---|---|
| google/gemma-4-26B-A4B-it | pp2048 | 5530.79 ± 2199.18 | | 464.47 ± 228.36 | 459.00 ± 228.36 | 464.62 ± 228.45 |
| google/gemma-4-26B-A4B-it | tg32 | 55.34 ± 0.32 | 57.13 ± 0.33 | | | |
| google/gemma-4-26B-A4B-it | ctx_pp @ d8192 | 7072.22 ± 19.08 | | 1256.01 ± 7.87 | 1250.55 ± 7.87 | 1256.21 ± 7.93 |
| google/gemma-4-26B-A4B-it | ctx_tg @ d8192 | 54.21 ± 0.63 | 55.98 ± 0.64 | | | |
| google/gemma-4-26B-A4B-it | pp2048 @ d8192 | 5177.12 ± 15.03 | | 401.06 ± 1.15 | 395.59 ± 1.15 | 401.38 ± 1.20 |
| google/gemma-4-26B-A4B-it | tg32 @ d8192 | 53.53 ± 0.17 | 55.28 ± 0.17 | | | |
| google/gemma-4-26B-A4B-it | ctx_pp @ d32768 | 4313.47 ± 17.70 | | 8260.43 ± 47.37 | 8254.96 ± 47.37 | 8260.56 ± 47.32 |
| google/gemma-4-26B-A4B-it | ctx_tg @ d32768 | 52.11 ± 0.33 | 53.81 ± 0.34 | | | |
| google/gemma-4-26B-A4B-it | pp2048 @ d32768 | 2236.26 ± 11.96 | | 921.31 ± 4.89 | 915.84 ± 4.89 | 921.47 ± 4.83 |
| google/gemma-4-26B-A4B-it | tg32 @ d32768 | 52.69 ± 0.45 | 54.42 ± 0.48 | | | |
llama-benchy (0.3.6.dev12+g5e7b509cb)
date: 2026-04-02 23:47:59 | latency mode: api | pp basis: ttfr
Pushed a recipe to the repo:
Single Spark:
./run-recipe.sh gemma4-26b-a4b --solo
Dual Sparks:
./run-recipe.sh gemma4-26b-a4b --no-ray
Interestingly, the cyankiwi AWQ quant doesn’t perform better:
| model | test | t/s | peak t/s | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |
|---|---|---|---|---|---|---|
| cyankiwi/gemma-4-26B-A4B-it-AWQ-4bit | pp2048 | 5315.15 ± 527.89 | | 395.76 ± 41.54 | 389.63 ± 41.54 | 395.94 ± 41.49 |
| cyankiwi/gemma-4-26B-A4B-it-AWQ-4bit | tg32 | 40.45 ± 0.03 | 41.76 ± 0.03 | | | |
| cyankiwi/gemma-4-26B-A4B-it-AWQ-4bit | ctx_pp @ d8192 | 4938.54 ± 21.54 | | 1803.62 ± 13.36 | 1797.49 ± 13.36 | 1803.90 ± 13.62 |
| cyankiwi/gemma-4-26B-A4B-it-AWQ-4bit | ctx_tg @ d8192 | 39.46 ± 0.01 | 40.74 ± 0.01 | | | |
| cyankiwi/gemma-4-26B-A4B-it-AWQ-4bit | pp2048 @ d8192 | 3326.77 ± 30.01 | | 621.80 ± 5.59 | 615.66 ± 5.59 | 621.93 ± 5.56 |
| cyankiwi/gemma-4-26B-A4B-it-AWQ-4bit | tg32 @ d8192 | 39.29 ± 0.07 | 40.56 ± 0.08 | | | |
| cyankiwi/gemma-4-26B-A4B-it-AWQ-4bit | ctx_pp @ d32768 | 2493.71 ± 5.39 | | 14311.78 ± 89.36 | 14305.65 ± 89.36 | 14312.65 ± 90.18 |
| cyankiwi/gemma-4-26B-A4B-it-AWQ-4bit | ctx_tg @ d32768 | 37.40 ± 0.11 | 38.62 ± 0.11 | | | |
| cyankiwi/gemma-4-26B-A4B-it-AWQ-4bit | pp2048 @ d32768 | 1270.76 ± 42.60 | | 1619.63 ± 55.41 | 1613.50 ± 55.41 | 1620.09 ± 55.20 |
| cyankiwi/gemma-4-26B-A4B-it-AWQ-4bit | tg32 @ d32768 | 39.23 ± 2.59 | 40.50 ± 2.68 | | | |
llama-benchy (0.3.6.dev12+g5e7b509cb)
date: 2026-04-02 23:52:43 | latency mode: api | pp basis: ttfr
Thanks @eugr !!
I will try that today or this weekend.
Waiting for the Intel AutoRound version as well :)
Did you see the “thinking” process on your side?
Even with OpenWebUI or an agent, the “thinking” never appears, as if it is not activated. Or maybe it’s just not properly parsed and not shown.
Which variant are you using? vLLM, llama.cpp?
For vLLM you need to add the correct reasoning parser. See eugr’s recipe.
Yep, vLLM, and I have set the reasoning parser.
OK. Just started to test myself, but neither Open WebUI nor Cherry Studio AI is showing a thinking process.
Hmm. Might be just a template issue, or Gemma 4 needs additional parameters in the chat requests.
Dual Spark Setup with --no-ray via eugr’s stack. Sharing some results.
TL;DR: I’d go with the cyankiwi dense model and the Google release with FP8 for the MoE.
Gemma 4 26B A4B - Comparison cyankiwi AWQ-4bit vs FP8 quantization:
cyankiwi/gemma-4-26B-A4B-it-AWQ-4bit
| model | test | t/s | peak t/s | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |
|:--------|----------------:|------------------:|-------------:|----------------:|----------------:|----------------:|
| gemma4 | pp2048 | 5119.07 ± 1583.55 | | 415.54 ± 169.37 | 413.66 ± 169.37 | 415.60 ± 169.38 |
| gemma4 | tg128 | 52.81 ± 0.18 | 54.00 ± 0.00 | | | |
| gemma4 | pp2048 @ d4096 | 6424.47 ± 14.86 | | 914.08 ± 5.88 | 912.20 ± 5.88 | 914.14 ± 5.88 |
| gemma4 | tg128 @ d4096 | 47.92 ± 5.56 | 52.67 ± 0.47 | | | |
| gemma4 | pp2048 @ d8192 | 5816.55 ± 115.51 | | 1690.48 ± 27.83 | 1688.60 ± 27.83 | 1690.53 ± 27.83 |
| gemma4 | tg128 @ d8192 | 50.77 ± 0.24 | 51.33 ± 0.47 | | | |
| gemma4 | pp2048 @ d16384 | 5334.19 ± 11.14 | | 3364.79 ± 13.39 | 3362.91 ± 13.39 | 3364.88 ± 13.39 |
| gemma4 | tg128 @ d16384 | 49.31 ± 0.16 | 50.00 ± 0.00 | | | |
llama-benchy (0.3.5)
date: 2026-04-03 10:10:31 | latency mode: api
google/gemma-4-26B-A4B-it
| model | test | t/s | peak t/s | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |
|:--------|----------------:|------------------:|-------------:|----------------:|----------------:|----------------:|
| gemma4 | pp2048 | 7626.50 ± 191.09 | | 251.39 ± 11.62 | 248.79 ± 11.62 | 251.45 ± 11.64 |
| gemma4 | tg128 | 57.26 ± 0.21 | 58.00 ± 0.00 | | | |
| gemma4 | pp2048 @ d4096 | 7044.02 ± 1496.52 | | 882.38 ± 216.67 | 879.78 ± 216.67 | 882.43 ± 216.67 |
| gemma4 | tg128 @ d4096 | 55.52 ± 0.04 | 56.00 ± 0.00 | | | |
| gemma4 | pp2048 @ d8192 | 7103.98 ± 176.92 | | 1397.95 ± 28.14 | 1395.35 ± 28.14 | 1398.01 ± 28.15 |
| gemma4 | tg128 @ d8192 | 54.55 ± 0.11 | 55.00 ± 0.00 | | | |
| gemma4 | pp2048 @ d16384 | 6167.66 ± 19.81 | | 2906.34 ± 13.57 | 2903.74 ± 13.57 | 2906.39 ± 13.58 |
| gemma4 | tg128 @ d16384 | 53.63 ± 0.10 | 54.33 ± 0.47 | | | |
llama-benchy (0.3.5)
date: 2026-04-03 10:01:58 | latency mode: api
Gemma 4 31B - Comparison cyankiwi AWQ-4bit vs FP8 quantization:
cyankiwi/gemma-4-31B-it-AWQ-4bit
| model | test | t/s | peak t/s | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |
|:--------|----------------:|----------------:|-------------:|-----------------:|-----------------:|-----------------:|
| gemma4 | pp2048 | 1509.32 ± 28.16 | | 1249.71 ± 57.50 | 1248.16 ± 57.50 | 1249.77 ± 57.49 |
| gemma4 | tg128 | 18.70 ± 0.02 | 19.00 ± 0.00 | | | |
| gemma4 | pp2048 @ d4096 | 1445.17 ± 9.64 | | 4081.66 ± 38.36 | 4080.12 ± 38.36 | 4081.72 ± 38.37 |
| gemma4 | tg128 @ d4096 | 18.43 ± 0.04 | 19.00 ± 0.00 | | | |
| gemma4 | pp2048 @ d8192 | 1366.90 ± 0.06 | | 7264.20 ± 23.99 | 7262.66 ± 23.99 | 7264.28 ± 23.98 |
| gemma4 | tg128 @ d8192 | 18.12 ± 0.04 | 19.00 ± 0.00 | | | |
| gemma4 | pp2048 @ d16384 | 1190.44 ± 1.30 | | 15018.70 ± 12.20 | 15017.16 ± 12.20 | 15018.76 ± 12.20 |
| gemma4 | tg128 @ d16384 | 17.71 ± 0.05 | 18.00 ± 0.00 | | | |
llama-benchy (0.3.5)
date: 2026-04-03 05:22:59 | latency mode: api
google/gemma-4-31B-it
| model | test | t/s | peak t/s | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |
|:--------|----------------:|-----------------:|-------------:|------------------:|------------------:|------------------:|
| gemma4 | pp2048 | 2716.85 ± 111.27 | | 704.67 ± 33.60 | 701.20 ± 33.60 | 704.74 ± 33.59 |
| gemma4 | tg128 | 12.31 ± 0.05 | 13.00 ± 0.00 | | | |
| gemma4 | pp2048 @ d4096 | 1347.25 ± 9.61 | | 4325.34 ± 67.74 | 4321.87 ± 67.74 | 4325.41 ± 67.74 |
| gemma4 | tg128 @ d4096 | 12.25 ± 0.01 | 13.00 ± 0.00 | | | |
| gemma4 | pp2048 @ d8192 | 1340.52 ± 8.04 | | 7336.79 ± 24.76 | 7333.33 ± 24.76 | 7336.89 ± 24.77 |
| gemma4 | tg128 @ d8192 | 12.11 ± 0.00 | 13.00 ± 0.00 | | | |
| gemma4 | pp2048 @ d16384 | 1125.52 ± 24.89 | | 15893.75 ± 419.09 | 15890.28 ± 419.09 | 15893.84 ± 419.07 |
| gemma4 | tg128 @ d16384 | 11.92 ± 0.04 | 12.67 ± 0.47 | | | |
llama-benchy (0.3.5)
date: 2026-04-03 10:22:49 | latency mode: api
Trigger Thinking: Thinking is enabled by including the <|think|> token at the start of the system prompt. To disable thinking, remove the token.
Standard Generation: When thinking is enabled, the model will output its internal reasoning followed by the final answer using this structure:
<|channel>thought\n[Internal reasoning]<channel|>
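For example, a thinking-enabled request over the OpenAI-compatible endpoint might look like this (a sketch; host, port and prompt content are assumptions, the model name is the one from the vLLM setup above):

```bash
# Sketch: start the system prompt with the <|think|> token to enable thinking.
# Host/port assume vLLM's default OpenAI-compatible endpoint.
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "google/gemma-4-26B-A4B-it",
    "messages": [
      {"role": "system", "content": "<|think|>You are a helpful assistant."},
      {"role": "user", "content": "Briefly explain prefix caching."}
    ]
  }'
```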
As far as I know Open WebUI and Cherry Studio expect <think> </think> tags. Never seen this <|channel><channel|> variant before.