Gemma 4 Models - which vLLM version? Any PRs spotted?

You should be getting better performance, even with a dense model.
I suggest removing all the variables you set, and dropping --quantization compressed_tensors; I'm not sure why you put it there.

 | model                       |   test |              t/s |    peak t/s |       ttfr (ms) |    est_ppt (ms) |   e2e_ttft (ms) |
 |:----------------------------|-------:|-----------------:|------------:|----------------:|----------------:|----------------:|
 | nvidia/Gemma-4-31B-IT-NVFP4 |  pp128 |  589.66 ± 110.11 |             |  231.35 ± 49.04 |  227.93 ± 49.04 |  231.40 ± 49.05 |
 | nvidia/Gemma-4-31B-IT-NVFP4 |   tg32 |      6.91 ± 0.00 | 7.00 ± 0.00 |                 |                 |                 |
 | nvidia/Gemma-4-31B-IT-NVFP4 |  pp128 |    674.10 ± 2.91 |             |   194.79 ± 0.83 |   191.37 ± 0.83 |   194.84 ± 0.83 |
 | nvidia/Gemma-4-31B-IT-NVFP4 |  tg128 |      6.88 ± 0.04 | 7.00 ± 0.00 |                 |                 |                 |
 | nvidia/Gemma-4-31B-IT-NVFP4 |  pp512 | 1415.36 ± 216.58 |             |  375.64 ± 63.87 |  372.23 ± 63.87 |  375.69 ± 63.87 |
 | nvidia/Gemma-4-31B-IT-NVFP4 |   tg32 |      6.87 ± 0.00 | 7.00 ± 0.00 |                 |                 |                 |
 | nvidia/Gemma-4-31B-IT-NVFP4 |  pp512 | 1412.71 ± 213.50 |             |  376.08 ± 63.06 |  372.66 ± 63.06 |  376.13 ± 63.06 |
 | nvidia/Gemma-4-31B-IT-NVFP4 |  tg128 |      6.82 ± 0.04 | 7.00 ± 0.00 |                 |                 |                 |
 | nvidia/Gemma-4-31B-IT-NVFP4 | pp2048 |  1717.18 ± 83.13 |             | 1199.56 ± 59.96 | 1196.14 ± 59.96 | 1199.62 ± 59.96 |
 | nvidia/Gemma-4-31B-IT-NVFP4 |   tg32 |      6.80 ± 0.01 | 7.00 ± 0.00 |                 |                 |                 |
 | nvidia/Gemma-4-31B-IT-NVFP4 | pp2048 |   1771.15 ± 2.44 |             |  1160.30 ± 1.59 |  1156.88 ± 1.59 |  1160.37 ± 1.60 |
 | nvidia/Gemma-4-31B-IT-NVFP4 |  tg128 |      6.81 ± 0.01 | 7.00 ± 0.00 |                 |                 |                 |

 llama-benchy (0.3.3)
 date: 2026-04-03 07:00:52 | latency mode: api

Using https://github.com/johnnynunez/vllm/commits/main/

captonn / cyankiwi also published his AWQ quants in 4-bit:

and even an 8-bit version for the dense one:

Will test them later this morning.

Here are my llama-benchy results (with an AI-assisted explanation for those who aren’t experts (yet!)):

vLLM run as follows:

./launch-cluster.sh -t vllm-node-20260402-tf5-pr35568 --solo
exec vllm serve google/gemma-4-26B-A4B-it \
  --max-model-len auto \
  --gpu-memory-utilization 0.7 \
  --enable-auto-tool-choice \
  --tool-call-parser gemma4 \
  --reasoning-parser gemma4 \
  --load-format fastsafetensors \
  --quantization fp8 \
  --kv-cache-dtype fp8

Gemma-4-26B (A4B) – llama-benchy Results Summary

Ran benchmarks against an OpenAI-compatible endpoint (vLLM), single user (concurrency=1), testing different prompt sizes and context lengths up to ~229k tokens.


🔑 Key Findings (with plain-English explanations)

1. Output speed (“decode”) is strong and stable
👉 This is how fast the model generates tokens after it starts responding (i.e., the actual answer speed).

  • ~38 tokens/sec at small context

  • Drops gradually to ~25 tokens/sec at ~229k context

Takeaway: Once the model starts answering, it’s consistently fast, even with large inputs.
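As a sanity check, decode t/s is easy to measure yourself against any streaming endpoint: count the tokens after the first one and divide by the elapsed time since the first one arrived. A minimal sketch of just the timing math (no client code; the 26 ms spacing below is purely illustrative):

```python
def decode_tps(token_timestamps):
    """Decode tokens/sec: tokens generated after the first one,
    divided by the time elapsed since the first token arrived.
    Excluding the first token keeps prefill/TTFT out of the number."""
    if len(token_timestamps) < 2:
        raise ValueError("need at least two tokens to measure decode speed")
    elapsed = token_timestamps[-1] - token_timestamps[0]
    return (len(token_timestamps) - 1) / elapsed

# Illustrative: 33 tokens arriving 26 ms apart -> ~38.5 t/s
stamps = [i * 0.026 for i in range(33)]
print(round(decode_tps(stamps), 1))
```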


2. Input processing (“prefill”) slows down a lot as context grows
👉 This is the time spent reading and understanding your input before it starts generating an answer.

  • ~4600 tokens/sec at small context

  • ~1100 t/s at 32k

  • ~160 t/s at 229k

Takeaway: The bigger your prompt/history, the slower the model is at getting ready to answer.


3. Time-to-first-token (TTFT) becomes the real problem
👉 This is how long you wait before seeing the first word of the response.

  • ~1.8s @ 8k

  • ~14.5s @ 32k

  • ~49s @ 65k

  • ~180s @ 131k

  • ~556s (~9 minutes) @ 229k

Takeaway:
Even though output speed is fine, startup delay becomes massive with large inputs → kills interactivity.
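The TTFT numbers follow almost directly from prefill throughput: to a first approximation, the wait before the first token is just prompt length divided by prefill speed. A back-of-envelope helper (the 100k/1,000 figures below are made-up illustrations, and the estimate ignores prefix caching, chunked-prefill effects, and network overhead):

```python
def est_ttft_s(prompt_tokens: int, prefill_tps: float) -> float:
    """First-order TTFT estimate: time to process the whole prompt
    before the first output token appears. Ignores prefix caching,
    chunked-prefill overlap, and network overhead."""
    return prompt_tokens / prefill_tps

# Illustrative: a 100k-token prompt at 1,000 t/s prefill -> 100 s wait
print(est_ttft_s(100_000, 1_000.0))  # 100.0
```

This is why decode speed alone says little about interactivity at long context: the prompt-length term dominates.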


4. “Long context” = very large inputs (tens to hundreds of thousands of tokens)
👉 Think: huge chat histories, full documents, or large RAG payloads.

  • Performance drops are dominated by input processing + memory overhead, not output speed

Takeaway: Long-context workloads are expensive and slow to start, even if generation itself is okay.


5. Prefix caching didn’t noticeably help
👉 This feature is supposed to reuse previously processed input to speed things up.

  • Context load times still very high

Takeaway: Don’t assume caching will fix long input delays (at least in this setup). Some of this may be a test artifact and needs more analysis.


🧠 Bottom Line

  • Fast answers once generation starts (~30–40 t/s)

  • Big delays before answers at larger input sizes

  • >100k token inputs are not interactive (minutes of wait time)

  • The main bottleneck is reading the input, not generating the output


💡 Practical Implications

  • Works well for:

    • Chat, agents, typical prompts (<32k tokens)
  • Gets slow for:

    • Large RAG contexts, long documents
  • Not practical for:

    • Interactive use with very large inputs (100k+ tokens)

      I’m going to run OpenClaw with 32k-48k and see how that goes. Will share findings when I have them.


Can you post raw llama-benchy results?
Also, prefix caching may not be enabled by default for this model - you need to specify --enable-prefix-caching in vLLM parameters.

| model      |            test |             t/s |     peak t/s |        ttfr (ms) |     est_ppt (ms) |    e2e_ttft (ms) |
|:-----------|----------------:|----------------:|-------------:|-----------------:|-----------------:|-----------------:|
| gemma4:26b |          pp2048 | 1931.65 ± 24.60 |              |   1353.72 ± 8.83 |    961.42 ± 8.83 |   1353.72 ± 8.83 |
| gemma4:26b |            tg32 |    56.38 ± 1.07 | 58.47 ± 1.07 |                  |                  |                  |
| gemma4:26b |  pp2048 @ d4096 | 1918.86 ± 24.95 |              |  3345.81 ± 14.52 |  2953.51 ± 14.52 |  3345.81 ± 14.52 |
| gemma4:26b |    tg32 @ d4096 |    52.70 ± 4.80 | 54.63 ± 4.99 |                  |                  |                  |
| gemma4:26b |  pp2048 @ d8192 | 1941.17 ± 16.35 |              |  5219.77 ± 69.05 |  4827.47 ± 69.05 |  5219.77 ± 69.05 |
| gemma4:26b |    tg32 @ d8192 |    45.44 ± 0.85 | 47.09 ± 0.86 |                  |                  |                  |
| gemma4:26b | pp2048 @ d16384 |  1774.99 ± 6.09 |              |  9985.83 ± 46.53 |  9593.53 ± 46.53 |  9985.83 ± 46.53 |
| gemma4:26b |   tg32 @ d16384 |    46.22 ± 2.27 | 48.00 ± 2.30 |                  |                  |                  |
| gemma4:26b | pp2048 @ d32768 |  1542.44 ± 3.68 |              | 21115.29 ± 87.25 | 20722.98 ± 87.25 | 21115.29 ± 87.25 |
| gemma4:26b |   tg32 @ d32768 |    24.67 ± 0.37 | 25.33 ± 0.47 |                  |                  |                  |

llama-benchy (0.3.5)
date: 2026-04-03 14:15:18 | latency mode: generation

Benchmark on Ollama gemma4:26b


What quant?

gemma4:26b, this one is Q4_K_M.

Performance drops very rapidly with context, to the point where vLLM at FP8 beats it on inference speed:

| model                     |            test |              t/s |     peak t/s |       ttfr (ms) |    est_ppt (ms) |   e2e_ttft (ms) |
|:--------------------------|----------------:|-----------------:|-------------:|----------------:|----------------:|----------------:|
| google/gemma-4-26B-A4B-it |          pp2048 | 5907.78 ± 245.40 |              |  353.25 ± 14.87 |  347.45 ± 14.87 |  353.42 ± 14.93 |
| google/gemma-4-26B-A4B-it |            tg32 |     38.92 ± 0.77 | 40.19 ± 0.80 |                 |                 |                 |
| google/gemma-4-26B-A4B-it |  ctx_pp @ d8192 |  5053.17 ± 95.71 |              | 1765.15 ± 29.46 | 1759.34 ± 29.46 | 1765.37 ± 29.52 |
| google/gemma-4-26B-A4B-it |  ctx_tg @ d8192 |     37.62 ± 0.05 | 38.84 ± 0.05 |                 |                 |                 |
| google/gemma-4-26B-A4B-it |  pp2048 @ d8192 | 3411.45 ± 212.01 |              |  608.56 ± 39.05 |  602.76 ± 39.05 |  608.70 ± 39.05 |
| google/gemma-4-26B-A4B-it |    tg32 @ d8192 |     38.67 ± 1.72 | 39.94 ± 1.79 |                 |                 |                 |
| google/gemma-4-26B-A4B-it | ctx_pp @ d32768 |   2596.82 ± 1.32 |              | 13745.31 ± 5.66 | 13739.51 ± 5.66 | 13746.22 ± 6.60 |
| google/gemma-4-26B-A4B-it | ctx_tg @ d32768 |     36.36 ± 0.06 | 37.55 ± 0.05 |                 |                 |                 |
| google/gemma-4-26B-A4B-it | pp2048 @ d32768 |  1293.68 ± 42.37 |              | 1590.62 ± 52.77 | 1584.81 ± 52.77 | 1590.79 ± 52.72 |
| google/gemma-4-26B-A4B-it |   tg32 @ d32768 |     37.82 ± 2.31 | 39.05 ± 2.39 |                 |                 |                 |

llama-benchy (0.3.6.dev12+g5e7b509cb)
date: 2026-04-02 23:37:49 | latency mode: api
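If I'm reading llama-benchy's columns correctly (an assumption on my part, not something from its docs), the reported prefill t/s should be roughly the pp batch size divided by est_ppt. The pp2048 row above checks out to within a fraction of a percent:

```python
def prefill_tps(pp_tokens: int, est_ppt_ms: float) -> float:
    """Prefill throughput implied by the est_ppt column (ms -> t/s)."""
    return pp_tokens / (est_ppt_ms / 1000.0)

# pp2048 row above: est_ppt ~347.45 ms, reported t/s ~5907.78
implied = prefill_tps(2048, 347.45)
print(round(implied, 1))  # ~5894, within ~0.3% of the reported 5907.78
```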

Have you tried with llama.cpp directly?

Not yet. Last night, Ollama was the only option for running it.

Running on two Sparks:

| model                     |            test |               t/s |     peak t/s |        ttfr (ms) |     est_ppt (ms) |    e2e_ttft (ms) |
|:--------------------------|----------------:|------------------:|-------------:|-----------------:|-----------------:|-----------------:|
| google/gemma-4-26B-A4B-it |          pp2048 | 5530.79 ± 2199.18 |              |  464.47 ± 228.36 |  459.00 ± 228.36 |  464.62 ± 228.45 |
| google/gemma-4-26B-A4B-it |            tg32 |      55.34 ± 0.32 | 57.13 ± 0.33 |                  |                  |                  |
| google/gemma-4-26B-A4B-it |  ctx_pp @ d8192 |   7072.22 ± 19.08 |              |   1256.01 ± 7.87 |   1250.55 ± 7.87 |   1256.21 ± 7.93 |
| google/gemma-4-26B-A4B-it |  ctx_tg @ d8192 |      54.21 ± 0.63 | 55.98 ± 0.64 |                  |                  |                  |
| google/gemma-4-26B-A4B-it |  pp2048 @ d8192 |   5177.12 ± 15.03 |              |    401.06 ± 1.15 |    395.59 ± 1.15 |    401.38 ± 1.20 |
| google/gemma-4-26B-A4B-it |    tg32 @ d8192 |      53.53 ± 0.17 | 55.28 ± 0.17 |                  |                  |                  |
| google/gemma-4-26B-A4B-it | ctx_pp @ d32768 |   4313.47 ± 17.70 |              |  8260.43 ± 47.37 |  8254.96 ± 47.37 |  8260.56 ± 47.32 |
| google/gemma-4-26B-A4B-it | ctx_tg @ d32768 |      52.11 ± 0.33 | 53.81 ± 0.34 |                  |                  |                  |
| google/gemma-4-26B-A4B-it | pp2048 @ d32768 |   2236.26 ± 11.96 |              |    921.31 ± 4.89 |    915.84 ± 4.89 |    921.47 ± 4.83 |
| google/gemma-4-26B-A4B-it |   tg32 @ d32768 |      52.69 ± 0.45 | 54.42 ± 0.48 |                  |                  |                  |

llama-benchy (0.3.6.dev12+g5e7b509cb)
date: 2026-04-02 23:47:59 | latency mode: api | pp basis: ttfr


Pushed a recipe to the repo:

Single Spark:


./run-recipe.sh gemma4-26b-a4b --solo

Dual Sparks:


./run-recipe.sh gemma4-26b-a4b --no-ray


Interestingly, cyankiwi AWQ quant doesn’t perform better:

| model                                |            test |              t/s |     peak t/s |        ttfr (ms) |     est_ppt (ms) |    e2e_ttft (ms) |
|:-------------------------------------|----------------:|-----------------:|-------------:|-----------------:|-----------------:|-----------------:|
| cyankiwi/gemma-4-26B-A4B-it-AWQ-4bit |          pp2048 | 5315.15 ± 527.89 |              |   395.76 ± 41.54 |   389.63 ± 41.54 |   395.94 ± 41.49 |
| cyankiwi/gemma-4-26B-A4B-it-AWQ-4bit |            tg32 |     40.45 ± 0.03 | 41.76 ± 0.03 |                  |                  |                  |
| cyankiwi/gemma-4-26B-A4B-it-AWQ-4bit |  ctx_pp @ d8192 |  4938.54 ± 21.54 |              |  1803.62 ± 13.36 |  1797.49 ± 13.36 |  1803.90 ± 13.62 |
| cyankiwi/gemma-4-26B-A4B-it-AWQ-4bit |  ctx_tg @ d8192 |     39.46 ± 0.01 | 40.74 ± 0.01 |                  |                  |                  |
| cyankiwi/gemma-4-26B-A4B-it-AWQ-4bit |  pp2048 @ d8192 |  3326.77 ± 30.01 |              |    621.80 ± 5.59 |    615.66 ± 5.59 |    621.93 ± 5.56 |
| cyankiwi/gemma-4-26B-A4B-it-AWQ-4bit |    tg32 @ d8192 |     39.29 ± 0.07 | 40.56 ± 0.08 |                  |                  |                  |
| cyankiwi/gemma-4-26B-A4B-it-AWQ-4bit | ctx_pp @ d32768 |   2493.71 ± 5.39 |              | 14311.78 ± 89.36 | 14305.65 ± 89.36 | 14312.65 ± 90.18 |
| cyankiwi/gemma-4-26B-A4B-it-AWQ-4bit | ctx_tg @ d32768 |     37.40 ± 0.11 | 38.62 ± 0.11 |                  |                  |                  |
| cyankiwi/gemma-4-26B-A4B-it-AWQ-4bit | pp2048 @ d32768 |  1270.76 ± 42.60 |              |  1619.63 ± 55.41 |  1613.50 ± 55.41 |  1620.09 ± 55.20 |
| cyankiwi/gemma-4-26B-A4B-it-AWQ-4bit |   tg32 @ d32768 |     39.23 ± 2.59 | 40.50 ± 2.68 |                  |                  |                  |

llama-benchy (0.3.6.dev12+g5e7b509cb)
date: 2026-04-02 23:52:43 | latency mode: api | pp basis: ttfr


Thanks @eugr !!

I will try that today or this weekend.

Waiting for the Intel AutoRound version as well :)

Did you see the “thinking” process on your side?

Even with Open WebUI or an agent, the “thinking” never appears, as if it were not activated. Or maybe it’s just not parsed properly and therefore not shown.

Which variant are you using? vLLM, llama.cpp?

For vLLM you need to add the correct reasoning parser. See eugr’s recipe.

Yep vLLM and I have set the reasoning parser.

OK. Just started testing myself, but neither Open WebUI nor Cherry Studio AI is showing a thinking process.

Hmm. Might just be a template issue, or Gemma 4 needs additional parameters in the chat requests.

Dual Spark Setup with --no-ray via eugr’s stack. Sharing some results.

TL;DR: I’d go with the cyankiwi dense model and the Google release with FP8 for the MoE.

Gemma 4 26B A4B - Comparison cyankiwi AWQ-4bit vs FP8 quantization:

cyankiwi/gemma-4-26B-A4B-it-AWQ-4bit

| model   |            test |               t/s |     peak t/s |       ttfr (ms) |    est_ppt (ms) |   e2e_ttft (ms) |
|:--------|----------------:|------------------:|-------------:|----------------:|----------------:|----------------:|
| gemma4  |          pp2048 | 5119.07 ± 1583.55 |              | 415.54 ± 169.37 | 413.66 ± 169.37 | 415.60 ± 169.38 |
| gemma4  |           tg128 |      52.81 ± 0.18 | 54.00 ± 0.00 |                 |                 |                 |
| gemma4  |  pp2048 @ d4096 |   6424.47 ± 14.86 |              |   914.08 ± 5.88 |   912.20 ± 5.88 |   914.14 ± 5.88 |
| gemma4  |   tg128 @ d4096 |      47.92 ± 5.56 | 52.67 ± 0.47 |                 |                 |                 |
| gemma4  |  pp2048 @ d8192 |  5816.55 ± 115.51 |              | 1690.48 ± 27.83 | 1688.60 ± 27.83 | 1690.53 ± 27.83 |
| gemma4  |   tg128 @ d8192 |      50.77 ± 0.24 | 51.33 ± 0.47 |                 |                 |                 |
| gemma4  | pp2048 @ d16384 |   5334.19 ± 11.14 |              | 3364.79 ± 13.39 | 3362.91 ± 13.39 | 3364.88 ± 13.39 |
| gemma4  |  tg128 @ d16384 |      49.31 ± 0.16 | 50.00 ± 0.00 |                 |                 |                 |

llama-benchy (0.3.5)
date: 2026-04-03 10:10:31 | latency mode: api
google/gemma-4-26B-A4B-it

| model   |            test |               t/s |     peak t/s |       ttfr (ms) |    est_ppt (ms) |   e2e_ttft (ms) |
|:--------|----------------:|------------------:|-------------:|----------------:|----------------:|----------------:|
| gemma4  |          pp2048 |  7626.50 ± 191.09 |              |  251.39 ± 11.62 |  248.79 ± 11.62 |  251.45 ± 11.64 |
| gemma4  |           tg128 |      57.26 ± 0.21 | 58.00 ± 0.00 |                 |                 |                 |
| gemma4  |  pp2048 @ d4096 | 7044.02 ± 1496.52 |              | 882.38 ± 216.67 | 879.78 ± 216.67 | 882.43 ± 216.67 |
| gemma4  |   tg128 @ d4096 |      55.52 ± 0.04 | 56.00 ± 0.00 |                 |                 |                 |
| gemma4  |  pp2048 @ d8192 |  7103.98 ± 176.92 |              | 1397.95 ± 28.14 | 1395.35 ± 28.14 | 1398.01 ± 28.15 |
| gemma4  |   tg128 @ d8192 |      54.55 ± 0.11 | 55.00 ± 0.00 |                 |                 |                 |
| gemma4  | pp2048 @ d16384 |   6167.66 ± 19.81 |              | 2906.34 ± 13.57 | 2903.74 ± 13.57 | 2906.39 ± 13.58 |
| gemma4  |  tg128 @ d16384 |      53.63 ± 0.10 | 54.33 ± 0.47 |                 |                 |                 |

llama-benchy (0.3.5)
date: 2026-04-03 10:01:58 | latency mode: api

Gemma 4 31B - Comparison cyankiwi AWQ-4bit vs FP8 quantization:

cyankiwi/gemma-4-31B-it-AWQ-4bit

| model   |            test |             t/s |     peak t/s |        ttfr (ms) |     est_ppt (ms) |    e2e_ttft (ms) |
|:--------|----------------:|----------------:|-------------:|-----------------:|-----------------:|-----------------:|
| gemma4  |          pp2048 | 1509.32 ± 28.16 |              |  1249.71 ± 57.50 |  1248.16 ± 57.50 |  1249.77 ± 57.49 |
| gemma4  |           tg128 |    18.70 ± 0.02 | 19.00 ± 0.00 |                  |                  |                  |
| gemma4  |  pp2048 @ d4096 |  1445.17 ± 9.64 |              |  4081.66 ± 38.36 |  4080.12 ± 38.36 |  4081.72 ± 38.37 |
| gemma4  |   tg128 @ d4096 |    18.43 ± 0.04 | 19.00 ± 0.00 |                  |                  |                  |
| gemma4  |  pp2048 @ d8192 |  1366.90 ± 0.06 |              |  7264.20 ± 23.99 |  7262.66 ± 23.99 |  7264.28 ± 23.98 |
| gemma4  |   tg128 @ d8192 |    18.12 ± 0.04 | 19.00 ± 0.00 |                  |                  |                  |
| gemma4  | pp2048 @ d16384 |  1190.44 ± 1.30 |              | 15018.70 ± 12.20 | 15017.16 ± 12.20 | 15018.76 ± 12.20 |
| gemma4  |  tg128 @ d16384 |    17.71 ± 0.05 | 18.00 ± 0.00 |                  |                  |                  |

llama-benchy (0.3.5)
date: 2026-04-03 05:22:59 | latency mode: api
google/gemma-4-31B-it

| model   |            test |              t/s |     peak t/s |         ttfr (ms) |      est_ppt (ms) |     e2e_ttft (ms) |
|:--------|----------------:|-----------------:|-------------:|------------------:|------------------:|------------------:|
| gemma4  |          pp2048 | 2716.85 ± 111.27 |              |    704.67 ± 33.60 |    701.20 ± 33.60 |    704.74 ± 33.59 |
| gemma4  |           tg128 |     12.31 ± 0.05 | 13.00 ± 0.00 |                   |                   |                   |
| gemma4  |  pp2048 @ d4096 |   1347.25 ± 9.61 |              |   4325.34 ± 67.74 |   4321.87 ± 67.74 |   4325.41 ± 67.74 |
| gemma4  |   tg128 @ d4096 |     12.25 ± 0.01 | 13.00 ± 0.00 |                   |                   |                   |
| gemma4  |  pp2048 @ d8192 |   1340.52 ± 8.04 |              |   7336.79 ± 24.76 |   7333.33 ± 24.76 |   7336.89 ± 24.77 |
| gemma4  |   tg128 @ d8192 |     12.11 ± 0.00 | 13.00 ± 0.00 |                   |                   |                   |
| gemma4  | pp2048 @ d16384 |  1125.52 ± 24.89 |              | 15893.75 ± 419.09 | 15890.28 ± 419.09 | 15893.84 ± 419.07 |
| gemma4  |  tg128 @ d16384 |     11.92 ± 0.04 | 12.67 ± 0.47 |                   |                   |                   |

llama-benchy (0.3.5)
date: 2026-04-03 10:22:49 | latency mode: api

Trigger Thinking: Thinking is enabled by including the <|think|> token at the start of the system prompt. To disable thinking, remove the token.

Standard Generation: When thinking is enabled, the model will output its internal reasoning followed by the final answer using this structure:
<|channel>thought\n [Internal reasoning] <channel|>

As far as I know, Open WebUI and Cherry Studio expect <think> </think> tags. I’ve never seen this <|channel><channel|> variant before.
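In the meantime, a small shim could rewrap that block into the tags the frontends expect. A sketch based purely on the format quoted above (the tag spelling is copied verbatim from that quote and may not match the model's actual chat template):

```python
import re

# Pattern taken from the quoted format: <|channel>thought\n ... <channel|>
# (assumption: reasoning appears exactly once in this wrapper)
CHANNEL_RE = re.compile(r"<\|channel>thought\n(.*?)<channel\|>", re.DOTALL)

def to_think_tags(text: str) -> str:
    """Rewrap the reasoning block into the <think>...</think> tags
    that Open WebUI and Cherry Studio expect; pass through untouched
    text that contains no reasoning block."""
    return CHANNEL_RE.sub(lambda m: "<think>" + m.group(1) + "</think>", text)

raw = "<|channel>thought\nLet me check the math.<channel|>The answer is 4."
print(to_think_tags(raw))
# <think>Let me check the math.</think>The answer is 4.
```

A shim like this could sit in a proxy between vLLM and the UI until the parsers catch up.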