You should be getting better performance, even with a dense model.
I suggest removing all the variables you set and dropping --quantization compressed_tensors - not sure why you would put it there.
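Something like this should be enough as a starting point (just a sketch; vLLM normally picks up the quantization scheme from the checkpoint config, so no --quantization override should be needed):

```bash
# Sketch: minimal serve command, letting vLLM detect the NVFP4 quantization
# from the checkpoint's own config instead of forcing it on the CLI
vllm serve nvidia/Gemma-4-31B-IT-NVFP4 \
  --gpu-memory-utilization 0.7
```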
| model | test | t/s | peak t/s | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |
|:----------------------------|-------:|-----------------:|------------:|----------------:|----------------:|----------------:|
| nvidia/Gemma-4-31B-IT-NVFP4 | pp128 | 589.66 ± 110.11 | | 231.35 ± 49.04 | 227.93 ± 49.04 | 231.40 ± 49.05 |
| nvidia/Gemma-4-31B-IT-NVFP4 | tg32 | 6.91 ± 0.00 | 7.00 ± 0.00 | | | |
| nvidia/Gemma-4-31B-IT-NVFP4 | pp128 | 674.10 ± 2.91 | | 194.79 ± 0.83 | 191.37 ± 0.83 | 194.84 ± 0.83 |
| nvidia/Gemma-4-31B-IT-NVFP4 | tg128 | 6.88 ± 0.04 | 7.00 ± 0.00 | | | |
| nvidia/Gemma-4-31B-IT-NVFP4 | pp512 | 1415.36 ± 216.58 | | 375.64 ± 63.87 | 372.23 ± 63.87 | 375.69 ± 63.87 |
| nvidia/Gemma-4-31B-IT-NVFP4 | tg32 | 6.87 ± 0.00 | 7.00 ± 0.00 | | | |
| nvidia/Gemma-4-31B-IT-NVFP4 | pp512 | 1412.71 ± 213.50 | | 376.08 ± 63.06 | 372.66 ± 63.06 | 376.13 ± 63.06 |
| nvidia/Gemma-4-31B-IT-NVFP4 | tg128 | 6.82 ± 0.04 | 7.00 ± 0.00 | | | |
| nvidia/Gemma-4-31B-IT-NVFP4 | pp2048 | 1717.18 ± 83.13 | | 1199.56 ± 59.96 | 1196.14 ± 59.96 | 1199.62 ± 59.96 |
| nvidia/Gemma-4-31B-IT-NVFP4 | tg32 | 6.80 ± 0.01 | 7.00 ± 0.00 | | | |
| nvidia/Gemma-4-31B-IT-NVFP4 | pp2048 | 1771.15 ± 2.44 | | 1160.30 ± 1.59 | 1156.88 ± 1.59 | 1160.37 ± 1.60 |
| nvidia/Gemma-4-31B-IT-NVFP4 | tg128 | 6.81 ± 0.01 | 7.00 ± 0.00 | | | |
llama-benchy (0.3.3)
date: 2026-04-03 07:00:52 | latency mode: api
captonn / cyankiwi also published their AWQ quants in 4-bit:
and even an 8-bit version for the dense one:
Will test them later this morning.
Here are my llama-benchy results (with AI-assisted explanations for those who aren’t experts (yet!)):
vLLM run as follows:
./launch-cluster.sh -t vllm-node-20260402-tf5-pr35568 --solo
exec vllm serve google/gemma-4-26B-A4B-it \
  --max-model-len auto \
  --gpu-memory-utilization 0.7 \
  --enable-auto-tool-choice \
  --tool-call-parser gemma4 \
  --reasoning-parser gemma4 \
  --load-format fastsafetensors \
  --quantization fp8 \
  --kv-cache-dtype fp8
Gemma-4-26B (A4B) – llama-benchy Results Summary
Ran benchmarks against an OpenAI-compatible endpoint (vLLM), single user (concurrency=1), testing different prompt sizes and context lengths up to ~229k tokens.
🔑 Key Findings (with plain-English explanations)
1. Output speed (“decode”) is strong and stable
👉 This is how fast the model generates tokens after it starts responding (i.e., the actual answer speed).
- ~38 tokens/sec at small context
- Drops gradually to ~25 tokens/sec at ~229k context
Takeaway: Once the model starts answering, it’s consistently fast, even with large inputs.
2. Input processing (“prefill”) slows down a lot as context grows
👉 This is the time spent reading and understanding your input before it starts generating an answer.
- ~4600 tokens/sec at small context
- ~1100 t/s at 32k
- ~160 t/s at 229k
Takeaway: The bigger your prompt/history, the slower the model is at getting ready to answer.
3. Time-to-first-token (TTFT) becomes the real problem
👉 This is how long you wait before seeing the first word of the response.
- ~1.8s @ 8k
- ~14.5s @ 32k
- ~49s @ 65k
- ~180s @ 131k
- ~556s (~9 minutes) @ 229k
Takeaway: Even though output speed is fine, startup delay becomes massive with large inputs → kills interactivity.
4. “Long context” = very large inputs (tens to hundreds of thousands of tokens)
👉 Think: huge chat histories, full documents, or large RAG payloads.
- Performance drops are dominated by input processing + memory overhead, not output speed
Takeaway: Long-context workloads are expensive and slow to start, even if generation itself is okay.
5. Prefix caching didn’t noticeably help
👉 This feature is supposed to reuse previously processed input to speed things up.
- Context load times still very high
Takeaway: Don’t assume caching will fix long input delays (at least in this setup). Some of this may be a test artifact and needs more analysis.
🧠 Bottom Line
- Fast answers once generation starts (~30–40 t/s)
- Big delays before answers at larger input sizes
- >100k token inputs are not interactive (minutes of wait time)
- The main bottleneck is reading the input, not generating the output
💡 Practical Implications
- Works well for: chat, agents, typical prompts (<32k tokens)
- Gets slow for: large RAG contexts, long documents
- Not practical for: interactive use with very large inputs (100k+ tokens)
I’m going to run OpenClaw with 32k-48k and see how that goes. Will share findings when I have them.
Can you post raw llama-benchy results?
Also, prefix caching may not be enabled by default for this model - you need to specify --enable-prefix-caching in vLLM parameters.
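For example (a sketch based on the serve command above; whether prefix caching is already on by default depends on the vLLM version):

```bash
# Sketch: same launch as above, with prefix caching explicitly enabled
exec vllm serve google/gemma-4-26B-A4B-it \
  --max-model-len auto \
  --gpu-memory-utilization 0.7 \
  --enable-prefix-caching \
  --quantization fp8 \
  --kv-cache-dtype fp8
```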
| model | test | t/s | peak t/s | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |
|:-----------|----------------:|----------------:|-------------:|-----------------:|-----------------:|-----------------:|
| gemma4:26b | pp2048 | 1931.65 ± 24.60 | | 1353.72 ± 8.83 | 961.42 ± 8.83 | 1353.72 ± 8.83 |
| gemma4:26b | tg32 | 56.38 ± 1.07 | 58.47 ± 1.07 | | | |
| gemma4:26b | pp2048 @ d4096 | 1918.86 ± 24.95 | | 3345.81 ± 14.52 | 2953.51 ± 14.52 | 3345.81 ± 14.52 |
| gemma4:26b | tg32 @ d4096 | 52.70 ± 4.80 | 54.63 ± 4.99 | | | |
| gemma4:26b | pp2048 @ d8192 | 1941.17 ± 16.35 | | 5219.77 ± 69.05 | 4827.47 ± 69.05 | 5219.77 ± 69.05 |
| gemma4:26b | tg32 @ d8192 | 45.44 ± 0.85 | 47.09 ± 0.86 | | | |
| gemma4:26b | pp2048 @ d16384 | 1774.99 ± 6.09 | | 9985.83 ± 46.53 | 9593.53 ± 46.53 | 9985.83 ± 46.53 |
| gemma4:26b | tg32 @ d16384 | 46.22 ± 2.27 | 48.00 ± 2.30 | | | |
| gemma4:26b | pp2048 @ d32768 | 1542.44 ± 3.68 | | 21115.29 ± 87.25 | 20722.98 ± 87.25 | 21115.29 ± 87.25 |
| gemma4:26b | tg32 @ d32768 | 24.67 ± 0.37 | 25.33 ± 0.47 | | | |
llama-benchy (0.3.5)
date: 2026-04-03 14:15:18 | latency mode: generation
Benchmark on Ollama gemma4:26b
What quant?
gemma4:26b - this one is Q4_K_M.
Performance drops very rapidly with context, to the point where vLLM at FP8 beats it on inference speed:
| model | test | t/s | peak t/s | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |
|---|---|---|---|---|---|---|
| google/gemma-4-26B-A4B-it | pp2048 | 5907.78 ± 245.40 | | 353.25 ± 14.87 | 347.45 ± 14.87 | 353.42 ± 14.93 |
| google/gemma-4-26B-A4B-it | tg32 | 38.92 ± 0.77 | 40.19 ± 0.80 | | | |
| google/gemma-4-26B-A4B-it | ctx_pp @ d8192 | 5053.17 ± 95.71 | | 1765.15 ± 29.46 | 1759.34 ± 29.46 | 1765.37 ± 29.52 |
| google/gemma-4-26B-A4B-it | ctx_tg @ d8192 | 37.62 ± 0.05 | 38.84 ± 0.05 | | | |
| google/gemma-4-26B-A4B-it | pp2048 @ d8192 | 3411.45 ± 212.01 | | 608.56 ± 39.05 | 602.76 ± 39.05 | 608.70 ± 39.05 |
| google/gemma-4-26B-A4B-it | tg32 @ d8192 | 38.67 ± 1.72 | 39.94 ± 1.79 | | | |
| google/gemma-4-26B-A4B-it | ctx_pp @ d32768 | 2596.82 ± 1.32 | | 13745.31 ± 5.66 | 13739.51 ± 5.66 | 13746.22 ± 6.60 |
| google/gemma-4-26B-A4B-it | ctx_tg @ d32768 | 36.36 ± 0.06 | 37.55 ± 0.05 | | | |
| google/gemma-4-26B-A4B-it | pp2048 @ d32768 | 1293.68 ± 42.37 | | 1590.62 ± 52.77 | 1584.81 ± 52.77 | 1590.79 ± 52.72 |
| google/gemma-4-26B-A4B-it | tg32 @ d32768 | 37.82 ± 2.31 | 39.05 ± 2.39 | | | |
llama-benchy (0.3.6.dev12+g5e7b509cb)
date: 2026-04-02 23:37:49 | latency mode: api
Have you tried with llama.cpp directly?
Not yet. Last night Ollama was the only way to run it.
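Once a GGUF is available, a direct llama.cpp run could be as simple as this (a sketch; the model path below is a placeholder, not a published file):

```bash
# Sketch: serve a local GGUF with llama.cpp's OpenAI-compatible server
# gemma4-26b-a4b-Q4_K_M.gguf is a hypothetical local path
llama-server \
  -m ./gemma4-26b-a4b-Q4_K_M.gguf \
  -c 32768 \
  -ngl 99 \
  --port 8080
```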
Running on two Sparks:
| model | test | t/s | peak t/s | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |
|---|---|---|---|---|---|---|
| google/gemma-4-26B-A4B-it | pp2048 | 5530.79 ± 2199.18 | | 464.47 ± 228.36 | 459.00 ± 228.36 | 464.62 ± 228.45 |
| google/gemma-4-26B-A4B-it | tg32 | 55.34 ± 0.32 | 57.13 ± 0.33 | | | |
| google/gemma-4-26B-A4B-it | ctx_pp @ d8192 | 7072.22 ± 19.08 | | 1256.01 ± 7.87 | 1250.55 ± 7.87 | 1256.21 ± 7.93 |
| google/gemma-4-26B-A4B-it | ctx_tg @ d8192 | 54.21 ± 0.63 | 55.98 ± 0.64 | | | |
| google/gemma-4-26B-A4B-it | pp2048 @ d8192 | 5177.12 ± 15.03 | | 401.06 ± 1.15 | 395.59 ± 1.15 | 401.38 ± 1.20 |
| google/gemma-4-26B-A4B-it | tg32 @ d8192 | 53.53 ± 0.17 | 55.28 ± 0.17 | | | |
| google/gemma-4-26B-A4B-it | ctx_pp @ d32768 | 4313.47 ± 17.70 | | 8260.43 ± 47.37 | 8254.96 ± 47.37 | 8260.56 ± 47.32 |
| google/gemma-4-26B-A4B-it | ctx_tg @ d32768 | 52.11 ± 0.33 | 53.81 ± 0.34 | | | |
| google/gemma-4-26B-A4B-it | pp2048 @ d32768 | 2236.26 ± 11.96 | | 921.31 ± 4.89 | 915.84 ± 4.89 | 921.47 ± 4.83 |
| google/gemma-4-26B-A4B-it | tg32 @ d32768 | 52.69 ± 0.45 | 54.42 ± 0.48 | | | |
llama-benchy (0.3.6.dev12+g5e7b509cb)
date: 2026-04-02 23:47:59 | latency mode: api | pp basis: ttfr
Pushed a recipe to the repo:
Single Spark:
./run-recipe.sh gemma4-26b-a4b --solo
Dual Sparks:
./run-recipe.sh gemma4-26b-a4b --no-ray
Interestingly, the cyankiwi AWQ quant doesn’t perform better:
| model | test | t/s | peak t/s | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |
|---|---|---|---|---|---|---|
| cyankiwi/gemma-4-26B-A4B-it-AWQ-4bit | pp2048 | 5315.15 ± 527.89 | | 395.76 ± 41.54 | 389.63 ± 41.54 | 395.94 ± 41.49 |
| cyankiwi/gemma-4-26B-A4B-it-AWQ-4bit | tg32 | 40.45 ± 0.03 | 41.76 ± 0.03 | | | |
| cyankiwi/gemma-4-26B-A4B-it-AWQ-4bit | ctx_pp @ d8192 | 4938.54 ± 21.54 | | 1803.62 ± 13.36 | 1797.49 ± 13.36 | 1803.90 ± 13.62 |
| cyankiwi/gemma-4-26B-A4B-it-AWQ-4bit | ctx_tg @ d8192 | 39.46 ± 0.01 | 40.74 ± 0.01 | | | |
| cyankiwi/gemma-4-26B-A4B-it-AWQ-4bit | pp2048 @ d8192 | 3326.77 ± 30.01 | | 621.80 ± 5.59 | 615.66 ± 5.59 | 621.93 ± 5.56 |
| cyankiwi/gemma-4-26B-A4B-it-AWQ-4bit | tg32 @ d8192 | 39.29 ± 0.07 | 40.56 ± 0.08 | | | |
| cyankiwi/gemma-4-26B-A4B-it-AWQ-4bit | ctx_pp @ d32768 | 2493.71 ± 5.39 | | 14311.78 ± 89.36 | 14305.65 ± 89.36 | 14312.65 ± 90.18 |
| cyankiwi/gemma-4-26B-A4B-it-AWQ-4bit | ctx_tg @ d32768 | 37.40 ± 0.11 | 38.62 ± 0.11 | | | |
| cyankiwi/gemma-4-26B-A4B-it-AWQ-4bit | pp2048 @ d32768 | 1270.76 ± 42.60 | | 1619.63 ± 55.41 | 1613.50 ± 55.41 | 1620.09 ± 55.20 |
| cyankiwi/gemma-4-26B-A4B-it-AWQ-4bit | tg32 @ d32768 | 39.23 ± 2.59 | 40.50 ± 2.68 | | | |
llama-benchy (0.3.6.dev12+g5e7b509cb)
date: 2026-04-02 23:52:43 | latency mode: api | pp basis: ttfr
Thanks @eugr !!
I will try that today or this weekend.
Waiting for the Intel AutoRound version as well :)
Did you see the “thinking” process on your side?
Even with OpenWebUI or an agent, the “thinking” never appears, as if it is not activated. Or maybe it’s just not properly parsed and not shown.
Which variant are you using? vLLM, llama.cpp?
For vLLM you need to add the correct reasoning parser. See eugr’s recipe.
Yep, vLLM, and I have set the reasoning parser.
OK. Just started to test myself, but neither Open WebUI nor Cherry Studio AI is showing a thinking process.
Hmm. Might be just a template issue, or Gemma 4 needs additional parameters in the chat requests.
Dual Spark Setup with --no-ray via eugr’s stack. Sharing some results.
TL;DR: I’d go with the cyankiwi dense model and the Google release with FP8 for the MoE.
Gemma 4 26B A4B - Comparison cyankiwi AWQ-4bit vs FP8 quantization:
cyankiwi/gemma-4-26B-A4B-it-AWQ-4bit
| model | test | t/s | peak t/s | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |
|:--------|----------------:|------------------:|-------------:|----------------:|----------------:|----------------:|
| gemma4 | pp2048 | 5119.07 ± 1583.55 | | 415.54 ± 169.37 | 413.66 ± 169.37 | 415.60 ± 169.38 |
| gemma4 | tg128 | 52.81 ± 0.18 | 54.00 ± 0.00 | | | |
| gemma4 | pp2048 @ d4096 | 6424.47 ± 14.86 | | 914.08 ± 5.88 | 912.20 ± 5.88 | 914.14 ± 5.88 |
| gemma4 | tg128 @ d4096 | 47.92 ± 5.56 | 52.67 ± 0.47 | | | |
| gemma4 | pp2048 @ d8192 | 5816.55 ± 115.51 | | 1690.48 ± 27.83 | 1688.60 ± 27.83 | 1690.53 ± 27.83 |
| gemma4 | tg128 @ d8192 | 50.77 ± 0.24 | 51.33 ± 0.47 | | | |
| gemma4 | pp2048 @ d16384 | 5334.19 ± 11.14 | | 3364.79 ± 13.39 | 3362.91 ± 13.39 | 3364.88 ± 13.39 |
| gemma4 | tg128 @ d16384 | 49.31 ± 0.16 | 50.00 ± 0.00 | | | |
llama-benchy (0.3.5)
date: 2026-04-03 10:10:31 | latency mode: api
google/gemma-4-26B-A4B-it
| model | test | t/s | peak t/s | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |
|:--------|----------------:|------------------:|-------------:|----------------:|----------------:|----------------:|
| gemma4 | pp2048 | 7626.50 ± 191.09 | | 251.39 ± 11.62 | 248.79 ± 11.62 | 251.45 ± 11.64 |
| gemma4 | tg128 | 57.26 ± 0.21 | 58.00 ± 0.00 | | | |
| gemma4 | pp2048 @ d4096 | 7044.02 ± 1496.52 | | 882.38 ± 216.67 | 879.78 ± 216.67 | 882.43 ± 216.67 |
| gemma4 | tg128 @ d4096 | 55.52 ± 0.04 | 56.00 ± 0.00 | | | |
| gemma4 | pp2048 @ d8192 | 7103.98 ± 176.92 | | 1397.95 ± 28.14 | 1395.35 ± 28.14 | 1398.01 ± 28.15 |
| gemma4 | tg128 @ d8192 | 54.55 ± 0.11 | 55.00 ± 0.00 | | | |
| gemma4 | pp2048 @ d16384 | 6167.66 ± 19.81 | | 2906.34 ± 13.57 | 2903.74 ± 13.57 | 2906.39 ± 13.58 |
| gemma4 | tg128 @ d16384 | 53.63 ± 0.10 | 54.33 ± 0.47 | | | |
llama-benchy (0.3.5)
date: 2026-04-03 10:01:58 | latency mode: api
Gemma 4 31B - Comparison cyankiwi AWQ-4bit vs FP8 quantization:
cyankiwi/gemma-4-31B-it-AWQ-4bit
| model | test | t/s | peak t/s | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |
|:--------|----------------:|----------------:|-------------:|-----------------:|-----------------:|-----------------:|
| gemma4 | pp2048 | 1509.32 ± 28.16 | | 1249.71 ± 57.50 | 1248.16 ± 57.50 | 1249.77 ± 57.49 |
| gemma4 | tg128 | 18.70 ± 0.02 | 19.00 ± 0.00 | | | |
| gemma4 | pp2048 @ d4096 | 1445.17 ± 9.64 | | 4081.66 ± 38.36 | 4080.12 ± 38.36 | 4081.72 ± 38.37 |
| gemma4 | tg128 @ d4096 | 18.43 ± 0.04 | 19.00 ± 0.00 | | | |
| gemma4 | pp2048 @ d8192 | 1366.90 ± 0.06 | | 7264.20 ± 23.99 | 7262.66 ± 23.99 | 7264.28 ± 23.98 |
| gemma4 | tg128 @ d8192 | 18.12 ± 0.04 | 19.00 ± 0.00 | | | |
| gemma4 | pp2048 @ d16384 | 1190.44 ± 1.30 | | 15018.70 ± 12.20 | 15017.16 ± 12.20 | 15018.76 ± 12.20 |
| gemma4 | tg128 @ d16384 | 17.71 ± 0.05 | 18.00 ± 0.00 | | | |
llama-benchy (0.3.5)
date: 2026-04-03 05:22:59 | latency mode: api
google/gemma-4-31B-it
| model | test | t/s | peak t/s | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |
|:--------|----------------:|-----------------:|-------------:|------------------:|------------------:|------------------:|
| gemma4 | pp2048 | 2716.85 ± 111.27 | | 704.67 ± 33.60 | 701.20 ± 33.60 | 704.74 ± 33.59 |
| gemma4 | tg128 | 12.31 ± 0.05 | 13.00 ± 0.00 | | | |
| gemma4 | pp2048 @ d4096 | 1347.25 ± 9.61 | | 4325.34 ± 67.74 | 4321.87 ± 67.74 | 4325.41 ± 67.74 |
| gemma4 | tg128 @ d4096 | 12.25 ± 0.01 | 13.00 ± 0.00 | | | |
| gemma4 | pp2048 @ d8192 | 1340.52 ± 8.04 | | 7336.79 ± 24.76 | 7333.33 ± 24.76 | 7336.89 ± 24.77 |
| gemma4 | tg128 @ d8192 | 12.11 ± 0.00 | 13.00 ± 0.00 | | | |
| gemma4 | pp2048 @ d16384 | 1125.52 ± 24.89 | | 15893.75 ± 419.09 | 15890.28 ± 419.09 | 15893.84 ± 419.07 |
| gemma4 | tg128 @ d16384 | 11.92 ± 0.04 | 12.67 ± 0.47 | | | |
llama-benchy (0.3.5)
date: 2026-04-03 10:22:49 | latency mode: api
Trigger Thinking: Thinking is enabled by including the <|think|> token at the start of the system prompt. To disable thinking, remove the token.
Standard Generation: When thinking is enabled, the model will output its internal reasoning followed by the final answer using this structure:
<|channel>thought\n[Internal reasoning]<channel|>
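For example, a thinking-enabled request over the OpenAI-compatible endpoint might look like this (a sketch; host, port and prompt content are assumptions, the model name is the one from the vLLM setup above):

```bash
# Sketch: start the system prompt with the <|think|> token to enable thinking.
# Host/port assume vLLM's default OpenAI-compatible endpoint.
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "google/gemma-4-26B-A4B-it",
    "messages": [
      {"role": "system", "content": "<|think|>You are a helpful assistant."},
      {"role": "user", "content": "Briefly explain prefix caching."}
    ]
  }'
```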
As far as I know Open WebUI and Cherry Studio expect <think> </think> tags. Never seen this <|channel><channel|> variant before.