Gemma 4 Models - which vLLM version? Any PRs spotted?

Digital_David · April 7, 2026, 3:32pm

There are lots of gains to be found with Gemma 4 on DGX Spark. Concurrent throughput is just as important as single-user output for “AI agent” use. A modified version of @dbsci's recipe is the fastest I’ve tested so far:

recipe_version: "1"
model: cyankiwi/gemma-4-26B-A4B-it-AWQ-4bit
runtime: vllm
solo_only: false
cluster_only: false
container: vllm-node-gem-26b-awq
name: Gemma4-26B-A4B-AWQ
description: Gemma4-26B-A4B-INT4-AWQ — AWQ quantized Gemma 4 26B MoE
build_args: []
mods: []
defaults:
  model: cyankiwi/gemma-4-26B-A4B-it-AWQ-4bit
  port: 8000
  host: 0.0.0.0
  tensor_parallel: 1
  pipeline_parallel: 1
  gpu_memory_utilization: 0.8
  max_model_len: 262144
  max_num_batched_tokens: 8192
  kv_cache_dtype: auto
  quantization: compressed-tensors
  tool_call_parser: "gemma4"
  reasoning_parser: "gemma4"
  load_format: "fastsafetensors"

env:
  VLLM_MARLIN_USE_ATOMIC_ADD: "1"

command: |
  vllm serve {model} \
    --quantization {quantization} \
    --max-model-len {max_model_len} \
    --kv-cache-dtype {kv_cache_dtype} \
    --gpu-memory-utilization {gpu_memory_utilization} \
    --host {host} \
    --port {port} \
    --load-format {load_format} \
    --enable-prefix-caching \
    --enable-chunked-prefill \
    --tool-call-parser {tool_call_parser} \
    --enable-auto-tool-choice \
    --reasoning-parser {reasoning_parser} \
    --async-scheduling \
    --max-num-batched-tokens {max_num_batched_tokens} \
    --trust-remote-code \
    -tp {tensor_parallel} \
    -pp {pipeline_parallel}
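
For anyone adapting the recipe: the `{placeholders}` in `command:` appear to be filled from the `defaults:` block. I don't know how the recipe runner does this internally, so the sketch below is only an assumption using plain Python string formatting. `render_command` is a hypothetical helper, not part of vLLM or the recipe tooling, and the template is a shortened copy of the one above.

```python
import shlex

# Values copied from the recipe's `defaults:` block.
defaults = {
    "model": "cyankiwi/gemma-4-26B-A4B-it-AWQ-4bit",
    "port": 8000,
    "host": "0.0.0.0",
    "tensor_parallel": 1,
    "pipeline_parallel": 1,
    "gpu_memory_utilization": 0.8,
    "max_model_len": 262144,
    "max_num_batched_tokens": 8192,
    "kv_cache_dtype": "auto",
    "quantization": "compressed-tensors",
    "tool_call_parser": "gemma4",
    "reasoning_parser": "gemma4",
    "load_format": "fastsafetensors",
}

# Shortened copy of the recipe's `command:` template (not every flag is repeated).
command_template = (
    "vllm serve {model} "
    "--quantization {quantization} "
    "--max-model-len {max_model_len} "
    "--gpu-memory-utilization {gpu_memory_utilization} "
    "--port {port} "
    "-tp {tensor_parallel}"
)


def render_command(template, values, **overrides):
    """Fill {placeholders} from values, letting overrides win (hypothetical helper)."""
    merged = {**values, **overrides}
    # str.format ignores keys the template does not use.
    return shlex.split(template.format(**merged))


# Example: same recipe, but with a smaller context window.
argv = render_command(command_template, defaults, max_model_len=32768)
print(" ".join(argv))
```

Overriding `max_model_len` like this is a quick way to try a smaller context window without editing the recipe file.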

Results:

Comparison: FP8 vs. AWQ V1 vs. AWQ V2

| Configuration | FP8 (Base) | AWQ V1 (Old) | AWQ V2 (New) |
|---|---|---|---|
| 1 Stream (Latency) | 41.53 tokens/s | 35.83 tokens/s | 39.79 tokens/s |
| 8 Streams (Balanced) | 12.10 tokens/s | 12.10 tokens/s | 11.94 tokens/s |
| 32 Streams (Throughput) | 9.57 tokens/s | 15.82 tokens/s | 15.82 tokens/s |
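
The relative differences between the columns can be recomputed from the table. This is just arithmetic on the posted numbers; the dictionary keys and the `pct_change` helper are names I made up for illustration.

```python
# Tokens/s figures copied from the comparison table, keyed by stream count.
results = {
    1: {"fp8": 41.53, "awq_v1": 35.83, "awq_v2": 39.79},
    8: {"fp8": 12.10, "awq_v1": 12.10, "awq_v2": 11.94},
    32: {"fp8": 9.57, "awq_v1": 15.82, "awq_v2": 15.82},
}


def pct_change(new, old):
    """Relative change of `new` versus `old`, in percent."""
    return (new - old) / old * 100.0


for streams, row in results.items():
    vs_v1 = pct_change(row["awq_v2"], row["awq_v1"])
    vs_fp8 = pct_change(row["awq_v2"], row["fp8"])
    print(f"{streams:>2} streams: AWQ V2 vs AWQ V1 {vs_v1:+.1f}%, vs FP8 {vs_fp8:+.1f}%")
```

Running this gives about +11.1% for AWQ V2 over AWQ V1 at 1 stream and about +65.3% for AWQ V2 over FP8 at 32 streams.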

Key Findings:

Single-Stream Improvement: In single-user mode, the new AWQ V2 is 11.1% faster than the old AWQ V1 (39.79 vs. 35.83 tokens/s), though still about 4% behind FP8 (41.53 tokens/s). This suggests that the changes we applied (such as the improved handling of the container lifecycle and configuration) recovered most of the single-stream speed lost to quantization.
The Scaling Floor is Stable: At 8 and 32 streams, performance is consistent with our previous findings. At 8 streams all three configurations are within about 1.3% of each other. At 32 streams the AWQ model delivers a ~65% throughput advantage over FP8 (15.82 vs. 9.57 tokens/s), which is what matters for multi-agent, parallel operations.
Conclusion: The “New” AWQ configuration is the best production choice of the three. It offers near-FP8 single-user responsiveness while keeping the parallel throughput capacity required for the DGX Spark.
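
For anyone who wants to run a similar 1 / 8 / 32 stream comparison, here is a rough sketch. It is not the harness used for the numbers above (the post does not describe one). It assumes the server from the recipe is listening on `localhost:8000` and exposes vLLM's OpenAI-compatible `/v1/completions` endpoint with a `usage.completion_tokens` field in the response. The function names and the prompt are placeholders.

```python
import json
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor

# Assumptions: host/port from the recipe defaults, OpenAI-compatible completions route.
BASE_URL = "http://localhost:8000/v1/completions"
MODEL = "cyankiwi/gemma-4-26B-A4B-it-AWQ-4bit"


def send_request(prompt, max_tokens=256):
    """POST one completion request; return (completion_tokens, elapsed_seconds)."""
    body = json.dumps(
        {"model": MODEL, "prompt": prompt, "max_tokens": max_tokens}
    ).encode()
    req = urllib.request.Request(
        BASE_URL, data=body, headers={"Content-Type": "application/json"}
    )
    start = time.perf_counter()
    with urllib.request.urlopen(req) as resp:
        payload = json.load(resp)
    elapsed = time.perf_counter() - start
    return payload["usage"]["completion_tokens"], elapsed


def run_streams(n_streams, send=send_request, prompt="Explain KV caching briefly."):
    """Run n_streams requests in parallel; report per-stream and aggregate tokens/s."""
    wall_start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=n_streams) as pool:
        results = list(pool.map(send, [prompt] * n_streams))
    wall = time.perf_counter() - wall_start
    per_stream = [tokens / elapsed for tokens, elapsed in results]
    total_tokens = sum(tokens for tokens, _ in results)
    return {
        "mean_per_stream": sum(per_stream) / len(per_stream),
        "aggregate": total_tokens / wall,
    }


# Example (needs a running server):
#   for n in (1, 8, 32):
#       print(n, run_streams(n))
```

The elapsed time here includes prefill, so the per-stream figure is end-to-end tokens/s rather than pure decode speed. The `send` parameter is injectable so the aggregation logic can be checked without a live server.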
