Gemma 4 Models - which vLLM version? Any PRs spotted?

There are lots of gains to be found with Gemma 4 on DGX Spark. Concurrent throughput is just as important as single-user output speed for “AI agent” use. A modified version of @dbsci’s recipe is the fastest I’ve tested so far:

recipe_version: "1"
model: cyankiwi/gemma-4-26B-A4B-it-AWQ-4bit
runtime: vllm
solo_only: false
cluster_only: false
container: vllm-node-gem-26b-awq
name: Gemma4-26B-A4B-AWQ
description: Gemma4-26B-A4B-INT4-AWQ — AWQ quantized Gemma 4 26B MoE
build_args: []
mods: []
defaults:
  model: cyankiwi/gemma-4-26B-A4B-it-AWQ-4bit
  port: 8000
  host: 0.0.0.0
  tensor_parallel: 1
  pipeline_parallel: 1
  gpu_memory_utilization: 0.8
  max_model_len: 262144
  max_num_batched_tokens: 8192
  kv_cache_dtype: auto
  quantization: compressed-tensors
  tool_call_parser: "gemma4"
  reasoning_parser: "gemma4"
  load_format: "fastsafetensors"

env:
  VLLM_MARLIN_USE_ATOMIC_ADD: "1"

command: |
  vllm serve {model} \
    --quantization {quantization} \
    --max-model-len {max_model_len} \
    --kv-cache-dtype {kv_cache_dtype} \
    --gpu-memory-utilization {gpu_memory_utilization} \
    --host {host} \
    --port {port} \
    --load-format {load_format} \
    --enable-prefix-caching \
    --enable-chunked-prefill \
    --tool-call-parser {tool_call_parser} \
    --enable-auto-tool-choice \
    --reasoning-parser {reasoning_parser} \
    --async-scheduling \
    --max-num-batched-tokens {max_num_batched_tokens} \
    --trust-remote-code \
    -tp {tensor_parallel} \
    -pp {pipeline_parallel}
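
To see the exact command line the recipe produces, the placeholders can be expanded by hand. The sketch below assumes the recipe runner fills `{name}` fields with plain string substitution from the `defaults:` block; that is a guess about the tooling, not something confirmed here, and `render_command` is a hypothetical helper.

```python
import shlex

# Values copied from the recipe's `defaults:` block above.
defaults = {
    "model": "cyankiwi/gemma-4-26B-A4B-it-AWQ-4bit",
    "port": 8000,
    "host": "0.0.0.0",
    "tensor_parallel": 1,
    "pipeline_parallel": 1,
    "gpu_memory_utilization": 0.8,
    "max_model_len": 262144,
    "max_num_batched_tokens": 8192,
    "kv_cache_dtype": "auto",
    "quantization": "compressed-tensors",
    "tool_call_parser": "gemma4",
    "reasoning_parser": "gemma4",
    "load_format": "fastsafetensors",
}

# The recipe's `command:` template with the shell line continuations removed.
template = (
    "vllm serve {model} "
    "--quantization {quantization} "
    "--max-model-len {max_model_len} "
    "--kv-cache-dtype {kv_cache_dtype} "
    "--gpu-memory-utilization {gpu_memory_utilization} "
    "--host {host} "
    "--port {port} "
    "--load-format {load_format} "
    "--enable-prefix-caching "
    "--enable-chunked-prefill "
    "--tool-call-parser {tool_call_parser} "
    "--enable-auto-tool-choice "
    "--reasoning-parser {reasoning_parser} "
    "--async-scheduling "
    "--max-num-batched-tokens {max_num_batched_tokens} "
    "--trust-remote-code "
    "-tp {tensor_parallel} "
    "-pp {pipeline_parallel}"
)


def render_command(template: str, values: dict) -> list[str]:
    """Hypothetical helper: fill the placeholders and split into argv tokens.

    Assumption: the recipe runner uses str.format-style substitution.
    """
    return shlex.split(template.format(**values))


argv = render_command(template, defaults)
print(" \\\n    ".join(argv))
```

Overriding a single value, for example `render_command(template, {**defaults, "max_model_len": 131072})`, is a quick way to preview a variant before launching the container.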

Results:

Comparison: FP8 vs. AWQ V1 vs. AWQ V2 (all values in tokens/s)

| Configuration | FP8 (Base) | AWQ V1 (Old) | AWQ V2 (New) |
|---|---|---|---|
| 1 Stream (Latency) | 41.53 | 35.83 | 39.79 |
| 8 Streams (Balanced) | 12.10 | 12.10 | 11.94 |
| 32 Streams (Throughput) | 9.57 | 15.82 | 15.82 |
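
The relative differences between the configurations can be checked directly from the figures above. This is a small sketch; the dictionary keys and the `pct_change` helper are names introduced here for illustration, and the numbers are copied from the comparison as posted.

```python
# Tokens/s figures copied from the comparison above, keyed by stream count.
results = {
    1: {"fp8": 41.53, "awq_v1": 35.83, "awq_v2": 39.79},
    8: {"fp8": 12.10, "awq_v1": 12.10, "awq_v2": 11.94},
    32: {"fp8": 9.57, "awq_v1": 15.82, "awq_v2": 15.82},
}


def pct_change(new: float, old: float) -> float:
    """Percentage change of `new` relative to `old`."""
    return (new - old) / old * 100.0


for streams, row in results.items():
    vs_v1 = pct_change(row["awq_v2"], row["awq_v1"])
    vs_fp8 = pct_change(row["awq_v2"], row["fp8"])
    print(f"{streams:>2} streams: AWQ V2 vs V1 {vs_v1:+.1f}%, vs FP8 {vs_fp8:+.1f}%")
```

With these inputs, AWQ V2 comes out about 11.1% ahead of AWQ V1 at 1 stream and about 65.3% ahead of FP8 at 32 streams, while at 8 streams it is about 1.3% behind both.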

Key Findings:

  • Single-Stream Improvement: In single-user mode, the new AWQ V2 is 11.1% faster than the old AWQ V1 (39.79 vs. 35.83 tokens/s), though still about 4% behind FP8 (41.53 tokens/s). This suggests that the changes we made to the configuration and container lifecycle handling recovered some of the latency lost to quantization.

  • The Scaling Floor is Stable: At 8 and 32 streams, performance remains consistent with our previous findings. At 8 streams all three configurations are within about 1.3% of each other (11.94 to 12.10 tokens/s). At 32 streams the AWQ model provides a ~65% throughput advantage over the FP8 model (15.82 vs. 9.57 tokens/s), which is where multi-agent, parallel operation benefits.

  • Conclusion: The “New” AWQ configuration is the optimal production choice. It provides the best balance of single-user responsiveness while maintaining the massive parallel throughput capacity required for the DGX