How to run Gemma-4-NVFP4 in vLLM Docker?

I vibe coded a simple Chainlit chat + image + video UI in Pinokio to test Gemma-4, I figured everyone post stats I want to show you how it runs, the first section of the video is regular speed the second pass i sped it up because as you can tell its not the fastest, I think I have room to tweak it a bit and get better results maybe idk.. but I hope this helps, oh FYI it’s using 122-124GB out of the 128GB but at least it;s not crashing!

Music: ACE-Step via Pinokio

Running Gemma-4-31B-IT-NVFP4 on DGX Spark with vLLM 0.18.2rc1

  • Model: nvidia/Gemma-4-31B-IT-NVFP4 (NVFP4 via ModelOpt)

  • Quantization: modelopt

  • Max model length: 65536

  • GPU memory utilization: 0.85

  • KV cache dtype: fp8

  • Enabled: chunked prefill, prefix caching, reasoning_parser=gemma4, tool_call_parser=gemma4, auto tool choice

  • Max num seqs: 4

  • Max num batched tokens: 8192

  • Attention backend: TRITON_ATTN (forced by heterogeneous heads)

  • Host: 127.0.0.1:8000

Using for a desktop multimodal agent with screen, PDF, and video understanding.

Benchmark - first run: 2.5 tok/s · 341 out · 546 in · TTFT 2.23s · 137.3s total