How to run Gemma-4-NVFP4 in vLLM Docker?

TheAwakenOne · April 4, 2026, 4:05am

I vibe-coded a simple Chainlit chat + image + video UI in Pinokio to test Gemma-4. Since everyone is posting stats, I figured I'd show you how it runs. The first section of the video is at regular speed; I sped up the second pass because, as you can tell, it's not the fastest. I think I have room to tweak it a bit and maybe get better results, but I hope this helps. FYI, it's using 122-124GB out of the 128GB, but at least it's not crashing!

Music: ACE-Step via Pinokio

Running Gemma-4-31B-IT-NVFP4 on DGX Spark with vLLM 0.18.2rc1

Model: nvidia/Gemma-4-31B-IT-NVFP4 (NVFP4 via ModelOpt)
Quantization: modelopt
Max model length: 65536
GPU memory utilization: 0.85
KV cache dtype: fp8
Enabled: chunked prefill, prefix caching, reasoning_parser=gemma4, tool_call_parser=gemma4, auto tool choice
Max num seqs: 4
Max num batched tokens: 8192
Attention backend: TRITON_ATTN (forced by heterogeneous heads)
Host: 127.0.0.1:8000
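
For anyone who wants to reproduce this, here is a rough sketch of how the settings above could map onto a `vllm serve` command line. The flag names are the standard vLLM CLI options as I understand them, and the `gemma4` parser names are taken from the list above. The helper name `build_vllm_serve_args` is made up for illustration, and this is not necessarily the exact command used for the run in this post. The attention backend is left out because vLLM selected it on its own here.

```python
import shlex


def build_vllm_serve_args(model: str = "nvidia/Gemma-4-31B-IT-NVFP4") -> list[str]:
    """Return an argv list mirroring the settings listed in the post.

    Hypothetical helper: it only assembles arguments, it does not start vLLM.
    """
    return [
        "vllm", "serve", model,
        "--quantization", "modelopt",
        "--max-model-len", "65536",
        "--gpu-memory-utilization", "0.85",
        "--kv-cache-dtype", "fp8",
        "--enable-chunked-prefill",
        "--enable-prefix-caching",
        "--reasoning-parser", "gemma4",
        "--tool-call-parser", "gemma4",
        "--enable-auto-tool-choice",
        "--max-num-seqs", "4",
        "--max-num-batched-tokens", "8192",
        "--host", "127.0.0.1",
        "--port", "8000",
    ]


if __name__ == "__main__":
    # Print a copy-pastable shell command.
    print(shlex.join(build_vllm_serve_args()))
```

In a Docker image whose entrypoint is already the vLLM server, everything after `vllm serve` would typically go after the image name instead. The image name is not given in this post, so it is left out here.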

I'm using it for a desktop multimodal agent with screen, PDF, and video understanding.

Benchmark - first run: 2.5 tok/s · 341 out · 546 in · TTFT 2.23s · 137.3s total
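
The tok/s figure can be sanity-checked from the other numbers in that line. This is just arithmetic on the reported values. Whether the UI divides by total time or by time after the first token is an assumption on my part, so the sketch computes both.

```python
def tokens_per_second(
    output_tokens: int, total_seconds: float, ttft_seconds: float = 0.0
) -> float:
    """Output tokens divided by generation time, optionally excluding TTFT."""
    generation_seconds = total_seconds - ttft_seconds
    if generation_seconds <= 0:
        raise ValueError("total time must exceed time to first token")
    return output_tokens / generation_seconds


# Values reported above: 341 output tokens, 137.3 s total, TTFT 2.23 s.
end_to_end = tokens_per_second(341, 137.3)
decode_only = tokens_per_second(341, 137.3, ttft_seconds=2.23)

print(f"end-to-end:  {end_to_end:.2f} tok/s")
print(f"decode only: {decode_only:.2f} tok/s")
```

Both values round to the reported 2.5 tok/s, so the stats line is internally consistent either way.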
