I vibe coded a simple Chainlit chat + image + video UI in Pinokio to test Gemma-4, I figured everyone post stats I want to show you how it runs, the first section of the video is regular speed the second pass i sped it up because as you can tell its not the fastest, I think I have room to tweak it a bit and get better results maybe idk.. but I hope this helps, oh FYI it’s using 122-124GB out of the 128GB but at least it;s not crashing!
Music: ACE-Step via Pinokio
Running Gemma-4-31B-IT-NVFP4 on DGX Spark with vLLM 0.18.2rc1
-
Model: nvidia/Gemma-4-31B-IT-NVFP4 (NVFP4 via ModelOpt)
-
Quantization: modelopt
-
Max model length: 65536
-
GPU memory utilization: 0.85
-
KV cache dtype: fp8
-
Enabled: chunked prefill, prefix caching, reasoning_parser=gemma4, tool_call_parser=gemma4, auto tool choice
-
Max num seqs: 4
-
Max num batched tokens: 8192
-
Attention backend: TRITON_ATTN (forced by heterogeneous heads)
-
Host: 127.0.0.1:8000
Using for a desktop multimodal agent with screen, PDF, and video understanding.
Benchmark - first run: 2.5 tok/s · 341 out · 546 in · TTFT 2.23s · 137.3s total