## Summary
Repeated image inference with Gemma 4 31B + mmproj on llama-server causes two separate memory leaks:
1. **Host memory (RSS)**: ~46 MiB/request — **root-caused to cuBLAS internal workspace caching, fixed by periodic handle reset**
2. **VRAM (device memory)**: ~6.4 MiB/request — **unresolved, persists across all tracked CUDA allocation APIs**
Both leaks are **Gemma 4 + image specific**. Qwen3.5-27B with identical VLM setup shows zero leak on both metrics.
---
## Environment
- **GPU**: NVIDIA GeForce RTX 5090 (32 GB VRAM)
- **OS**: Windows 11 + WSL2 (Linux 6.6.87.2-microsoft-standard-WSL2)
- **CUDA**: 12.8
- **Driver**: 595.71
- **llama.cpp**: turboquant fork (TheTom/llama-cpp-turboquant, based on ggml-org HEAD ~2026-04-07)
- **Also tested**: standard ggml-org/llama.cpp (same VRAM leak confirmed)
- **Model**: `unsloth/gemma-4-31B-it-GGUF:UD-Q4_K_XL` (17 GB)
- **mmproj**: extracted from `Qwen/Qwen3.5-27B` and `google/gemma-4-31b-it` using `convert_hf_to_gguf.py --mmproj`
## Reproduction
```bash
llama-server \
-m gemma-4-31B-it-UD-Q4_K_XL.gguf \
--mmproj gemma-4-31b-it-mmproj.gguf \
-c 4096 -ngl 999 --flash-attn on \
--cache-type-k turbo3 --cache-type-v turbo3 \
--ctx-checkpoints 0 --cache-ram 0
# Send 30 identical image requests via /v1/chat/completions with max_tokens=16
# Monitor: nvidia-smi memory.used + /proc/<pid>/status VmRSS
```
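The RSS half of the monitoring can be sketched as a small helper that parses the `VmRSS` field out of `/proc/<pid>/status` (shown here against the calling process's own entry; the VRAM half comes from `nvidia-smi`). This is an illustrative sketch, not the actual test script from this report:

```cpp
#include <fstream>
#include <sstream>
#include <string>

// Parse resident set size (kB) from a /proc/<pid>/status file.
// Returns -1 if the file or the VmRSS field is unavailable.
long read_vmrss_kb(const std::string & status_path = "/proc/self/status") {
    std::ifstream f(status_path);
    std::string line;
    while (std::getline(f, line)) {
        if (line.rfind("VmRSS:", 0) == 0) {   // line looks like "VmRSS:   123456 kB"
            std::istringstream iss(line.substr(6));
            long kb = -1;
            iss >> kb;
            return kb;
        }
    }
    return -1;
}
```

Sampling this once per request and diffing consecutive readings gives the per-request RSS deltas quoted below.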
---
## Controlled Experiments
### Leak is Gemma 4 + Image Specific
| Test | 50 requests run | VRAM delta | RSS delta | Stabilizes? |
|------|-----------------|------------|-----------|-------------|
| **Gemma 4 + mmproj + images** | ✅ | **+322 MiB (~6.4 MiB/req)** | **+925 MiB (~46 MiB/req)** | ❌ Never |
| Gemma 4 text-only (no mmproj) | ✅ | +19 MiB | 0 | ✅ Yes |
| Qwen3.5-27B + mmproj + images | ✅ | +56 MiB, then stable | 0 | ✅ Yes |
| Gemma 4 + mmproj + max_tokens=1 | ✅ | +82 MiB, then stable | — | ✅ Yes (after 5 req) |
**Key observations:**
- Leak rate is proportional to **decode token count** (more generated tokens = more leak)
- Same image repeated → same leak rate (not image-size dependent)
- Text-only Gemma 4 → no leak (SWA architecture alone is fine)
- Qwen3.5 with identical VLM pipeline → no leak
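The per-request rates in the table (e.g. +322 MiB over 50 requests ≈ 6.4 MiB/req) are just the slope of the memory-usage curve. When individual samples are noisy, a least-squares fit over all samples is more robust than dividing the endpoints; a minimal sketch of that arithmetic:

```cpp
#include <vector>
#include <cstddef>

// Least-squares slope of memory samples taken at request 0, 1, 2, ...
// Input: one reading (MiB) per request. Output: MiB leaked per request.
double leak_rate(const std::vector<double> & mib) {
    const std::size_t n = mib.size();
    double sx = 0, sy = 0, sxx = 0, sxy = 0;
    for (std::size_t i = 0; i < n; ++i) {
        sx  += double(i);
        sy  += mib[i];
        sxx += double(i) * double(i);
        sxy += double(i) * mib[i];
    }
    return (double(n) * sxy - sx * sy) / (double(n) * sxx - sx * sx);
}
```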
### Leak is NOT in vision encoder
Instrumented `clip_image_batch_encode()` with `ggml_backend_dev_memory()` probes at each stage:
```
Request #3 START: 22232 MiB
After warmup check: 22232 MiB (delta +0)
After sched_reset: 22232 MiB (delta +0)
After build_graph: 22232 MiB (delta +0)
After alloc_graph: 22232 MiB (delta +0)
After graph_compute: 22232 MiB (delta +0)
Request #3 END: 22232 MiB (delta +0)
```
Vision encoder is clean from request 3 onwards. Leak occurs **during llama_decode()** when generating text with image tokens in KV cache.
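The probe pattern behind the trace above is: snapshot free device memory before and after each stage and report the delta. A self-contained sketch, where the injected `get_free_mib` callback stands in for `ggml_backend_dev_memory()` (which reports free/total bytes for a device); the callback name is an assumption for illustration:

```cpp
#include <cstdio>
#include <functional>
#include <string>
#include <utility>

struct MemProbe {
    std::function<long()> get_free_mib;  // injected source of "free MiB"
    long last;

    explicit MemProbe(std::function<long()> f)
        : get_free_mib(std::move(f)), last(get_free_mib()) {}

    // Print and return how much free memory was consumed since the
    // previous checkpoint (positive delta = memory consumed).
    long checkpoint(const std::string & stage) {
        long now   = get_free_mib();
        long delta = last - now;
        std::printf("After %s: %ld MiB free (delta %+ld)\n",
                    stage.c_str(), now, delta);
        last = now;
        return delta;
    }
};
```

Placing one `checkpoint()` call after each stage of `clip_image_batch_encode()` produced the all-zero deltas shown above.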
### Graph allocator buffers are constant
Instrumented `process_ubatch()` → `ggml_backend_sched_alloc_graph()`:
```
alloc #1: backend 0 buf_size=522.50 MiB, graph nodes=2582
alloc #2: backend 0 buf_size=522.50 MiB, graph nodes=2582
…
alloc #50: backend 0 buf_size=522.50 MiB, graph nodes=2582
```
Buffer size never changes. `ggml_gallocr_needs_realloc()` returns false after initial warmup.
### ALL tracked CUDA allocations are balanced
**LD_PRELOAD interceptor** tracking `cudaMalloc`/`cudaFree`/`cuMemCreate`/`cuMemMap`/`cuMemUnmap`:
```
cudaMalloc: count=67, live at exit = 0.00 MiB (perfectly balanced)
cudaFree: count=61, difference = 6 permanent buffers (model weights, KV cache)
cuMemCreate: count=9, total=64.00 MiB
cuMemMap: count=9
cuMemUnmap: count=2 (at exit only)
VMM net mapped at exit: 0.00 MiB
```
**No cudaMalloc or cuMemCreate leak.** VRAM growth is from allocations NOT visible to these APIs.
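What the interceptor computes at exit is a simple pairing of allocations with their frees; whatever remains unpaired is "live at exit". A sketch of that bookkeeping over a synthetic event log (in the real tool the events come from wrapping `cudaMalloc`/`cudaFree` and the `cuMemCreate`/`cuMemMap` family via `dlsym(RTLD_NEXT, ...)`, which is omitted here):

```cpp
#include <cstdint>
#include <cstddef>
#include <unordered_map>
#include <vector>

struct Event {
    bool          is_alloc;  // true = allocation, false = free
    std::uint64_t ptr;       // device pointer (key for pairing)
    std::size_t   bytes;     // size (ignored for frees)
};

// Replay the event log and return the bytes still live at exit.
std::size_t live_at_exit(const std::vector<Event> & log) {
    std::unordered_map<std::uint64_t, std::size_t> live;
    for (const Event & e : log) {
        if (e.is_alloc) live[e.ptr] = e.bytes;
        else            live.erase(e.ptr);
    }
    std::size_t total = 0;
    for (const auto & kv : live) total += kv.second;
    return total;
}
```

For the run above this sum is 0 MiB for `cudaMalloc` (minus the 6 permanent buffers) and 0 MiB net for the VMM APIs, which is exactly why the growth must come from allocations these APIs never see.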
---
## Root Cause Analysis
### Host Memory Leak (FIXED)
**Cause**: cuBLAS internal host-side workspace caching. cuBLAS allocates host memory for each unique matmul configuration. Gemma 4’s hybrid attention (SWA head_dim=256 + full head_dim=512) combined with image prefill batches (300+ tokens) vs decode batches (1 token) creates many unique configurations. The workspace is only freed on `cublasDestroy()`.
**Fix**: Periodic `cublasDestroy()` + lazy recreate every 100 graph computes:
```cpp
// in ggml_backend_cuda_graph_compute(), after graph evaluation:
static thread_local int compute_count = 0;
if (++compute_count % 100 == 0) {
cuda_ctx->cublas_reset(); // cublasDestroy + nullptr (lazy recreate)
}
```
Result: RSS delta went from **+925 MiB → 0 MiB** over 20 requests.
**Note**: Resetting every compute (instead of every 100) causes **worse** RSS leak from create/destroy overhead.
### VRAM Leak (UNRESOLVED)
After fixing the host memory leak, VRAM still grows at ~6.4 MiB/request.
**What we ruled out:**
- cudaMalloc/cudaFree balance: perfect
- cuMemCreate/cuMemMap: balanced, stops growing after 9 events
- Graph allocator buffers: constant (522.50 MiB every alloc)
- cuBLAS device workspace (cublasSetWorkspace): no effect
- CUDA memory pool (cudaMemPoolAttrReleaseThreshold=0): no effect
- CUDA_LAUNCH_BLOCKING=1: made it worse (~16.3 MiB/req)
- turbo3 KV cache: same leak without it (f16 KV = ~7.5 MiB/req)
- Context checkpoints (`--ctx-checkpoints 0`): no effect
- Prompt cache (`--cache-ram 0`): no effect
- Legacy CUDA pool (ggml_cuda_pool_leg vs VMM): same leak
- VMM pool shrink (cuMemUnmap in free): crashes
**What remains**: VRAM growth is not from any tracked allocation API (`cudaMalloc`, `cuMemCreate`, `cublasSetWorkspace`, `cudaMemPool`). It appears to be CUDA runtime/driver internal state (kernel launch metadata, context bookkeeping) that accumulates when processing varying compute patterns specific to Gemma 4 + image tokens.
---
## Cross-Framework Validation
| Framework | Gemma 4 Vision Memory Issue | Status |
|-----------|---------------------------|--------|
| **llama.cpp** | VRAM grows ~6.4 MiB/req | This report |
| **vLLM** | [vllm-project/vllm#28230](https://github.com/vllm-project/vllm/issues/28230): Qwen3-VL VRAM grows | OPEN |
| **SGLang** | SWAKVPool double-free with dual KV cache | Fixed in fork |
The VRAM leak appears to be a **systemic issue across frameworks** when handling Gemma 4’s hybrid SWA architecture with vision inputs.
---
## Suggested Investigation Directions
1. **CUDA runtime internal allocation tracking**: `cudaMemGetInfo` reports decreasing free memory, but no tracked allocation API accounts for it. NVIDIA’s `compute-sanitizer` or internal tooling may reveal which CUDA runtime subsystem is accumulating device memory.
2. **Gemma 4 Shared KV Cache interaction**: Gemma 4 has `num_kv_shared_layers=20` where later layers reuse K/V from earlier layers. This shared reference pattern during sequence processing (especially with varying-length image token sequences) may cause the KV cache management to retain more state than expected.
3. **Flash attention kernel workspace**: The leak occurs during `llama_decode()` which uses `ggml_flash_attn_ext` (not cuBLAS) for attention. The flash attention kernel may allocate per-launch device memory that isn’t properly reclaimed between varying batch sizes.
4. **Profile with `nsys`/`ncu`**: Per-kernel device memory delta analysis would pinpoint exactly which kernel launch causes the VRAM increase.
---
## Files & Tools
- **LD_PRELOAD CUDA tracker**: intercepts `cudaMalloc`/`cudaFree`/`cuMemCreate`/`cuMemMap`/`cuMemUnmap`
- **LD_PRELOAD malloc tracker**: intercepts `malloc`/`free`/`realloc` with caller address logging
- **RSS tracking test script**: measures both nvidia-smi VRAM and `/proc/<pid>/status` VmRSS per request
- **cuBLAS reset patch**: periodic `cublasDestroy()`/recreate (fixes host memory leak)
All available on request.