## Summary
Repeated image inference with Gemma 4 31B + mmproj on llama-server causes two separate memory leaks:
1. **Host memory (RSS)**: ~46 MiB/request — **root-caused to cuBLAS internal workspace caching, fixed by periodic handle reset**
2. **VRAM (device memory)**: ~6.4 MiB/request — **unresolved, persists across all tracked CUDA allocation APIs**
Both leaks are **Gemma 4 + image specific**. Qwen3.5-27B with identical VLM setup shows zero leak on both metrics.
---
## Environment
- **GPU**: NVIDIA GeForce RTX 5090 (32 GB VRAM)
- **OS**: Windows 11 + WSL2 (Linux 6.6.87.2-microsoft-standard-WSL2)
- **CUDA**: 12.8
- **Driver**: 595.71
- **llama.cpp**: turboquant fork (TheTom/llama-cpp-turboquant, based on ggml-org HEAD ~2026-04-07)
- **Also tested**: standard ggml-org/llama.cpp (same VRAM leak confirmed)
- **Model**: `unsloth/gemma-4-31B-it-GGUF:UD-Q4_K_XL` (17 GB)
- **mmproj**: extracted from `Qwen/Qwen3.5-27B` and `google/gemma-4-31b-it` using `convert_hf_to_gguf.py --mmproj`
## Reproduction
```bash
llama-server \
-m gemma-4-31B-it-UD-Q4_K_XL.gguf \
--mmproj gemma-4-31b-it-mmproj.gguf \
-c 4096 -ngl 999 --flash-attn on \
--cache-type-k turbo3 --cache-type-v turbo3 \
--ctx-checkpoints 0 --cache-ram 0
# Send 30 identical image requests via /v1/chat/completions with max_tokens=16
# Monitor: nvidia-smi memory.used + /proc/<pid>/status VmRSS
```
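The RSS half of the monitoring can be sketched as a small helper that parses the `VmRSS` field out of `/proc/<pid>/status` (shown here against the calling process's own entry; the VRAM half comes from `nvidia-smi`). This is an illustrative sketch, not the actual test script from this report:

```cpp
#include <fstream>
#include <sstream>
#include <string>

// Parse resident set size (kB) from a /proc/<pid>/status file.
// Returns -1 if the file or the VmRSS field is unavailable.
long read_vmrss_kb(const std::string & status_path = "/proc/self/status") {
    std::ifstream f(status_path);
    std::string line;
    while (std::getline(f, line)) {
        if (line.rfind("VmRSS:", 0) == 0) {   // line looks like "VmRSS:   123456 kB"
            std::istringstream iss(line.substr(6));
            long kb = -1;
            iss >> kb;
            return kb;
        }
    }
    return -1;
}
```

Sampling this once per request and diffing consecutive readings gives the per-request RSS deltas quoted below.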
---
## Controlled Experiments
### Leak is Gemma 4 + Image Specific
| Test | 50 requests run | VRAM delta | RSS delta | Stabilizes? |
|------|-----------------|------------|-----------|-------------|
| **Gemma 4 + mmproj + images** | ✅ | **+322 MiB (~6.4 MiB/req)** | **+925 MiB (~46 MiB/req)** | ❌ Never |
| Gemma 4 text-only (no mmproj) | ✅ | +19 MiB | 0 | ✅ Yes |
| Qwen3.5-27B + mmproj + images | ✅ | +56 MiB, then stable | 0 | ✅ Yes |
| Gemma 4 + mmproj + max_tokens=1 | ✅ | +82 MiB, then stable | — | ✅ Yes (after 5 req) |
**Key observations:**
- Leak rate is proportional to **decode token count** (more generated tokens = more leak)
- Same image repeated → same leak rate (not image-size dependent)
- Text-only Gemma 4 → no leak (SWA architecture alone is fine)
- Qwen3.5 with identical VLM pipeline → no leak
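The per-request rates in the table (e.g. +322 MiB over 50 requests ≈ 6.4 MiB/req) are just the slope of the memory-usage curve. When individual samples are noisy, a least-squares fit over all samples is more robust than dividing the endpoints; a minimal sketch of that arithmetic:

```cpp
#include <vector>
#include <cstddef>

// Least-squares slope of memory samples taken at request 0, 1, 2, ...
// Input: one reading (MiB) per request. Output: MiB leaked per request.
double leak_rate(const std::vector<double> & mib) {
    const std::size_t n = mib.size();
    double sx = 0, sy = 0, sxx = 0, sxy = 0;
    for (std::size_t i = 0; i < n; ++i) {
        sx  += double(i);
        sy  += mib[i];
        sxx += double(i) * double(i);
        sxy += double(i) * mib[i];
    }
    return (double(n) * sxy - sx * sy) / (double(n) * sxx - sx * sx);
}
```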
### Leak is NOT in vision encoder
Instrumented `clip_image_batch_encode()` with `ggml_backend_dev_memory()` probes at each stage:
```
Request #3 START: 22232 MiB
After warmup check: 22232 MiB (delta +0)
After sched_reset: 22232 MiB (delta +0)
After build_graph: 22232 MiB (delta +0)
After alloc_graph: 22232 MiB (delta +0)
After graph_compute: 22232 MiB (delta +0)
Request #3 END: 22232 MiB (delta +0)
```
Vision encoder is clean from request 3 onwards. Leak occurs **during llama_decode()** when generating text with image tokens in KV cache.
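The probe pattern behind the trace above is: snapshot free device memory before and after each stage and report the delta. A self-contained sketch, where the injected `get_free_mib` callback stands in for `ggml_backend_dev_memory()` (which reports free/total bytes for a device); the callback name is an assumption for illustration:

```cpp
#include <cstdio>
#include <functional>
#include <string>
#include <utility>

struct MemProbe {
    std::function<long()> get_free_mib;  // injected source of "free MiB"
    long last;

    explicit MemProbe(std::function<long()> f)
        : get_free_mib(std::move(f)), last(get_free_mib()) {}

    // Print and return how much free memory was consumed since the
    // previous checkpoint (positive delta = memory consumed).
    long checkpoint(const std::string & stage) {
        long now   = get_free_mib();
        long delta = last - now;
        std::printf("After %s: %ld MiB free (delta %+ld)\n",
                    stage.c_str(), now, delta);
        last = now;
        return delta;
    }
};
```

Placing one `checkpoint()` call after each stage of `clip_image_batch_encode()` produced the all-zero deltas shown above.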
### Graph allocator buffers are constant
Instrumented `process_ubatch()` → `ggml_backend_sched_alloc_graph()`:
```
alloc #1: backend 0 buf_size=522.50 MiB, graph nodes=2582
alloc #2: backend 0 buf_size=522.50 MiB, graph nodes=2582
…
alloc #50: backend 0 buf_size=522.50 MiB, graph nodes=2582
```
Buffer size never changes. `ggml_gallocr_needs_realloc()` returns false after initial warmup.
### ALL tracked CUDA allocations are balanced
**LD_PRELOAD interceptor** tracking `cudaMalloc`/`cudaFree`/`cuMemCreate`/`cuMemMap`/`cuMemUnmap`:
```
cudaMalloc: count=67, live at exit = 0.00 MiB (perfectly balanced)
cudaFree: count=61, difference = 6 permanent buffers (model weights, KV cache)
cuMemCreate: count=9, total=64.00 MiB
cuMemMap: count=9
cuMemUnmap: count=2 (at exit only)
VMM net mapped at exit: 0.00 MiB
```
**No cudaMalloc or cuMemCreate leak.** VRAM growth is from allocations NOT visible to these APIs.
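What the interceptor computes at exit is a simple pairing of allocations with their frees; whatever remains unpaired is "live at exit". A sketch of that bookkeeping over a synthetic event log (in the real tool the events come from wrapping `cudaMalloc`/`cudaFree` and the `cuMemCreate`/`cuMemMap` family via `dlsym(RTLD_NEXT, ...)`, which is omitted here):

```cpp
#include <cstdint>
#include <cstddef>
#include <unordered_map>
#include <vector>

struct Event {
    bool          is_alloc;  // true = allocation, false = free
    std::uint64_t ptr;       // device pointer (key for pairing)
    std::size_t   bytes;     // size (ignored for frees)
};

// Replay the event log and return the bytes still live at exit.
std::size_t live_at_exit(const std::vector<Event> & log) {
    std::unordered_map<std::uint64_t, std::size_t> live;
    for (const Event & e : log) {
        if (e.is_alloc) live[e.ptr] = e.bytes;
        else            live.erase(e.ptr);
    }
    std::size_t total = 0;
    for (const auto & kv : live) total += kv.second;
    return total;
}
```

For the run above this sum is 0 MiB for `cudaMalloc` (minus the 6 permanent buffers) and 0 MiB net for the VMM APIs, which is exactly why the growth must come from allocations these APIs never see.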
---
## Root Cause Analysis
### Host Memory Leak (FIXED)
**Cause**: cuBLAS internal host-side workspace caching. cuBLAS allocates host memory for each unique matmul configuration. Gemma 4’s hybrid attention (SWA head_dim=256 + full head_dim=512) combined with image prefill batches (300+ tokens) vs decode batches (1 token) creates many unique configurations. The workspace is only freed on `cublasDestroy()`.
**Fix**: Periodic `cublasDestroy()` + lazy recreate every 100 graph computes:
```cpp
// in ggml_backend_cuda_graph_compute(), after graph evaluation:
static thread_local int compute_count = 0;
if (++compute_count % 100 == 0) {
cuda_ctx->cublas_reset(); // cublasDestroy + nullptr (lazy recreate)
}
```
Result: RSS delta went from **+925 MiB → 0 MiB** over 20 requests.
**Note**: Resetting every compute (instead of every 100) causes **worse** RSS leak from create/destroy overhead.
### VRAM Leak (UNRESOLVED)
After fixing the host memory leak, VRAM still grows at ~6.4 MiB/request.
**What we ruled out:**
- cudaMalloc/cudaFree balance: perfect
- cuMemCreate/cuMemMap: balanced, stops growing after 9 events
- Graph allocator buffers: constant (522.50 MiB every alloc)
- cuBLAS device workspace (cublasSetWorkspace): no effect
- CUDA memory pool (cudaMemPoolAttrReleaseThreshold=0): no effect
- CUDA_LAUNCH_BLOCKING=1: made it worse (~16.3 MiB/req)
- turbo3 KV cache: same leak without it (f16 KV = ~7.5 MiB/req)
- Context checkpoints (`--ctx-checkpoints 0`): no effect
- Prompt cache (`--cache-ram 0`): no effect
- Legacy CUDA pool (ggml_cuda_pool_leg vs VMM): same leak
- VMM pool shrink (cuMemUnmap in free): crashes
**What remains**: VRAM growth is not from any tracked allocation API (`cudaMalloc`, `cuMemCreate`, `cublasSetWorkspace`, `cudaMemPool`). It appears to be CUDA runtime/driver internal state (kernel launch metadata, context bookkeeping) that accumulates when processing varying compute patterns specific to Gemma 4 + image tokens.
---
## Cross-Framework Validation
| Framework | Gemma 4 Vision Memory Issue | Status |
|-----------|---------------------------|--------|
| **llama.cpp** | VRAM grows ~6.4 MiB/req | This report |
| **vLLM** | [vllm-project/vllm#28230](https://github.com/vllm-project/vllm/issues/28230): Qwen3-VL VRAM grows | OPEN |
| **SGLang** | SWAKVPool double-free with dual KV cache | Fixed in fork |
The VRAM leak appears to be a **systemic issue across frameworks** when handling Gemma 4’s hybrid SWA architecture with vision inputs.
---
## Suggested Investigation Directions
1. **CUDA runtime internal allocation tracking**: `cudaMemGetInfo` reports decreasing free memory, but no tracked allocation API accounts for it. NVIDIA’s `compute-sanitizer` or internal tooling may reveal which CUDA runtime subsystem is accumulating device memory.
2. **Gemma 4 Shared KV Cache interaction**: Gemma 4 has `num_kv_shared_layers=20` where later layers reuse K/V from earlier layers. This shared reference pattern during sequence processing (especially with varying-length image token sequences) may cause the KV cache management to retain more state than expected.
3. **Flash attention kernel workspace**: The leak occurs during `llama_decode()` which uses `ggml_flash_attn_ext` (not cuBLAS) for attention. The flash attention kernel may allocate per-launch device memory that isn’t properly reclaimed between varying batch sizes.
4. **Profile with `nsys`/`ncu`**: Per-kernel device memory delta analysis would pinpoint exactly which kernel launch causes the VRAM increase.
---
## Files & Tools
- **LD_PRELOAD CUDA tracker**: intercepts `cudaMalloc`/`cudaFree`/`cuMemCreate`/`cuMemMap`/`cuMemUnmap`
- **LD_PRELOAD malloc tracker**: intercepts `malloc`/`free`/`realloc` with caller address logging
- **RSS tracking test script**: measures both nvidia-smi VRAM and `/proc/<pid>/status` VmRSS per request
- **cuBLAS reset patch**: periodic `cublasDestroy()`/recreate (fixes host memory leak)
All available on request.