Memory Creep on DGX Spark: Where Your 128 GB Actually Goes (And How to Stop It)

# Memory Creep on DGX Spark: Where Your 128 GB Actually Goes (And How to Stop It)

**System:** DGX Spark (GB10, sm_121, 128 GB unified memory)
**Model:** Nemotron-3-Nano-30B-A3B-NVFP4 (19 GB on disk)

> **Note:** These are my findings from poking at my DGX Spark and reading a lot of forum posts. If anything here is wrong or could be better, please correct me — that’s the whole point of sharing. Big thanks to the Spark community (especially eugr’s container work and the Marlin backend discovery) for making this possible.

-–

## The Symptom

You download a 19 GB NVFP4 model. You launch vLLM. You watch memory climb:

```
Time Used GB What’s happening
─────────────────────────────────────────
0s 7 GB Baseline (OS + desktop)
30s 28 GB Model loading…
60s 29 GB Still loading shards…
120s 30 GB Model loaded. Looks reasonable.
121s 117 GB ← What just happened?!
```

That jump from 30 to 117 GB? That’s vLLM allocating **89 GB of KV cache** you’ll never use.

## The 4 Horsemen of Memory Bloat

We tracked every GB. Here’s where your memory goes with default vLLM settings on DGX Spark:

**Default vLLM (117 GB total)**

Component Size % of Memory
Model weights 19 GB 16%
Runtime (Python, CUDA) 3 GB 3%
torch.compile + CUDA graphs 13 GB 11%
**KV Cache (pre-allocated)** **89 GB** **76%**

**Optimized (32 GB total)**

Component Size % of Memory
Model weights 19 GB 59%
Runtime (Python, CUDA) 3 GB 9%
torch.compile + CUDA graphs **0 GB** 0%
KV Cache (minimal) 4 GB 13%
**Free for the rest of your system** **89 GB**

**That’s 85 GB saved — mostly KV cache you were never going to use.**

### Horseman 1: KV Cache Pre-Allocation (89 GB wasted)

vLLM defaults to `gpu_memory_utilization=0.9`. On DGX Spark with 128 GB unified memory, it sees ~100 GB free after model load and fills 90% of it with KV cache blocks. This pre-allocates for **1,247 concurrent 8K-token requests**.

If you’re running single-user inference, you need maybe 1-5 concurrent requests. That’s 89 GB of pre-allocated memory sitting idle.

**Fix:** `–gpu-memory-utilization 0.2` (minimum viable for a 19 GB model)

### Horseman 2: torch.compile + CUDA Graphs (13 GB overhead)

Without `–enforce-eager`, vLLM compiles the model with torch.compile’s Inductor backend and captures CUDA graphs for different batch sizes. On unified memory, these compilation artifacts and graph copies compete directly with your OS and model.

Our test: CUDA graphs added **13 GB** for only **3% faster** inference (51.6 vs 50.0 tok/s). Not worth it for single-user.

**Fix:** `–enforce-eager`

### Horseman 3: Broken FlashInfer CUTLASS Kernels (7 GB overhead)

SM121 lacks `tcgen05` instructions. The FlashInfer CUTLASS FP4 kernels fail:

```
[TensorRT-LLM][ERROR] Failed to initialize cutlass TMA WS grouped gemm
```

The autotuner skips broken tactics and falls back, but the fallback path uses more memory (39 GB vs 32 GB in our tests) and runs 16% slower.

**Fix:** Force Marlin backend (see [Post 2: The Marlin Fix](#))

### Horseman 4: FlashInfer JIT Compilation Spike (20-30 GB transient)

The NVIDIA vLLM container ships sm_120 precompiled FlashInfer kernels, NOT sm_121. At startup, FlashInfer JIT-compiles 6+ CUTLASS MoE GEMM kernels. Each `cicc` compiler process uses 1.5-6 GB RAM. Six in parallel = 20-30 GB spike on unified memory.

This is the **memory creep** that makes monitoring look like a sawtooth wave. Memory climbs as compilers spawn, partially drops as they finish, climbs again for the next kernel.

**Fix:** Use eugr’s prebuilt sm_121 wheels (`vllm-node:latest` container) — zero JIT compilation.

## Memory Timeline: Before vs After

### Before (NVIDIA container, default settings)

Time Memory What’s happening
0s 7 GB Baseline
30s 28 GB Model shards loading
60s 40 GB JIT compilers spawning (cicc processes)
90s 55 GB More JIT kernels compiling in parallel
120s 30 GB JIT finished, model loaded
121s **117 GB** KV cache pre-allocated (the big jump)
180s+ 117-120 GB Stable but bloated

Memory kept **climbing unpredictably** during the JIT phase — sawtooth pattern as compilers spawn and finish.

### After (eugr container, Marlin + enforce_eager + 0.2 util)

Time Memory What’s happening
0s 6 GB Baseline (after cache flush)
30s 28 GB Model shards loading
60s 29 GB Still loading (no JIT!)
120s 30 GB Model loaded
180s **32 GB** Server ready, KV cache allocated (4 GB)
180s+ **32 GB** Flat. Stable. No creep.

Memory during inference: **32.1 → 32.8 GB** (+0.7 GB for active KV). That’s it. No surprises.

## DGX Spark-Specific Memory Tips

1. **`nvidia-smi` doesn’t report memory on GB10** — use `/proc/meminfo` instead:
```bash
watch -n1 “awk ‘/MemTotal/{t=\$2}/MemAvailable/{a=\$2}END{printf \“Used: %.1f GB / %.1f GB\\n\”,(t-a)/1048576,t/1048576}’ /proc/meminfo”
```

2. **Flush buffer caches** before launching inference:
```bash
sudo sh -c ‘sync; echo 3 > /proc/sys/vm/drop_caches’
```
Unified memory means Linux buffer cache competes with GPU memory.

3. **Disable the desktop** for headless operation (saves 2-3 GB):
```bash
sudo systemctl set-default multi-user.target
```

4. **System tuning** for unified memory:
```bash
sudo sysctl vm.swappiness=1
sudo sysctl vm.dirty_bytes=268435456
```

5. **fastsafetensors warning**: Don’t use `–load-format fastsafetensors` with `gpu_memory_utilization > 0.76` on unified memory — risk of system freeze.

## Quick Reference: The Fix

```bash

3 env vars + 4 flags = 32 GB instead of 120 GB

docker run --runtime=nvidia \
-e VLLM_USE_FLASHINFER_MOE_FP4=0 \
-e VLLM_NVFP4_GEMM_BACKEND=marlin \
-e VLLM_TEST_FORCE_FP8_MARLIN=1 \
vllm-node:latest \
python3 -m vllm.entrypoints.openai.api_server \
–model nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 \
–enforce-eager \
–gpu-memory-utilization 0.2 \
–max-model-len 8192 \
–kv-cache-dtype fp8 \
–trust-remote-code
```

-–

**Up next:** Part 2 — The Marlin Fix (why NVFP4 is silently broken on SM121 and the 3 env vars that fix it). Stay tuned.

*Tested March 26, 2026 — DGX Spark GB10, CUDA 13.2, Driver 580.142, vLLM 0.18.1rc1 (eugr build)*

## Memory Breakdown

```
┌─────────────────────────────────────────────────────────┐
│ DEFAULT vLLM (117 GB) │
├─────────────────────────────────────────────────────────┤
│ ████████████████████ Model: 19 GB │
│ ███ Runtime: 3 GB │
│ █████████████ torch.compile: 13 GB │
│ ████████████████████████████████████████████████████████ │
│ ████████████████████████████████████████████████████████ │
│ KV Cache: 89 GB │
└─────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────┐
│ OPTIMIZED (32 GB) │
├─────────────────────────────────────────────────────────┤
│ ████████████████████ Model: 19 GB │
│ ███ Runtime: 3 GB │
│ torch.compile: 0 GB │
│ ████████ KV Cache: 4 GB │
│ Free: 89 GB │
└─────────────────────────────────────────────────────────┘
```

Idea for this forum: Can we establish a policy requiring AI-generated content to be clearly marked? Then users can save themselves the time of clicking on such content.

Yes, I used AI to help structure the write-up. The data is real from a DGX Spark to my measurements, my /proc/meminfo logs. 117 GB → 32 GB isn’t generated, it’s observed.

Happy to discuss the findings if anything looks off.