GH200: GPU HBM not released after application restart (page cache accumulation via unified memory?)

Environment:

  • Hardware: NVIDIA GH200 Grace Hopper Superchip (ARM64 / SBSA)
  • Driver: 580.159.03
  • OS: Ubuntu (ARM64)
  • Workload: DeepStream / GStreamer-based video analytics pipeline
  • Note: No official DeepStream .deb package exists for aarch64 SBSA, so we extracted DeepStream from the official NVIDIA container image. We are unsure whether this non-standard installation method contributes to the memory issue.

Symptom:
During stress testing, we observed GPU memory usage climbing abnormally. In nvidia-smi, the total memory in use exceeds the sum of the per-process usage, which indicates that memory is being held outside any tracked process.
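For anyone who wants to reproduce the comparison, here is a minimal sketch of how the gap can be quantified. It assumes nvidia-smi is on the PATH and that the relevant processes show up as compute apps; the helper name sum_mib is ours, not part of any NVIDIA tool. On a multi-GPU host it sums across all GPUs.

```shell
#!/usr/bin/env bash
# Sum a column of MiB values (one per line) read from stdin.
# Non-numeric entries such as "[N/A]" count as 0.
sum_mib() {
    awk '{ s += $1 } END { printf "%d\n", s }'
}

if command -v nvidia-smi >/dev/null 2>&1; then
    # Device-level usage as reported by the driver.
    total=$(nvidia-smi --query-gpu=memory.used --format=csv,noheader,nounits | sum_mib)
    # Usage attributed to individual compute processes.
    procs=$(nvidia-smi --query-compute-apps=used_memory --format=csv,noheader,nounits | sum_mib)
    echo "device used: ${total} MiB, per-process sum: ${procs} MiB, unattributed: $((total - procs)) MiB"
fi
```

A large and growing "unattributed" value across restart cycles would correspond to the behavior described here.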

Further investigation showed that after restarting our application, DCGM_FI_DEV_FB_USED drops only partially. Across multiple restart cycles, the baseline gradually increases even when the channel count remains stable. A full system reboot is required to restore GPU memory to baseline.

This issue is not observed on machines with NVIDIA RTX 5070 (AMD64 / PCIe) running the same software version under comparable load.

Current Hypothesis:

We believe the root cause is an interaction between the Linux page cache and GH200's coherent unified memory architecture (NVLink-C2C):

  • Continuous video I/O from the GStreamer pipeline fills the Linux Page Cache.
  • Because GH200 exposes GPU HBM to the Linux kernel as a system-memory NUMA node over the coherent NVLink-C2C link, some of those cached pages can end up resident in GPU HBM.
  • DCGM_FI_DEV_FB_USED and host cached memory grow in tandem.
  • Running sync && echo 3 > /proc/sys/vm/drop_caches (as root) immediately releases both, which supports this hypothesis.
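One way to test the hypothesis above more directly is to look at page-cache usage per NUMA node, since the kernel reports FilePages separately for each node in sysfs. The sketch below is only a diagnostic aid: the function name node_filepages is ours, and which node ID corresponds to GPU HBM on a given system is an assumption that should be checked with numactl -H.

```shell
#!/usr/bin/env bash
# Print page-cache (FilePages) usage per NUMA node, in MiB.
# The optional first argument overrides the sysfs root (useful for testing).
node_filepages() {
    local root="${1:-/sys/devices/system/node}"
    local f
    for f in "$root"/node*/meminfo; do
        [ -r "$f" ] || continue
        # Lines look like: "Node 1 FilePages:      123456 kB"
        awk '$3 == "FilePages:" { printf "node%s FilePages %d MiB\n", $2, $4 / 1024 }' "$f"
    done
}

node_filepages
```

If the node backing GPU HBM shows FilePages growing during the stress test and dropping to near zero after drop_caches, that would be consistent with page cache landing in HBM. Running it before and after each application restart should also show whether the rising baseline tracks that node's FilePages.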

Primary suspects: GStreamer (higher suspicion) and Triton Inference Server.

On PCIe-attached GPUs (RTX 5070), GPU VRAM is not visible to the kernel's page allocator, so page cache cannot land in it and the same workload shows no abnormal accumulation.


Questions:

  1. Is this behavior expected on GH200 unified memory platforms running heavy video I/O workloads?
  2. Could extracting DeepStream from a container (due to no official .deb for aarch64 SBSA) affect CUDA/UVM memory management behavior?
  3. Is there a driver-level or CUDA-level knob (e.g., cudaMemAdvise, UVM eviction hints) to prevent page cache pages from being placed in GPU HBM?
  4. Are there newer driver releases (post 580.159.03) that address Page Cache residency behavior on GH200?

Workaround currently in use:
sync && echo 3 > /proc/sys/vm/drop_caches

Any guidance on a more permanent solution would be greatly appreciated.