GH200 memory not clearing

I am trying to run some ML models on a GH200 GPU using CuPy, cuML, and cuDF in Python. However, the GPU memory is almost completely filled, even though no processes are running. Since my program works with a lot of GPU memory, I suspect it allocates memory but never releases it. I have already used del and gc.collect() in my code, but that doesn't seem to be working.
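
Roughly, the cleanup pattern I am using looks like this (a simplified sketch rather than my real pipeline; the data and sizes below are just placeholders):

import gc

import cudf
import cupy as cp

# Placeholder for the real workload: build a large cuDF DataFrame
# from a CuPy array and run some processing on it.
df = cudf.DataFrame({"x": cp.random.rand(10_000_000)})
result = df.describe()

# Cleanup after each chunk of work: drop the references and force
# Python's garbage collector to run.
del df
gc.collect()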

Additionally, the memory now stays almost completely full even when nvidia-smi shows no processes running.

Because of this, my program crashes almost immediately after it starts (see the log below):

2024-06-05 20:26:39,190 - ERROR - processing of <FILE_NAME> with exception std::bad_alloc: out_of_memory: CUDA error at: /home/<DIR>/rapids-24.02/include/rmm/mr/device/cuda_memory_resource.hpp

This is a common issue. It results from improper behavior or improper termination of a previous application; it's not really unique or specific to the GH200.

The suggestion that “always” works is to reboot the system.

Other suggestions that may work: try terminating any leftover pieces of previous activity, i.e. killing whatever processes still hold GPU resources.
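
If you want to check whether any leftover processes are still holding GPU memory before resorting to a reboot, a small script along the following lines can help. This is only a sketch and assumes the nvidia-ml-py bindings (imported as pynvml) are installed; any stale PIDs it reports can then be killed manually.

import pynvml

# List the compute processes currently holding memory on GPU 0 so
# that any stale ones can be killed afterwards (e.g. kill <pid>).
pynvml.nvmlInit()
try:
    handle = pynvml.nvmlDeviceGetHandleByIndex(0)
    procs = pynvml.nvmlDeviceGetComputeRunningProcesses(handle)
    if not procs:
        print("No compute processes are holding memory on GPU 0.")
    for p in procs:
        mem_mib = (p.usedGpuMemory or 0) / (1024 ** 2)
        print(f"PID {p.pid} is using about {mem_mib:.0f} MiB")
finally:
    pynvml.nvmlShutdown()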

With a bit of searching you can find other questions like this, along with their associated suggestions. Here is one; there are others.

Just an idea: would the MIG (Multi-Instance GPU) feature, with a single instance, be a possible way to better guarantee that resources get freed?

It's probably worth a try, though I haven't tried it myself. Not all GPUs have MIG capability in the general case, but the Hopper GPU(s) in view here do support it.
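
If you do experiment with MIG, the rough flow is to enable MIG mode on the GPU and then create a single GPU instance from one of the available profiles. Below is a minimal sketch, driven from Python since that is what you are already using; the nvidia-smi MIG commands themselves are standard, but enabling MIG generally needs administrator privileges and an otherwise idle GPU, and on some systems a GPU reset or reboot. Profile names and IDs differ between GPUs, so pick one from the listing before creating an instance with nvidia-smi mig -cgi <profile> -C.

import subprocess

def run(cmd):
    # Print and run an nvidia-smi command, stopping if it fails.
    print("$", " ".join(cmd))
    subprocess.run(cmd, check=True)

run(["nvidia-smi", "-i", "0", "-mig", "1"])  # enable MIG mode on GPU 0
run(["nvidia-smi", "mig", "-lgip"])          # list the available GPU instance profiles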