I am trying to run some ML models on a GH200 GPU using CuPy, cuML, and cuDF in Python. However, the GPU memory is almost completely full. Since my program works with a lot of data, I suspect it allocates memory but then never releases it. I have already used del and gc.collect() in my code, but that doesn't seem to help.
Additionally, the memory now stays almost completely full even when no processes are running. This is the output I get from nvidia-smi:
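For reference, here is a stripped-down sketch of what I mean (the array is just a stand-in for my real workload; the explicit CuPy pool calls are something I have seen recommended and include for completeness):

```python
import gc
import cupy as cp

x = cp.zeros((1_000_000_000,), dtype=cp.float32)  # ~4 GB on the device, stand-in for real data
del x
gc.collect()  # drops the Python reference, but CuPy keeps the blocks cached in its pool

# Hand the pooled memory back to the driver so nvidia-smi reflects the release.
# (cuDF allocations go through RMM, which manages its own pool separately.)
cp.get_default_memory_pool().free_all_blocks()
cp.get_default_pinned_memory_pool().free_all_blocks()
```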
This is a common issue that results from improper application behavior, or improper application termination, in a previous run. It's not really unique or specific to the GH200.
The suggestion that “always” works is to reboot the system.
Other suggestions that may work are to terminate the remaining pieces of the previous activity (i.e., kill whatever processes need to be killed to release the resources).
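If it helps, here is a rough Python sketch of that kind of cleanup (it assumes nvidia-smi is on PATH and that SIGTERM is acceptable for whatever is still attached to the device):

```python
import csv, io, os, signal, subprocess

# List compute processes that still hold the device, then send them SIGTERM.
out = subprocess.run(
    ["nvidia-smi", "--query-compute-apps=pid,process_name", "--format=csv,noheader"],
    capture_output=True, text=True, check=True,
).stdout

for row in csv.reader(io.StringIO(out)):
    if not row:
        continue
    pid, name = int(row[0].strip()), row[1].strip()
    print(f"terminating {name} (pid {pid})")
    os.kill(pid, signal.SIGTERM)
```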
With a bit of searching you can find other questions like this, along with their associated suggestions. Here is one; there are others.
It's probably worth a try, although I haven't tried it myself. Not all GPUs have MIG capability in the general case, but it's worth trying for the Hopper GPU(s) in view here.
Even though this is an old topic, I'm adding this for reference. What worked for us was either to:
stop all processes that were accessing the GPU (all exporters, dcgmi, nv-persistenced, …); stopping the last one also freed the memory, since persistence mode went off and the memory was cleared as soon as the driver unloaded, or
more interestingly, to clear the caches (no stopping of processes and no reboot was required in this case); see the sketch below.
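By "clear the caches" we mean the standard Linux drop_caches mechanism, presumably because on GH200 the GPU memory is exposed to Linux as a NUMA node and the page cache can spill into it, showing up as used memory in nvidia-smi. A rough Python equivalent, run as root:

```python
import os

# Needs root. Flush dirty pages, then drop the page cache plus
# dentries/inodes -- the Python equivalent of:
#   sync; echo 3 > /proc/sys/vm/drop_caches
os.sync()
with open("/proc/sys/vm/drop_caches", "w") as f:
    f.write("3\n")
```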