GH200 memory not clearing

I am trying to run some ML models on a GH200 GPU using CuPy, cuML, and cuDF in Python. However, the GPU memory is almost completely filled, even though no processes are running. Since my program works with a lot of GPU memory, I suspect it allocates memory but never releases it. I have already used del and gc.collect() in my code, but that doesn't seem to be working.
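
Roughly, the cleanup pattern I am using looks like this (a simplified sketch rather than my real pipeline; the data and sizes below are just placeholders):

import gc

import cudf
import cupy as cp

# Placeholder for the real workload: build a large cuDF DataFrame
# from a CuPy array and run some processing on it.
df = cudf.DataFrame({"x": cp.random.rand(10_000_000)})
result = df.describe()

# Cleanup after each chunk of work: drop the references and force
# Python's garbage collector to run.
del df
gc.collect()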

Additionally, the memory now stays almost completely full even when nvidia-smi shows no processes running.

Because of this, my program crashes almost immediately after it starts (see the log below):

2024-06-05 20:26:39,190 - ERROR - processing of <FILE_NAME> with exception std::bad_alloc: out_of_memory: CUDA error at: /home/<DIR>/rapids-24.02/include/rmm/mr/device/cuda_memory_resource.hpp

This is a common issue. It results from improper behavior or improper termination of a previous application; it's not really unique or specific to the GH200.

The suggestion that “always” works is to reboot the system.

Other suggestions that may work: try terminating any leftover pieces of previous activity, i.e. killing whatever processes still hold GPU resources.
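
If you want to check whether any leftover processes are still holding GPU memory before resorting to a reboot, a small script along the following lines can help. This is only a sketch and assumes the nvidia-ml-py bindings (imported as pynvml) are installed; any stale PIDs it reports can then be killed manually.

import pynvml

# List the compute processes currently holding memory on GPU 0 so
# that any stale ones can be killed afterwards (e.g. kill <pid>).
pynvml.nvmlInit()
try:
    handle = pynvml.nvmlDeviceGetHandleByIndex(0)
    procs = pynvml.nvmlDeviceGetComputeRunningProcesses(handle)
    if not procs:
        print("No compute processes are holding memory on GPU 0.")
    for p in procs:
        mem_mib = (p.usedGpuMemory or 0) / (1024 ** 2)
        print(f"PID {p.pid} is using about {mem_mib:.0f} MiB")
finally:
    pynvml.nvmlShutdown()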

With a bit of searching you can find other questions like this, along with their associated suggestions. Here is one; there are others.

Just an idea: would the MIG (Multi-Instance GPU) feature, with a single instance, be a possible way to better guarantee that resources get freed?

It's probably worth a try, though I haven't tried it myself. Not all GPUs have MIG capability in the general case, but the Hopper GPU(s) in view here do support it.
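
If you do experiment with MIG, the rough flow is to enable MIG mode on the GPU and then create a single GPU instance from one of the available profiles. Below is a minimal sketch, driven from Python since that is what you are already using; the nvidia-smi MIG commands themselves are standard, but enabling MIG generally needs administrator privileges and an otherwise idle GPU, and on some systems a GPU reset or reboot. Profile names and IDs differ between GPUs, so pick one from the listing before creating an instance with nvidia-smi mig -cgi <profile> -C.

import subprocess

def run(cmd):
    # Print and run an nvidia-smi command, stopping if it fails.
    print("$", " ".join(cmd))
    subprocess.run(cmd, check=True)

run(["nvidia-smi", "-i", "0", "-mig", "1"])  # enable MIG mode on GPU 0
run(["nvidia-smi", "mig", "-lgip"])          # list the available GPU instance profiles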