I’m using a server with 8 A100s and running a parameter search for a DL model. With the current library I’m using, I often find that an anonymous process is left behind and eats up all the memory after the main process is killed. The problem is very similar to here.
When checked with `nvidia-smi`, the memory of GPU7 is (almost) fully occupied, and an OOM is raised whenever a new model starts training.
But note that `nvidia-smi --query-compute-apps=pid --format=csv,noheader` does show an unknown process that wasn’t shown in the image above. However, `ps` does not work for that unknown process (the PID does not exist when queried with `ps`).
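For what it’s worth, `--query-compute-apps` can also report which GPU each PID is on via the `gpu_uuid` field, which makes it possible to pick out only the PIDs on GPU7. A minimal sketch of the filtering step (the CSV sample and UUIDs here are hypothetical; on the server you would pipe the real `nvidia-smi` output instead):

```shell
# Hypothetical output of:
#   nvidia-smi --query-compute-apps=pid,gpu_uuid --format=csv,noheader
sample='12345, GPU-aaaa1111
67890, GPU-bbbb2222
13579, GPU-bbbb2222'

# UUID of GPU7 as reported by `nvidia-smi -L` (hypothetical here)
target_uuid='GPU-bbbb2222'

# Print only the PIDs whose gpu_uuid matches the target GPU
printf '%s\n' "$sample" | awk -F', ' -v uuid="$target_uuid" '$2 == uuid {print $1}'
```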
In the post above, `fuser -k /dev/nvidia*` worked very well for me. However, it also killed every process that was using the GPUs, including those that weren’t on GPU7. (I had of course modified the command to `fuser -k /dev/nvidia7`.)
So I’m very stuck right now, because that means I’ll have a GPU I cannot use as long as any other process is running, since I cannot target-kill a single process.
So after some digging, I have a few questions (not sure if they’re related).
- Most importantly, how do I find the culprit and kill it?
- Why is a file descriptor created for all GPUs even if CUDA_VISIBLE_DEVICES restricts the process to only one of the GPUs? Just out of curiosity.
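On the first question, one approach I’d try (a sketch, not something I’ve verified on this exact setup): take only the PIDs that `nvidia-smi --query-compute-apps` reports for GPU7 (`-i 7` restricts the query to one device, and `fuser -v /dev/nvidia7` without `-k` is a safer way to just list holders) and signal those, instead of `fuser -k`-ing the whole device node. Dry-run version with hypothetical PIDs that prints the commands for review rather than executing them:

```shell
# On the server this list would come from:
#   pids=$(nvidia-smi --query-compute-apps=pid --format=csv,noheader -i 7)
pids='12345 67890'   # hypothetical PIDs for illustration

for pid in $pids; do
    # Dry run: print the kill command instead of running it,
    # so the PID list can be sanity-checked first.
    echo "kill -9 $pid"
done
```

Once the printed list looks right, dropping the `echo` performs the targeted kill without touching processes on the other GPUs.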