How do I kill an unknown process that is eating up GPU memory?

I’m using a server with eight A100 GPUs and I’m running a parameter search for a DL model. With the library I’m currently using, I often find that an anonymous process is left behind after a run is killed, and it keeps holding all the GPU memory. The problem is very similar to the one described here.

When I check with nvidia-smi, the memory of GPU7 is (almost) fully occupied, and an OOM error is raised whenever a new model starts training.

Note, however, that nvidia-smi --query-compute-apps=pid --format=csv,noheader does show an unknown process that wasn’t shown in the image above. But ps cannot find that PID (it does not appear in the output of ps aux).
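To make the mismatch between nvidia-smi and ps easier to see, here is a minimal sketch that reads the compute-app list and marks each PID as alive or missing, depending on whether it has an entry under /proc. The function name classify_gpu_pids and its optional proc_root argument are made up for illustration. The extra query fields gpu_uuid and used_memory are added so you can tell which GPU each PID is charged to.

```shell
#!/bin/sh
# Classify each "pid, gpu_uuid, used_memory" CSV line read from stdin as
# "alive" or "missing", based on whether <proc_root>/<pid> exists.
# proc_root defaults to /proc; the argument is a hypothetical hook for testing.
classify_gpu_pids() {
    proc_root="${1:-/proc}"
    while IFS=, read -r pid uuid mem; do
        # Strip the spaces that follow each comma in the CSV output.
        pid=$(printf '%s' "$pid" | tr -d ' ')
        uuid=$(printf '%s' "$uuid" | tr -d ' ')
        mem=$(printf '%s' "$mem" | sed 's/^ *//')
        [ -n "$pid" ] || continue
        if [ -d "$proc_root/$pid" ]; then
            state=alive
        else
            state=missing
        fi
        printf '%s %s %s %s\n' "$pid" "$state" "$uuid" "$mem"
    done
}

# Usage on a machine with the NVIDIA driver installed:
# nvidia-smi --query-compute-apps=pid,gpu_uuid,used_memory --format=csv,noheader | classify_gpu_pids
```

A PID reported as missing matches the situation described above: the driver still attributes memory to it, but the kernel no longer has a process with that ID visible in this /proc.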

In the post above, fuser -k /dev/nvidia* worked very well for me. However, it also killed every process that was using a GPU, including those that weren’t on GPU7. (I had, of course, modified the command to fuser -k /dev/nvidia7.)
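Before killing anything, it may help to list which PIDs actually hold the device node open, similar to running fuser without -k. The sketch below does this by resolving the symlinks under /proc/<pid>/fd. The function name pids_holding_device and its optional proc_root argument are hypothetical. It will usually need root to see other users' processes. Since processes appear to open descriptors on every GPU, the list may be longer than the set of processes that really use GPU7.

```shell
#!/bin/sh
# Print the PIDs that hold an open file descriptor on a device node,
# by resolving the symlinks under <proc_root>/<pid>/fd. Nothing is killed.
# proc_root defaults to /proc; the argument is a hypothetical hook for testing.
pids_holding_device() {
    dev="$1"
    proc_root="${2:-/proc}"
    for fd in "$proc_root"/[0-9]*/fd/*; do
        # Unreadable or vanished entries make readlink fail; skip them.
        target=$(readlink "$fd" 2>/dev/null) || continue
        if [ "$target" = "$dev" ]; then
            pid=${fd#"$proc_root"/}
            printf '%s\n' "${pid%%/*}"
        fi
    done | sort -un
}

# Usage (read-only, run as root to see all processes):
# pids_holding_device /dev/nvidia7
```

Comparing this list with the PIDs of the jobs you want to keep gives a set of candidates to inspect one by one, instead of sending a signal to everything at once.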

So I’m stuck right now, because this means I have a GPU that I cannot use as long as other processes are running, since I cannot kill just the one offending process.

So after some digging, I have a few questions (not sure if they’re related).

  1. Most importantly, how do I find the culprit and kill it?
  2. Why is a file descriptor created for every GPU even when CUDA_VISIBLE_DEVICES restricts the process to only one of the GPUs? Just out of curiosity.

Did you manage to solve your issue? I am having the same issue right now.

Sadly, no. I just killed all of them after waiting for the remaining processes to finish.