I’m using a server with 8 A100s and running a parameter search for a DL model. With the current library I’m using, I often find that an anonymous process is left behind and eats up all the memory after the main process is killed. The problem is very similar to here.
When checked with `nvidia-smi`, the memory of GPU7 is (almost) fully occupied, and an OOM is raised whenever a new model starts training.
But note that `nvidia-smi --query-compute-apps=pid --format=csv,noheader` does show an unknown process that wasn’t shown in the image above. However, `ps` does not work for that unknown process (the PID does not exist when queried with `ps`).
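For what it’s worth, `--query-compute-apps` can also report which GPU each PID is on via the `gpu_uuid` field, which makes it possible to pick out only the PIDs on GPU7. A minimal sketch of the filtering step (the CSV sample and UUIDs here are hypothetical; on the server you would pipe the real `nvidia-smi` output instead):

```shell
# Hypothetical output of:
#   nvidia-smi --query-compute-apps=pid,gpu_uuid --format=csv,noheader
sample='12345, GPU-aaaa1111
67890, GPU-bbbb2222
13579, GPU-bbbb2222'

# UUID of GPU7 as reported by `nvidia-smi -L` (hypothetical here)
target_uuid='GPU-bbbb2222'

# Print only the PIDs whose gpu_uuid matches the target GPU
printf '%s\n' "$sample" | awk -F', ' -v uuid="$target_uuid" '$2 == uuid {print $1}'
```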
In the post above, `fuser -k /dev/nvidia*` worked very well for me. However, it also killed every process that was using the GPUs, including those that weren’t on GPU7. (I had of course modified the command to `fuser -k /dev/nvidia7`.)
So I’m very stuck right now, because that means I’ll have a GPU I cannot use as long as any other process is running, since I cannot target-kill a single process.
So after some digging, I have a few questions (not sure if they’re related).
- Most importantly, how do I find the culprit and kill it?
- Why is a file descriptor created for all GPUs even if CUDA_VISIBLE_DEVICES restricts the process to only one of the GPUs? Just out of curiosity.
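On the first question, one approach I’d try (a sketch, not something I’ve verified on this exact setup): take only the PIDs that `nvidia-smi --query-compute-apps` reports for GPU7 (`-i 7` restricts the query to one device, and `fuser -v /dev/nvidia7` without `-k` is a safer way to just list holders) and signal those, instead of `fuser -k`-ing the whole device node. Dry-run version with hypothetical PIDs that prints the commands for review rather than executing them:

```shell
# On the server this list would come from:
#   pids=$(nvidia-smi --query-compute-apps=pid --format=csv,noheader -i 7)
pids='12345 67890'   # hypothetical PIDs for illustration

for pid in $pids; do
    # Dry run: print the kill command instead of running it,
    # so the PID list can be sanity-checked first.
    echo "kill -9 $pid"
done
```

Once the printed list looks right, dropping the `echo` performs the targeted kill without touching processes on the other GPUs.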