10 GB of GPU RAM used, and no process listed by nvidia-smi

I am currently seeing the behavior shown in the figure below.

It occurs when inference with PyTorch has finished but the GPU memory is still being held. So when I run inference again, I get "CUDA out of memory."

I searched for similar reports, and the closest match is the page below.
(11 GB of GPU RAM used, and no process listed by nvidia-smi)

Here is what I tried to solve the problem.

$ sudo nvidia-smi --gpu-reset -i 0

GPU 00000000:02:02.0 is currently in use by another process.
1 device is currently being used by one or more other processes (e.g., Fabric Manager, CUDA application, graphics application such as an X server, or a monitoring application such as another instance of nvidia-smi). Please first kill all processes using this device and all compute applications running in the system.

for i in $(sudo lsof /dev/nvidia* | grep python | awk '{print $2}' | sort -u); do kill -9 $i; done
→ Nothing changes
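For reference, the text-processing part of that one-liner can be split into a small function so it is easier to check and reuse. This is only a sketch; `pids_from_lsof` is a name I made up, not a standard command.

```shell
# Sketch of the same cleanup pipeline, with the PID extraction split out.
pids_from_lsof() {
  # Read lsof-style output on stdin and print the unique PIDs (column 2)
  # of lines whose command name contains "python".
  grep python | awk '{print $2}' | sort -u
}

# Usage (as root, against the real device nodes):
#   sudo lsof /dev/nvidia* | pids_from_lsof | xargs -r sudo kill -9
# xargs -r skips running kill when no PID is found.
```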

sudo fuser -v /dev/nvidia*
→ Nothing changes

Modify the PyTorch code

  • Apply torch.no_grad()
  • torch.cuda.empty_cache()
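A minimal sketch of how those two changes usually fit together; the model and input here are placeholders, not from my actual code:

```python
import gc
import torch

def run_inference(model, batch):
    # no_grad() stops autograd from caching activations for a backward
    # pass, a common reason memory stays allocated after inference.
    with torch.no_grad():
        return model(batch)

def release_gpu_memory():
    # Drop unreachable Python references first, then return cached blocks
    # to the driver. empty_cache() only frees PyTorch's caching-allocator
    # cache; tensors still referenced in Python keep their memory.
    gc.collect()
    if torch.cuda.is_available():
        torch.cuda.empty_cache()

# Placeholder usage: a tiny CPU model just to show the call pattern.
model = torch.nn.Linear(4, 2).eval()
out = run_inference(model, torch.randn(1, 4))
release_gpu_memory()
```

Note that empty_cache() cannot reclaim memory held by an orphaned or crashed process; that case needs driver-level cleanup or a reboot.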

I tried the four methods above; no process was listed, yet I could not release the occupied GPU memory.

The last resort is to reboot, but the same problem recurs as soon as I run inference again after the reboot.

How can we solve this problem?

Thank you.