I have 4 RTX 3090s that I use for deep learning. When a training program exits after running for a while, the process becomes a zombie for about 10 minutes, during which I can't launch any new GPU processes. While the process is still a zombie, nvidia-smi also hangs for quite a while (I can try to capture where it hangs; see the commands after the ps output below). I'm not sure what the best way to debug this is, but I have noticed that a bunch of errors show up in dmesg after training. Here is the output of that command: dmesg.txt (229.1 KB). I have tried googling some of these error messages but couldn't find anything useful. After the program exits, here is the output confirming the process is a zombie:
$ ps -v 5255
  PID TTY      STAT   TIME  MAJFL   TRS   DRS   RSS %MEM COMMAND
 5255 pts/0    Zl    95:14   3614     0     0     0  0.0 [python3] <defunct>
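For the nvidia-smi hang mentioned above, I could also trace it the next time this happens to see what it blocks on. This is just a sketch of what I have in mind (the trace.txt filename is arbitrary):

$ strace -f -o trace.txt nvidia-smi   # record the syscall trace while it hangs
$ tail -n 20 trace.txt                # the last lines should show the call it is stuck in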
The reason I suspect this is a driver issue is that the process is not killable: running sudo kill -9 5255 does nothing.
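If it would help, I can also capture where the stuck task is blocked in the kernel the next time this happens. Something along these lines is what I would run (assuming the zombie PID is still 5255):

$ ps -T -o spid,stat,wchan -p 5255           # per-thread state; D means uninterruptible sleep
$ sudo cat /proc/5255/task/*/stack           # kernel stacks of the threads, if readable
$ echo 1 | sudo tee /proc/sys/kernel/sysrq   # make sure sysrq is fully enabled
$ echo w | sudo tee /proc/sysrq-trigger      # dump blocked (D-state) tasks to the kernel log
$ dmesg | tail -n 100                        # look at the blocked-task dump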
This issue is present in driver version 460.73.01, but I only upgraded to that today from an earlier 460.x release. In case it is helpful: this is with CUDA 11.1 and cuDNN 8.1.0 (tested with TensorFlow 2.4, 2.5rc2, and tf-nightly) on Ubuntu 18.04.5 LTS (GNU/Linux 5.4.0-72-generic x86_64).
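For reference, here is roughly how I'm checking those versions, in case I should report anything else (the TensorFlow line assumes it is installed in the active Python environment):

$ nvidia-smi --query-gpu=driver_version --format=csv,noheader   # driver version
$ nvcc --version                                                # CUDA toolkit version
$ python3 -c "import tensorflow as tf; print(tf.__version__)"   # TensorFlow version
$ uname -r && lsb_release -d                                    # kernel and Ubuntu release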
Can anyone help me figure out what is going on? It is prohibitively slow to debug and implement/test new things, because every time a program exits the zombie process sticks around for a good 10-15 minutes, and I have to wait for it to go away before I can run the next thing.
Here is the bug report output: nvidia-bug-report.log.gz (5.1 MB)