Driver 460.x Hangs Causing Zombie Process

I have 4 RTX 3090s that I use for deep learning. After training for a while, when the program exits, the process becomes a zombie for about 10 minutes (and while it lingers I can't launch any new GPU processes). While the process is still a zombie, nvidia-smi also hangs for quite a while. I'm not sure of the best way to debug this, but I have noticed that a bunch of errors show up in dmesg after training. Here is the output of that command: dmesg.txt (229.1 KB). I have tried googling some of these error messages but couldn't find anything useful. After the program exits, here is the output confirming the process is a zombie:

$ ps -v 5255
   PID TTY      STAT   TIME  MAJFL   TRS   DRS   RSS %MEM COMMAND
  5255 pts/0    Zl    95:14   3614     0     0     0  0.0 [python3] <defunct>

The reason I suspect this is a driver issue is that the process is not killable: running sudo kill -9 5255 does nothing.
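
In case it helps, here is a rough sketch of a few commands that can show where the leftover threads are stuck in the kernel while this is happening (the PID and the grep patterns below are just illustrative):

# Per-thread state and the kernel function each thread is waiting in
$ ps -eLo pid,tid,stat,wchan:32,comm | grep 5255

# Kernel stack of each thread of the stuck process (needs root)
$ sudo cat /proc/5255/task/*/stack

# Driver messages around the hang (Xid errors, IOMMU/AMD-Vi/DMAR faults)
$ sudo dmesg -T | grep -iE 'nvrm|xid|iommu|amd-vi|dmar'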

This issue is present in driver version 460.73.01, though I only upgraded to it today from a previous 460.x release. I don't know if it is helpful, but this is with CUDA 11.1 and cuDNN 8.1.0 (tested with TensorFlow 2.4, 2.5rc2, and tf-nightly) on Ubuntu 18.04.5 LTS (GNU/Linux 5.4.0-72-generic x86_64).

Can anyone help me figure out what is going on? It is prohibitively slow to debug and implement/test new things, because every time a program exits the zombie process sticks around for a good 10-15 minutes, and I have to wait for it to quit before I can run the next thing.

Here is the bug report output: nvidia-bug-report.log.gz (5.1 MB)

Please try disabling the IOMMU.
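
For example (assuming GRUB is the bootloader; the exact flag depends on whether the platform is AMD or Intel), add the matching kernel parameter and regenerate the boot config, then reboot. Alternatively, AMD-Vi / Intel VT-d can usually be disabled in the BIOS instead.

# /etc/default/grub -- keep whatever options are already there and append the flag
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash amd_iommu=off"        # AMD platforms
# or: GRUB_CMDLINE_LINUX_DEFAULT="quiet splash intel_iommu=off"  # Intel platforms

$ sudo update-grub
$ sudo reboot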

That seems to have done the trick! Thanks!