Impossible to get handle for device number 3: Unknown Error

OS: Ubuntu 18.04
Driver Version: 470.94
GPUs: 4x RTX 2080 Ti

Hi, when I run PyTorch programs, a GPU often fails with the error “Impossible to get handle for device number 3: Unknown Error”. I’m not sure whether this is a GPU memory error or a PCIe slot error.

Here is the detailed bug report:
nvidia-bug-report.log.gz (878.7 KB)

NVRM: Xid (PCI:0000:0a:00): 79, pid=31346, GPU has fallen off the bus.
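To see which GPUs are affected and how often, the Xid lines can be pulled out of the kernel log. This is a minimal sketch, assuming the log lines have the same shape as the one quoted above; `parse_xid_lines` and `count_by_device` are hypothetical helper names, not part of any NVIDIA tool.

```python
import re
from collections import Counter

# Matches kernel log lines shaped like the one quoted in this thread:
#   NVRM: Xid (PCI:0000:0a:00): 79, pid=31346, GPU has fallen off the bus.
# Assumption: other Xid lines follow the same "NVRM: Xid (PCI:<addr>): <code>, <text>" layout.
XID_RE = re.compile(r"NVRM: Xid \(PCI:([0-9a-fA-F:.]+)\): (\d+),\s*(.*)")


def parse_xid_lines(text):
    """Return a list of (pci_address, xid_code, message) tuples found in text."""
    events = []
    for line in text.splitlines():
        match = XID_RE.search(line)
        if match:
            events.append((match.group(1), int(match.group(2)), match.group(3).strip()))
    return events


def count_by_device(events):
    """Count Xid events per PCI address, to see whether one slot stands out."""
    return Counter(pci for pci, _code, _msg in events)
```

Feeding it the text of `dmesg` or the kernel log from the bug report should give one tuple per Xid event. For the line above it returns `[("0000:0a:00", 79, "pid=31346, GPU has fallen off the bus.")]`.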

Since it’s always a different GPU that runs into this, please check for overheating: monitor temperatures and make sure airflow is not blocked.
Also check for insufficient power supply.
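One way to watch temperature and power while a job runs is to poll `nvidia-smi` and log the readings. The sketch below is only an illustration: the helper names are made up, and the 85 C threshold is an arbitrary assumption, not a documented limit for these cards. Note that once a GPU has fallen off the bus, `nvidia-smi` itself may fail, in which case `read_gpus` raises `subprocess.CalledProcessError`.

```python
import subprocess

# Fields requested from nvidia-smi's --query-gpu interface.
QUERY = "index,temperature.gpu,power.draw,power.limit"


def _to_float(value):
    """Convert a CSV field to float; fields like "[N/A]" become None."""
    try:
        return float(value)
    except ValueError:
        return None


def parse_smi_csv(text):
    """Parse `--format=csv,noheader,nounits` output into a list of dicts."""
    rows = []
    for line in text.strip().splitlines():
        fields = [f.strip() for f in line.split(",")]
        if len(fields) != 4:
            continue  # skip lines that do not match the four requested fields
        idx, temp, draw, limit = fields
        rows.append({
            "index": int(idx),
            "temp_c": _to_float(temp),
            "power_w": _to_float(draw),
            "power_limit_w": _to_float(limit),
        })
    return rows


def read_gpus():
    """Run nvidia-smi once and return the parsed readings."""
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return parse_smi_csv(result.stdout)


def hot_gpus(rows, limit_c=85.0):
    """Return indices of GPUs at or above limit_c (assumed threshold)."""
    return [r["index"] for r in rows
            if r["temp_c"] is not None and r["temp_c"] >= limit_c]
```

Calling `hot_gpus(read_gpus())` in a loop with a short sleep while the PyTorch job is running, and comparing the timestamps with the Xid 79 entries, may show whether the failures line up with high temperature or power draw near the limit.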

Thanks, I’ll check the temperature and power supply. Is it also possible that there is a hardware fault, such as a GPU memory error?

There are also a lot of AER (Advanced Error Reporting) messages from the PCI bridge, so there might also be an issue with the mainboard.