GPU Sporadically Falls Off Bus During TensorFlow Training

I have a system with four RTX 2080s that I'm using to train and test TensorFlow models. Recently it has become unstable: every now and then during training, one of the GPUs falls off the bus and is unusable until the system is rebooted. The error in dmesg looks like:

[64190.200239] NVRM: GPU at PCI:0000:67:00: GPU-7c922b92-ce48-5d3a-06eb-ef8a6c91ae74
[64190.200240] NVRM: GPU Board Serial Number:
[64190.200241] NVRM: Xid (PCI:0000:67:00): 79, pid=20833, GPU has fallen off the bus.
[64190.200259] NVRM: GPU 0000:67:00.0: GPU has fallen off the bus.
[64190.200260] NVRM: GPU 0000:67:00.0: GPU is on Board .
[64190.200270] NVRM: A GPU crash dump has been created. If possible, please run
               NVRM: nvidia-bug-report.sh as root to collect this data before
               NVRM: the NVIDIA kernel module is unloaded.

Initially I thought the issue was power- or thermal-related, but I can't reproduce it by running stress tests (e.g. gpu-burn), and the failures don't obviously coincide with periods when the system is under high load.

Bug Report Logs: nvidia-bug-report.log.gz (770.2 KB)

According to the logs, it's always the same GPU at PCI 67:00.0 that falls off the bus. Please try reseating it in its slot, reseating the power connectors, and logging temperatures. You might also swap cards to check whether this is slot/position dependent; otherwise, the GPU itself might be failing.
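
For the temperature logging, polling nvidia-smi from a small script that runs alongside training is usually enough. Below is a minimal Python sketch; the 5-second interval, the output file name, and the choice of query fields are just suggestions, not anything taken from your setup:

#!/usr/bin/env python3
# Minimal GPU telemetry logger: appends one CSV line per GPU per sample.
import subprocess
import time

LOG_PATH = "gpu_telemetry.csv"   # example output file
INTERVAL_SECONDS = 5             # arbitrary polling interval

# Standard nvidia-smi --query-gpu properties.
QUERY = "timestamp,pci.bus_id,temperature.gpu,power.draw,utilization.gpu,clocks.sm"

def sample() -> str:
    """Return one CSV block (one line per GPU) from nvidia-smi."""
    return subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    ).stdout

if __name__ == "__main__":
    with open(LOG_PATH, "a") as log:
        while True:
            try:
                log.write(sample())
            except subprocess.CalledProcessError:
                # nvidia-smi tends to error out once a GPU has fallen off
                # the bus, so record the event and keep going.
                log.write(f"{time.ctime()}, nvidia-smi error\n")
            log.flush()
            time.sleep(INTERVAL_SECONDS)

If the last samples for 67:00.0 before an Xid 79 show unremarkable temperature and power draw, that points away from cooling or power delivery and more toward the slot, riser, or the card itself.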

PCI 67:00.0 definitely seems to be the most common, but I have seen a different GPU fall off on at least one previous occasion:

[532755.876235] NVRM: GPU Board Serial Number:
[532755.876239] NVRM: Xid (PCI:0000:1a:00): 79, pid=2801777, GPU has fallen off the bus.
[532755.876243] NVRM: GPU 0000:1a:00.0: GPU has fallen off the bus.
[532755.876246] NVRM: GPU 0000:1a:00.0: GPU is on Board .
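
For what it's worth, a quick way to confirm which device fails most often is to tally the Xid 79 events per PCI address in the kernel log. Here's a rough Python sketch; it assumes the NVRM line format shown above and reads the log from stdin (e.g. journalctl -k --no-pager | python3 count_xid79.py, where the script name is just an example):

#!/usr/bin/env python3
# Count Xid 79 ("GPU has fallen off the bus") events per PCI address.
import re
import sys
from collections import Counter

# Matches e.g. "NVRM: Xid (PCI:0000:67:00): 79, pid=20833, GPU has fallen off the bus."
XID79 = re.compile(r"NVRM: Xid \(PCI:([0-9a-fA-F:]+)\): 79\b")

counts = Counter()
for line in sys.stdin:
    match = XID79.search(line)
    if match:
        counts[match.group(1)] += 1

for bus_id, n in counts.most_common():
    print(f"{bus_id}: {n} event(s)")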