I have a system with four RTX 2080s that I'm using to train and test TensorFlow models. Recently it has become unstable: every now and then during training, one of the GPUs falls off the bus and becomes unusable until the system is rebooted. The error in dmesg looks like:
[64190.200239] NVRM: GPU at PCI:0000:67:00: GPU-7c922b92-ce48-5d3a-06eb-ef8a6c91ae74
[64190.200240] NVRM: GPU Board Serial Number:
[64190.200241] NVRM: Xid (PCI:0000:67:00): 79, pid=20833, GPU has fallen off the bus.
[64190.200259] NVRM: GPU 0000:67:00.0: GPU has fallen off the bus.
[64190.200260] NVRM: GPU 0000:67:00.0: GPU is on Board .
[64190.200270] NVRM: A GPU crash dump has been created. If possible, please run
NVRM: nvidia-bug-report.sh as root to collect this data before
NVRM: the NVIDIA kernel module is unloaded.
Initially I suspected a power- or thermal-related issue, but I can't reproduce the failure with stress tests (e.g. gpu-burn), and the failure times don't obviously correspond to periods when the system is under high load.
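To rule load in or out more rigorously, one option is to log board power and temperature continuously so the next Xid 79 event can be correlated with the telemetry around its timestamp. A minimal sketch (the log path, 5 s interval, and sample data below are arbitrary; the `nvidia-smi --query-gpu` fields are standard):

```shell
#!/bin/sh
# Continuous telemetry logging (hardware-dependent, shown as a comment):
#
#   nvidia-smi --query-gpu=timestamp,index,power.draw,temperature.gpu \
#       --format=csv,noheader -l 5 >> gpu-telemetry.csv
#
# Afterwards, the peak temperature seen per GPU can be pulled out of the
# CSV with awk; demonstrated here on inline sample data in the same
# csv,noheader layout (timestamp, index, power.draw, temperature.gpu):
awk -F', ' '{ t = $4 + 0; if (t > max[$2]) max[$2] = t }
            END { for (g in max) print "GPU " g ": " max[g] " C" }' <<'EOF'
2024/01/01 10:00:00.000, 0, 215.3 W, 71
2024/01/01 10:00:00.000, 1, 120.0 W, 55
2024/01/01 10:00:05.000, 0, 230.1 W, 74
2024/01/01 10:00:05.000, 1, 118.2 W, 54
EOF
```

Cross-referencing the dmesg timestamp of the next fall-off-the-bus event against this log would show whether the affected GPU was actually near its power or thermal limits at the time.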
Bug Report Logs: nvidia-bug-report.log.gz (770.2 KB)