GPU Sporadically Falls Off Bus During TensorFlow Training

I have a system with four RTX 2080s that I'm using to train and test TensorFlow models. Recently it has become unstable: every now and then during training, one of the GPUs falls off the bus and is unusable until the system is rebooted. The error in dmesg looks like:

[64190.200239] NVRM: GPU at PCI:0000:67:00: GPU-7c922b92-ce48-5d3a-06eb-ef8a6c91ae74
[64190.200240] NVRM: GPU Board Serial Number:
[64190.200241] NVRM: Xid (PCI:0000:67:00): 79, pid=20833, GPU has fallen off the bus.
[64190.200259] NVRM: GPU 0000:67:00.0: GPU has fallen off the bus.
[64190.200260] NVRM: GPU 0000:67:00.0: GPU is on Board .
[64190.200270] NVRM: A GPU crash dump has been created. If possible, please run
               NVRM: nvidia-bug-report.sh as root to collect this data before
               NVRM: the NVIDIA kernel module is unloaded.

Initially I thought the issue was power- or thermal-related, but I can't reproduce it by running stress tests (e.g. gpu-burn), and the failures don't obviously coincide with periods when the system is under high load.

Bug Report Logs: nvidia-bug-report.log.gz (770.2 KB)

According to the logs, it's always the same GPU at PCI 67:00.0 that falls off the bus. Please try reseating it in its slot, reseating the power connectors, and logging temperatures. You might also swap cards to check whether this is slot/position dependent; otherwise, the GPU itself might be failing.
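
For the temperature logging, polling nvidia-smi from a small script that runs alongside training is usually enough. Below is a minimal Python sketch; the 5-second interval, the output file name, and the choice of query fields are just suggestions, not anything taken from your setup:

#!/usr/bin/env python3
# Minimal GPU telemetry logger: appends one CSV line per GPU per sample.
import subprocess
import time

LOG_PATH = "gpu_telemetry.csv"   # example output file
INTERVAL_SECONDS = 5             # arbitrary polling interval

# Standard nvidia-smi --query-gpu properties.
QUERY = "timestamp,pci.bus_id,temperature.gpu,power.draw,utilization.gpu,clocks.sm"

def sample() -> str:
    """Return one CSV block (one line per GPU) from nvidia-smi."""
    return subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    ).stdout

if __name__ == "__main__":
    with open(LOG_PATH, "a") as log:
        while True:
            try:
                log.write(sample())
            except subprocess.CalledProcessError:
                # nvidia-smi tends to error out once a GPU has fallen off
                # the bus, so record the event and keep going.
                log.write(f"{time.ctime()}, nvidia-smi error\n")
            log.flush()
            time.sleep(INTERVAL_SECONDS)

If the last samples for 67:00.0 before an Xid 79 show unremarkable temperature and power draw, that points away from cooling or power delivery and more toward the slot, riser, or the card itself.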

PCI 67:00.0 definitely seems to be the most common, but I have seen a different GPU fall off on at least one previous occasion:

[532755.876235] NVRM: GPU Board Serial Number:
[532755.876239] NVRM: Xid (PCI:0000:1a:00): 79, pid=2801777, GPU has fallen off the bus.
[532755.876243] NVRM: GPU 0000:1a:00.0: GPU has fallen off the bus.
[532755.876246] NVRM: GPU 0000:1a:00.0: GPU is on Board .
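
For what it's worth, a quick way to confirm which device fails most often is to tally the Xid 79 events per PCI address in the kernel log. Here's a rough Python sketch; it assumes the NVRM line format shown above and reads the log from stdin (e.g. journalctl -k --no-pager | python3 count_xid79.py, where the script name is just an example):

#!/usr/bin/env python3
# Count Xid 79 ("GPU has fallen off the bus") events per PCI address.
import re
import sys
from collections import Counter

# Matches e.g. "NVRM: Xid (PCI:0000:67:00): 79, pid=20833, GPU has fallen off the bus."
XID79 = re.compile(r"NVRM: Xid \(PCI:([0-9a-fA-F:]+)\): 79\b")

counts = Counter()
for line in sys.stdin:
    match = XID79.search(line)
    if match:
        counts[match.group(1)] += 1

for bus_id, n in counts.most_common():
    print(f"{bus_id}: {n} event(s)")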