Ubuntu 20.04 - RTX3090 - GPU has fallen off the bus

dirkhornung91 · December 5, 2021, 11:41am

I am trying to train a machine learning model using TensorFlow on my Ubuntu 20.04 server with CUDA 11.2 and cuDNN 8.1 installed. Unfortunately, the GPU crashes and falls off the bus, as can be seen by running the dmesg command:

[  517.195242] NVRM: GPU at PCI:0000:0a:00: GPU-7a2f2bd6-a848-bf8e-0541-09ef347fba71
[  517.195246] NVRM: GPU Board Serial Number: 1322721012372
[  517.195248] NVRM: Xid (PCI:0000:0a:00): 79, pid=0, GPU has fallen off the bus.
[  517.195274] NVRM: GPU 0000:0a:00.0: GPU has fallen off the bus.
[  517.195276] NVRM: GPU 0000:0a:00.0: GPU is on Board 1322721012372.
[  517.195290] NVRM: A GPU crash dump has been created. If possible, please run
               NVRM: nvidia-bug-report.sh as root to collect this data before
               NVRM: the NVIDIA kernel module is unloaded.
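
For anyone who wants to scan a longer kernel log for events like the one above, here is a minimal Python sketch. It assumes the log text has already been captured (for example from dmesg) and that Xid lines follow the format shown in this post; the function name find_xid_events and the regular expression are illustrative, not part of any NVIDIA tool.

```python
import re

# Matches lines in the format shown above, e.g.:
#   NVRM: Xid (PCI:0000:0a:00): 79, pid=0, GPU has fallen off the bus.
# Assumption: other driver versions may format Xid lines slightly differently.
XID_RE = re.compile(
    r"NVRM: Xid \(PCI:(?P<bus>[0-9a-fA-F:.]+)\): (?P<xid>\d+),\s*(?P<rest>.*)"
)


def find_xid_events(log_text):
    """Return (pci_bus, xid_code, message) tuples for every Xid line found."""
    events = []
    for line in log_text.splitlines():
        match = XID_RE.search(line)
        if match:
            events.append(
                (match.group("bus"), int(match.group("xid")), match.group("rest").strip())
            )
    return events


# Sample taken from the dmesg output in this post.
SAMPLE_LOG = """\
[  517.195242] NVRM: GPU at PCI:0000:0a:00: GPU-7a2f2bd6-a848-bf8e-0541-09ef347fba71
[  517.195246] NVRM: GPU Board Serial Number: 1322721012372
[  517.195248] NVRM: Xid (PCI:0000:0a:00): 79, pid=0, GPU has fallen off the bus.
[  517.195274] NVRM: GPU 0000:0a:00.0: GPU has fallen off the bus.
"""

for bus, xid, message in find_xid_events(SAMPLE_LOG):
    print(f"PCI {bus}: Xid {xid} ({message})")
```

Only the line carrying the Xid code is reported; the surrounding NVRM lines are ignored.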

In my debugging attempts, I manually tested that the GPU does not crash due to:

the lack of power, by limiting the power usage to 250W via nvidia-smi -pl 250.
overheating, by monitoring the temperature via nvidia-smi --query-gpu=timestamp,temperature.gpu --format=csv, which never crossed 80 °C.
an out-of-memory error of the GPU, via nvidia-smi --query-gpu=timestamp,memory.free --format=csv, which was at its minimum 600MB.
a problem with my RAM, by running memtester multiple times.

What is the reason for the GPU falling off the bus? For me this seems to be a hardware problem?

nvidia-bug-report.log.gz (262.8 KB)

generix · December 6, 2021, 9:43am

Using nvidia-smi -pl is not a viable way to rule out power issues, since the limiter does not act instantaneously and therefore still allows power spikes during GPU boost. Please try limiting clocks instead, e.g.:
nvidia-smi -lgc 300,1800
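
To check whether the clock limit actually holds under load, power and clocks can be logged once per second. The Python sketch below only builds and prints such an nvidia-smi command; it does not run it. The helper name build_query_command is hypothetical, and the field list is an assumption about which nvidia-smi query properties are useful to log here.

```python
import shlex


def build_query_command(fields, interval_s=1):
    """Build an nvidia-smi command line that logs the given fields as CSV.

    `fields` are nvidia-smi query property names; `interval_s` is the
    polling interval in seconds (passed to the -l / --loop option).
    """
    return [
        "nvidia-smi",
        "--query-gpu=" + ",".join(fields),
        "--format=csv",
        "-l",
        str(interval_s),
    ]


# Assumed field selection: timestamp, temperature, power draw and clocks.
FIELDS = [
    "timestamp",
    "temperature.gpu",
    "power.draw",
    "clocks.current.sm",
    "clocks.current.memory",
    "clocks.current.graphics",
]

command = build_query_command(FIELDS, interval_s=1)
# Print the command so it can be pasted into a shell and redirected to a file.
print(shlex.join(command))
```

Redirecting the printed command's output to a file gives a CSV that can be inspected after a crash.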

dirkhornung91 · December 6, 2021, 8:14pm

I can exclude a power issue as well. I reduced the clock speed via nvidia-smi -lgc 300,1800 and monitored the power consumption:

timestamp, temperature.gpu, power.draw [W], clocks.current.sm [MHz], clocks.current.memory [MHz], clocks.current.graphics [MHz]
2021/12/06 19:45:36.624, 32, 28.27 W, 300 MHz, 405 MHz, 300 MHz
...
2021/12/06 19:52:39.711, 50, 152.64 W, 1800 MHz, 9501 MHz, 1800 MHz
2021/12/06 19:52:40.713, 50, 152.57 W, 1800 MHz, 9501 MHz, 1800 MHz
2021/12/06 19:52:41.715, 50, 152.50 W, 1800 MHz, 9501 MHz, 1800 MHz
2021/12/06 19:52:42.718, [GPU is lost], [GPU is lost], [GPU is lost], [GPU is lost], [GPU is lost]
2021/12/06 19:52:43.718, [GPU is lost], [GPU is lost], [GPU is lost], [GPU is lost], [GPU is lost]
2021/12/06 19:52:44.719, [GPU is lost], [GPU is lost], [GPU is lost], [GPU is lost], [GPU is lost]

gpu.csv (30.2 KB)

As you can see, the power draw stays stable for more than 7 minutes before the GPU is lost. I am also using a high-end Corsair AX1600 (1600W) PSU.
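
For reference, the time until the GPU is lost and the peak power draw can be pulled out of such a log with a short Python sketch. It assumes the CSV layout shown above (the header names and the "[GPU is lost]" marker are taken from this log); the function name summarize_gpu_log is illustrative, and the sample contains only the rows quoted in this post.

```python
import csv
from datetime import datetime

LOST_MARKER = "[GPU is lost]"


def summarize_gpu_log(csv_text):
    """Return seconds from the first sample to the first lost row, and peak power in W."""
    reader = csv.reader(csv_text.strip().splitlines(), skipinitialspace=True)
    header = next(reader)
    ts_col = header.index("timestamp")
    power_col = header.index("power.draw [W]")

    first_ts = None
    lost_ts = None
    peak_power = 0.0
    for row in reader:
        if len(row) <= power_col:
            continue  # skip elided or incomplete rows such as "..."
        ts = datetime.strptime(row[ts_col], "%Y/%m/%d %H:%M:%S.%f")
        if first_ts is None:
            first_ts = ts
        if row[power_col] == LOST_MARKER:
            lost_ts = ts
            break
        # Power values look like "152.64 W"; keep only the number.
        peak_power = max(peak_power, float(row[power_col].split()[0]))

    seconds = (lost_ts - first_ts).total_seconds() if lost_ts else None
    return {"seconds_until_lost": seconds, "peak_power_w": peak_power}


# Rows quoted in this post; the full gpu.csv attachment has many more samples.
SAMPLE_CSV = """\
timestamp, temperature.gpu, power.draw [W], clocks.current.sm [MHz], clocks.current.memory [MHz], clocks.current.graphics [MHz]
2021/12/06 19:45:36.624, 32, 28.27 W, 300 MHz, 405 MHz, 300 MHz
...
2021/12/06 19:52:39.711, 50, 152.64 W, 1800 MHz, 9501 MHz, 1800 MHz
2021/12/06 19:52:40.713, 50, 152.57 W, 1800 MHz, 9501 MHz, 1800 MHz
2021/12/06 19:52:41.715, 50, 152.50 W, 1800 MHz, 9501 MHz, 1800 MHz
2021/12/06 19:52:42.718, [GPU is lost], [GPU is lost], [GPU is lost], [GPU is lost], [GPU is lost]
"""

print(summarize_gpu_log(SAMPLE_CSV))
```

Note that one-second samples cannot show sub-second power spikes, so a low peak value in this summary does not by itself rule out transients.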

Could this be a hardware issue? I will try a different PCIe slot on my Aorus Master X570 motherboard to exclude a hardware error of the motherboard. If this doesn’t resolve the problem, I assume that the GPU hardware is faulty.

generix · December 6, 2021, 11:07pm

Yes, it can safely be assumed that neither temperature nor power is the problem. Did you already try reseating the card in its slot, possibly multiple times, to clear any dirt left over from the manufacturing plant? The next steps would be checking for a BIOS update, testing the card in a different slot, and possibly in a different system to check for a general hardware defect.

dirkhornung91 · December 11, 2021, 6:00pm

I updated the BIOS to its latest version (F35e) and reseated the card multiple times in a different slot, but the error persists.
I don’t have a different system at hand. Is there a way to send the card in for inspection?

generix · December 12, 2021, 2:50pm

You can only send it back to the vendor if still in warranty.

system · December 26, 2021, 2:50pm

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.
