Ubuntu 20.04 - RTX3090 - GPU has fallen off the bus

I am trying to train a machine learning model using TensorFlow on my Ubuntu 20.04 server with CUDA 11.2 and cuDNN 8.1 installed. Unfortunately, the GPU crashes and falls off the bus, as the output of dmesg shows:

[  517.195242] NVRM: GPU at PCI:0000:0a:00: GPU-7a2f2bd6-a848-bf8e-0541-09ef347fba71
[  517.195246] NVRM: GPU Board Serial Number: 1322721012372
[  517.195248] NVRM: Xid (PCI:0000:0a:00): 79, pid=0, GPU has fallen off the bus.
[  517.195274] NVRM: GPU 0000:0a:00.0: GPU has fallen off the bus.
[  517.195276] NVRM: GPU 0000:0a:00.0: GPU is on Board 1322721012372.
[  517.195290] NVRM: A GPU crash dump has been created. If possible, please run
               NVRM: nvidia-bug-report.sh as root to collect this data before
               NVRM: the NVIDIA kernel module is unloaded.
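
To pick such events out of a longer kernel log, a small script can help. This is only a minimal sketch, not an NVIDIA tool: the function name find_xid_events is made up for illustration, and the regular expression assumes the NVRM line format shown above.

```python
import re

# Matches lines such as:
#   NVRM: Xid (PCI:0000:0a:00): 79, pid=0, GPU has fallen off the bus.
# The pattern is an assumption based on the log format shown above.
XID_PATTERN = re.compile(r"NVRM: Xid \((PCI:[0-9a-fA-F:.]+)\): (\d+),\s*(.*)")


def find_xid_events(log_text):
    """Return a list of (pci_address, xid_code, message) tuples found in log_text."""
    events = []
    for line in log_text.splitlines():
        match = XID_PATTERN.search(line)
        if match:
            pci, code, message = match.groups()
            events.append((pci, int(code), message.strip()))
    return events


sample = """\
[  517.195242] NVRM: GPU at PCI:0000:0a:00: GPU-7a2f2bd6-a848-bf8e-0541-09ef347fba71
[  517.195248] NVRM: Xid (PCI:0000:0a:00): 79, pid=0, GPU has fallen off the bus.
"""

for pci, code, message in find_xid_events(sample):
    print(f"{pci}: Xid {code}: {message}")
```

In practice log_text would be the captured output of dmesg rather than the hard-coded sample.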

In my debugging attempts, I manually tested that the GPU does not crash due to:

  • a lack of power, by limiting the power usage to 250 W via nvidia-smi -pl 250.
  • overheating, by monitoring the temperature via nvidia-smi --query-gpu=timestamp,temperature.gpu, which never exceeded 80 °C.
  • the GPU running out of memory, by monitoring nvidia-smi --query-gpu=timestamp,memory.free, which never dropped below 600 MB.
  • a problem with my RAM, by running memtester multiple times.
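
The temperature and memory checks above can be combined into a single logging command. A minimal sketch, assuming the query field names temperature.gpu, memory.free and power.draw as listed by nvidia-smi --help-query-gpu. The helper build_query_command is hypothetical and only assembles the command line; it does not run it.

```python
import shlex


def build_query_command(fields, interval_s=1):
    """Assemble an nvidia-smi call that logs the given fields as CSV every interval_s seconds."""
    return [
        "nvidia-smi",
        "--query-gpu=" + ",".join(fields),
        "--format=csv",
        "-l", str(interval_s),
    ]


command = build_query_command(["timestamp", "temperature.gpu", "memory.free", "power.draw"])
# Print a copy-pastable shell command; redirect its output to a file to keep a log.
print(shlex.join(command))
```

Logging all values in one file makes it easier to see what the card was doing in the last second before it disappeared.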

What causes the GPU to fall off the bus? To me this looks like a hardware problem.

nvidia-bug-report.log.gz (262.8 KB)

Using nvidia-smi -pl is not a viable way to rule out power issues, since the limiter does not act instantaneously and therefore still allows power spikes during GPU boost. Please try limiting the clocks instead, e.g.:
nvidia-smi -lgc 300,1800

I can exclude a power issue as well. I reduced the clock speed via nvidia-smi -lgc 300,1800 and monitored the power consumption:

timestamp, temperature.gpu, power.draw [W], clocks.current.sm [MHz], clocks.current.memory [MHz], clocks.current.graphics [MHz]
2021/12/06 19:45:36.624, 32, 28.27 W, 300 MHz, 405 MHz, 300 MHz
2021/12/06 19:52:39.711, 50, 152.64 W, 1800 MHz, 9501 MHz, 1800 MHz
2021/12/06 19:52:40.713, 50, 152.57 W, 1800 MHz, 9501 MHz, 1800 MHz
2021/12/06 19:52:41.715, 50, 152.50 W, 1800 MHz, 9501 MHz, 1800 MHz
2021/12/06 19:52:42.718, [GPU is lost], [GPU is lost], [GPU is lost], [GPU is lost], [GPU is lost]
2021/12/06 19:52:43.718, [GPU is lost], [GPU is lost], [GPU is lost], [GPU is lost], [GPU is lost]
2021/12/06 19:52:44.719, [GPU is lost], [GPU is lost], [GPU is lost], [GPU is lost], [GPU is lost]

gpu.csv (30.2 KB)

As you can see, the power draw stays stable for more than 7 minutes before the GPU is lost. I am also using a high-end Corsair AX1600 (1600 W) PSU.
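
The time until the GPU is lost can also be read from the log programmatically. A minimal sketch, assuming the CSV layout shown above (timestamp in the first column, [GPU is lost] in the remaining columns once the card disappears). The function name seconds_until_gpu_lost is made up for illustration.

```python
from datetime import datetime

# Timestamp layout as it appears in the nvidia-smi CSV log above.
TIMESTAMP_FORMAT = "%Y/%m/%d %H:%M:%S.%f"


def seconds_until_gpu_lost(csv_text):
    """Return seconds from the first sample to the first row reporting a lost GPU.

    Returns None if the log never reports a lost GPU.
    """
    first = None
    for line in csv_text.splitlines():
        columns = [column.strip() for column in line.split(",")]
        try:
            stamp = datetime.strptime(columns[0], TIMESTAMP_FORMAT)
        except ValueError:
            continue  # header, blank, or malformed line
        if first is None:
            first = stamp
        if "[GPU is lost]" in columns[1:]:
            return (stamp - first).total_seconds()
    return None


sample = """\
timestamp, temperature.gpu, power.draw [W], clocks.current.sm [MHz], clocks.current.memory [MHz], clocks.current.graphics [MHz]
2021/12/06 19:45:36.624, 32, 28.27 W, 300 MHz, 405 MHz, 300 MHz
2021/12/06 19:52:41.715, 50, 152.50 W, 1800 MHz, 9501 MHz, 1800 MHz
2021/12/06 19:52:42.718, [GPU is lost], [GPU is lost], [GPU is lost], [GPU is lost], [GPU is lost]
"""

print(seconds_until_gpu_lost(sample))
```

For the excerpt above this comes to a little over seven minutes between the first sample and the first lost reading.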

Could this be a hardware issue? I will try a different PCIe slot on my Aorus Master X570 motherboard to rule out a motherboard fault. If this doesn’t resolve the problem, I will assume that the GPU itself is faulty.

Yes, it can safely be assumed that neither temperature nor power is the problem. Did you already try reseating the card in its slot, possibly multiple times, to clear any dirt left over from the manufacturing plant? The next steps would be to check for a BIOS update, then test the card in a different slot and, if possible, in a different system to check for a general hardware defect.

I updated the BIOS to its latest version (F35e) and reseated the card multiple times in a different slot, but the error persists.
I don’t have a different system at hand. Is there a way to send the card in for inspection?

You can only send it back to the vendor if it is still under warranty.
