I have been struggling to find the root cause of this problem for months, without success. I have tried all the suggestions I could find on this forum and elsewhere. I would appreciate any advice that could help me track down the cause, and I am happy to pay you for the time you spend on this. Thank you in advance.
Problem description: The GPU consistently “falls off the bus” a few epochs into training an AI model, regardless of model size, GPU memory usage, or GPU utilization. It only happens when the GPU is under load; if the GPU is left idle it does not happen.
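For context, the failure does not seem to depend on the specific training code; any sustained GPU load appears to trigger it eventually. The following is only an illustrative sketch (it assumes PyTorch, as shipped with Lambda Stack; the matrix size and iteration count are arbitrary) of the kind of synthetic load that keeps the GPU busy in the same way as training:

# Illustrative synthetic GPU load (assumes PyTorch is available, e.g. via Lambda Stack).
# Matrix size and iteration count are arbitrary; any sustained GPU work behaves similarly.
import torch

def burn_gpu(size: int = 8192, iterations: int = 10_000) -> None:
    device = torch.device("cuda")
    a = torch.randn(size, size, device=device)
    b = torch.randn(size, size, device=device)
    for i in range(iterations):
        c = a @ b                      # large matmul keeps the GPU at high utilization; result discarded
        if i % 100 == 0:
            torch.cuda.synchronize()   # flush queued work so any CUDA error surfaces promptly
            print(f"iteration {i}: ok")

if __name__ == "__main__":
    burn_gpu()

Something along these lines could be used to stress the card independently of the full training pipeline.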
Hardware:
CPU: 13th Gen Intel(R) Core™ i7-13700KF (vendor_id: GenuineIntel, cpu family: 6, model: 183)
Motherboard: ASUSTeK COMPUTER INC. PRIME Z790-A WIFI
BIOS version: 1010 (latest)
GPU: MSI GeForce RTX 4090
PSU: 1500W
Software: Ubuntu 22.04 + Lambda Stack
NVIDIA driver: 525.116.04, CUDA Version: 12.0
Linux Kernel: 5.19.0-45-generic
I have tried at least the following:
- Tried different HDMI and DP cables and a different monitor
- Tried with the X server on and off
- Updated the BIOS to the latest version
- Updated the system
- Turned on NVIDIA persistence mode
- Limited the maximum GPU clock to 2000 MHz
- Tried different kernel parameters, e.g. pci=noaer, pci=nommconf
- Ran the Python code inside a Docker environment
- Ensured that the PSU's rated power (1500 W) is more than sufficient for the GPU
- Ensured that ventilation is good; the maximum GPU temperature reported is around 65 degrees Celsius (see the monitoring sketch after this list)
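The 65 °C figure above is what the driver reports while training runs. A small watcher along these lines (a sketch only; it assumes the nvidia-ml-py package providing the pynvml bindings, and the one-second poll interval is arbitrary) logs temperature, clocks, power draw, and utilization until the card disappears; once the GPU drops off the bus the NVML calls start raising errors, which gives a timestamp to match against the Xid 79 in dmesg:

# Sketch: periodic GPU health logging via NVML (assumes `pip install nvidia-ml-py`).
import time
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

try:
    while True:
        temp  = pynvml.nvmlDeviceGetTemperature(handle, pynvml.NVML_TEMPERATURE_GPU)
        clock = pynvml.nvmlDeviceGetClockInfo(handle, pynvml.NVML_CLOCK_GRAPHICS)
        power = pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0   # milliwatts -> watts
        util  = pynvml.nvmlDeviceGetUtilizationRates(handle).gpu
        print(f"{time.strftime('%H:%M:%S')}  temp={temp}C  clock={clock}MHz  "
              f"power={power:.0f}W  util={util}%")
        time.sleep(1)   # poll interval is arbitrary
except KeyboardInterrupt:
    pass
finally:
    pynvml.nvmlShutdown()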
What I would like to know:
- Do I have a bad motherboard?
- Do I have a bad RTX 4090 card?
- Do I have a bad power supply?
- Is it a software or hardware problem?
- Any suggestions?
Typical dmesg output:
[Wed Jun 21 14:25:01 2023] docker0: port 1(veth23ae11e) entered blocking state
[Wed Jun 21 14:25:01 2023] docker0: port 1(veth23ae11e) entered disabled state
[Wed Jun 21 14:25:01 2023] device veth23ae11e entered promiscuous mode
[Wed Jun 21 14:25:01 2023] eth0: renamed from vethb07af9e
[Wed Jun 21 14:25:02 2023] IPv6: ADDRCONF(NETDEV_CHANGE): veth23ae11e: link becomes ready
[Wed Jun 21 14:25:02 2023] docker0: port 1(veth23ae11e) entered blocking state
[Wed Jun 21 14:25:02 2023] docker0: port 1(veth23ae11e) entered forwarding state
[Wed Jun 21 14:25:02 2023] IPv6: ADDRCONF(NETDEV_CHANGE): docker0: link becomes ready
[Wed Jun 21 14:46:37 2023] pcieport 0000:00:01.0: AER: Multiple Corrected error received: 0000:00:01.0
[Wed Jun 21 14:46:37 2023] pcieport 0000:00:01.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
[Wed Jun 21 14:46:37 2023] pcieport 0000:00:01.0: device [8086:a70d] error status/mask=00002001/00002000
[Wed Jun 21 14:46:37 2023] pcieport 0000:00:01.0: [ 0] RxErr
[Wed Jun 21 14:46:37 2023] NVRM: GPU at PCI:0000:01:00: GPU-efe80efb-a171-6dd3-7cfc-e5ea3ea54439
[Wed Jun 21 14:46:37 2023] NVRM: Xid (PCI:0000:01:00): 79, pid='', name=, GPU has fallen off the bus.
[Wed Jun 21 14:46:37 2023] NVRM: GPU 0000:01:00.0: GPU has fallen off the bus.
[Wed Jun 21 14:48:20 2023] nvidia-modeset: ERROR: GPU:0: Failed detecting connected display devices
[Wed Jun 21 14:48:20 2023] nvidia-modeset: ERROR: GPU:0: Failed detecting connected display devices
[Wed Jun 21 14:48:20 2023] nvidia-modeset: WARNING: GPU:0: Failure processing EDID for display device DELL P2213 (DP-2).
[Wed Jun 21 14:48:20 2023] nvidia-modeset: WARNING: GPU:0: Unable to read EDID for display device DELL P2213 (DP-2)
[Wed Jun 21 14:48:20 2023] nvidia-modeset: ERROR: GPU:0: Failure reading maximum pixel clock value for display device DP-2.
[Wed Jun 21 14:48:20 2023] nvidia-modeset: ERROR: GPU:0: Failed detecting connected display devices
[Wed Jun 21 14:48:20 2023] nvidia-modeset: ERROR: GPU:0: Failed detecting connected display devices
Bug reports obtained from nvidia-bug-report.sh:
nvidia-bug-report.log.gz (191.2 KB)
nvidia-bug-report.log1.gz (161.9 KB)
nvidia-bug-report.log2.gz (164.1 KB)
nvidia-bug-report.log3.gz (155.0 KB)