I have been struggling to find the root cause of this problem for months, without success. I have tried all the suggestions I could find on this forum and elsewhere. I would appreciate any advice that could help me track down the cause, and I am happy to pay you for the time you spend on this. Thank you in advance.
Problem description: The GPU consistently “falls off the bus” a few epochs into training an AI model, regardless of model size, GPU memory usage, or GPU utilization. It only happens when the GPU is under load; if the GPU is left idle it does not happen.
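For context, the failure does not seem to depend on the specific training code; any sustained GPU load appears to trigger it eventually. The following is only an illustrative sketch (it assumes PyTorch, as shipped with Lambda Stack; the matrix size and iteration count are arbitrary) of the kind of synthetic load that keeps the GPU busy in the same way as training:

# Illustrative synthetic GPU load (assumes PyTorch is available, e.g. via Lambda Stack).
# Matrix size and iteration count are arbitrary; any sustained GPU work behaves similarly.
import torch

def burn_gpu(size: int = 8192, iterations: int = 10_000) -> None:
    device = torch.device("cuda")
    a = torch.randn(size, size, device=device)
    b = torch.randn(size, size, device=device)
    for i in range(iterations):
        c = a @ b                      # large matmul keeps the GPU at high utilization; result discarded
        if i % 100 == 0:
            torch.cuda.synchronize()   # flush queued work so any CUDA error surfaces promptly
            print(f"iteration {i}: ok")

if __name__ == "__main__":
    burn_gpu()

Something along these lines could be used to stress the card independently of the full training pipeline.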
Hardware:
CPU: 13th Gen Intel(R) Core™ i7-13700KF (vendor_id: GenuineIntel, cpu family: 6, model: 183)
Motherboard: ASUSTeK COMPUTER INC. PRIME Z790-A WIFI
BIOS version: 1010 (latest)
GPU: MSI GeForce RTX 4090
PSU: 1500W
Software: Ubuntu 22.04 + Lambda Stack
NVIDIA driver: 525.116.04, CUDA Version: 12.0
Linux Kernel: 5.19.0-45-generic
I have tried at least the following:
- Tried different HDMI and DP cables and a different monitor
- Tried with the X server on and off
- Updated the BIOS to the latest version
- Updated the system
- Turned on NVIDIA persistence mode
- Limited the maximum GPU clock to 2000 MHz
- Tried different kernel parameters, e.g. pci=noaer, pci=nommconf
- Ran the Python code inside a Docker environment
- Ensured that the PSU's rated power (1500 W) is more than sufficient for the GPU
- Ensured that ventilation is good; the maximum GPU temperature reported is around 65 degrees Celsius (see the monitoring sketch after this list)
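The 65 °C figure above is what the driver reports while training runs. A small watcher along these lines (a sketch only; it assumes the nvidia-ml-py package providing the pynvml bindings, and the one-second poll interval is arbitrary) logs temperature, clocks, power draw, and utilization until the card disappears; once the GPU drops off the bus the NVML calls start raising errors, which gives a timestamp to match against the Xid 79 in dmesg:

# Sketch: periodic GPU health logging via NVML (assumes `pip install nvidia-ml-py`).
import time
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

try:
    while True:
        temp  = pynvml.nvmlDeviceGetTemperature(handle, pynvml.NVML_TEMPERATURE_GPU)
        clock = pynvml.nvmlDeviceGetClockInfo(handle, pynvml.NVML_CLOCK_GRAPHICS)
        power = pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0   # milliwatts -> watts
        util  = pynvml.nvmlDeviceGetUtilizationRates(handle).gpu
        print(f"{time.strftime('%H:%M:%S')}  temp={temp}C  clock={clock}MHz  "
              f"power={power:.0f}W  util={util}%")
        time.sleep(1)   # poll interval is arbitrary
except KeyboardInterrupt:
    pass
finally:
    pynvml.nvmlShutdown()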
What I would like to know:
- Do I have a bad motherboard?
- Do I have a bad RTX 4090 card?
- Do I have a bad power supply?
- Is it a software or hardware problem?
- Any suggestions?
Typical dmesg output:
[Wed Jun 21 14:25:01 2023] docker0: port 1(veth23ae11e) entered blocking state
[Wed Jun 21 14:25:01 2023] docker0: port 1(veth23ae11e) entered disabled state
[Wed Jun 21 14:25:01 2023] device veth23ae11e entered promiscuous mode
[Wed Jun 21 14:25:01 2023] eth0: renamed from vethb07af9e
[Wed Jun 21 14:25:02 2023] IPv6: ADDRCONF(NETDEV_CHANGE): veth23ae11e: link becomes ready
[Wed Jun 21 14:25:02 2023] docker0: port 1(veth23ae11e) entered blocking state
[Wed Jun 21 14:25:02 2023] docker0: port 1(veth23ae11e) entered forwarding state
[Wed Jun 21 14:25:02 2023] IPv6: ADDRCONF(NETDEV_CHANGE): docker0: link becomes ready
[Wed Jun 21 14:46:37 2023] pcieport 0000:00:01.0: AER: Multiple Corrected error received: 0000:00:01.0
[Wed Jun 21 14:46:37 2023] pcieport 0000:00:01.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
[Wed Jun 21 14:46:37 2023] pcieport 0000:00:01.0: device [8086:a70d] error status/mask=00002001/00002000
[Wed Jun 21 14:46:37 2023] pcieport 0000:00:01.0: [ 0] RxErr
[Wed Jun 21 14:46:37 2023] NVRM: GPU at PCI:0000:01:00: GPU-efe80efb-a171-6dd3-7cfc-e5ea3ea54439
[Wed Jun 21 14:46:37 2023] NVRM: Xid (PCI:0000:01:00): 79, pid='', name=, GPU has fallen off the bus.
[Wed Jun 21 14:46:37 2023] NVRM: GPU 0000:01:00.0: GPU has fallen off the bus.
[Wed Jun 21 14:48:20 2023] nvidia-modeset: ERROR: GPU:0: Failed detecting connected display devices
[Wed Jun 21 14:48:20 2023] nvidia-modeset: ERROR: GPU:0: Failed detecting connected display devices
[Wed Jun 21 14:48:20 2023] nvidia-modeset: WARNING: GPU:0: Failure processing EDID for display device DELL P2213 (DP-2).
[Wed Jun 21 14:48:20 2023] nvidia-modeset: WARNING: GPU:0: Unable to read EDID for display device DELL P2213 (DP-2)
[Wed Jun 21 14:48:20 2023] nvidia-modeset: ERROR: GPU:0: Failure reading maximum pixel clock value for display device DP-2.
[Wed Jun 21 14:48:20 2023] nvidia-modeset: ERROR: GPU:0: Failed detecting connected display devices
[Wed Jun 21 14:48:20 2023] nvidia-modeset: ERROR: GPU:0: Failed detecting connected display devices
Bug reports obtained from nvidia-bug-report.sh:
nvidia-bug-report.log.gz (191.2 KB)
nvidia-bug-report.log1.gz (161.9 KB)
nvidia-bug-report.log2.gz (164.1 KB)
nvidia-bug-report.log3.gz (155.0 KB)