GPU (4090) falls off the bus, Linux desktop

Description

First and foremost, I appreciate any help or guidance that anyone can provide. I am pretty much at my wits' end.

I have a dual-GPU desktop (System76) with two RTX 4090s. One of them, the card in the top slot, randomly falls off the bus. I haven’t been able to correlate it with any load: it can happen when all I am doing is reading email, or when I am running a machine learning program. Sometimes the system goes a few days without any problems; at the moment it has fallen off 3 times in the past 24 hours. The only way out after that is to power the machine off with the power button, because rebooting from the GUI or the command line hangs.

The system is fairly new (about 6 months old). It has been shipped back to the manufacturer multiple times and returned each time with the comment that they could not replicate the problem and hence could not repair it.

I am linking the output of nvidia-bug-report.sh to this topic in case it is helpful. The link is: Output of nvidia-bug-report.sh

The relevant section of dmesg is as follows:

[154565.721840] NVRM: GPU at PCI:0000:41:00: GPU-9acb9e54-2e15-8cdf-829f-c07a633c9f96
[154565.721845] NVRM: Xid (PCI:0000:41:00): 79, pid='<unknown>', name=<unknown>, GPU has fallen off the bus.
[154565.721847] NVRM: GPU 0000:41:00.0: GPU has fallen off the bus.
[154568.708290] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 512
[154568.708298] {1}[Hardware Error]: It has been corrected by h/w and requires no further action
[154568.708301] {1}[Hardware Error]: event severity: corrected
[154568.708304] {1}[Hardware Error]:  Error 0, type: corrected
[154568.708306] {1}[Hardware Error]:  fru_text: PcieError
[154568.708309] {1}[Hardware Error]:   section_type: PCIe error
[154568.708312] {1}[Hardware Error]:   port_type: 4, root port
[154568.708314] {1}[Hardware Error]:   version: 0.2
[154568.708317] {1}[Hardware Error]:   command: 0x0407, status: 0x0010
[154568.708320] {1}[Hardware Error]:   device_id: 0000:40:01.1
[154568.708324] {1}[Hardware Error]:   slot: 0
[154568.708326] {1}[Hardware Error]:   secondary_bus: 0x41
[154568.708328] {1}[Hardware Error]:   vendor_id: 0x1022, device_id: 0x1483
[154568.708331] {1}[Hardware Error]:   class_code: 060400
[154568.708334] {1}[Hardware Error]:   bridge: secondary_status: 0x2000, control: 0x0012
[154568.709979] pcieport 0000:40:01.1: AER: aer_status: 0x00000040, aer_mask: 0x00000000
[154568.709987] pcieport 0000:40:01.1:    [ 6] BadTLP                
[154568.709993] pcieport 0000:40:01.1: AER: aer_layer=Data Link Layer, aer_agent=Receiver ID
[155251.000557] snd_hda_codec_hdmi hdaudioC2D0: HDMI: invalid ELD buf size -1
(base) aganatra@system76-pc:~/Documents/system-logs$ nvidia-smi -i 0
Unable to determine the device handle for GPU0000:41:00.0: Unknown Error
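
To track how often these events occur, kernel log text like the excerpt above can be scanned with a small script. What follows is only a minimal sketch, assuming the message formats match the lines shown here. The function names and the SAMPLE string are illustrative, and the bit-name table uses the short names the Linux AER driver prints for correctable errors, so it only applies when the event severity is "corrected", as it is in this log.

```python
import re

# Matches lines such as:
#   [154565.721845] NVRM: Xid (PCI:0000:41:00): 79, pid=..., GPU has fallen off the bus.
XID_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+NVRM: Xid \(PCI:(?P<pci>[0-9a-fA-F:.]+)\): (?P<xid>\d+),"
)

# Matches lines such as:
#   [154568.709979] pcieport 0000:40:01.1: AER: aer_status: 0x00000040, aer_mask: 0x00000000
AER_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+pcieport (?P<port>\S+): AER: "
    r"aer_status: 0x(?P<status>[0-9a-fA-F]+)"
)

# Bit positions in the PCIe AER Correctable Error Status register.
# Assumption: the status value being decoded belongs to a corrected error.
CORRECTABLE_BITS = {
    0: "RxErr",
    6: "BadTLP",
    7: "BadDLLP",
    8: "Rollover",
    12: "Timeout",
    13: "NonFatalErr",
    14: "CorrIntErr",
    15: "HeaderOF",
}


def find_xid_events(log_text):
    """Return (timestamp, pci_address, xid_code) for every Xid line."""
    return [
        (float(m["ts"]), m["pci"], int(m["xid"]))
        for m in XID_RE.finditer(log_text)
    ]


def decode_correctable(status):
    """Return the names of the correctable-error bits set in status."""
    return [
        name
        for bit, name in sorted(CORRECTABLE_BITS.items())
        if status & (1 << bit)
    ]


def find_aer_events(log_text):
    """Return (timestamp, port, decoded_bit_names) for every aer_status line."""
    return [
        (float(m["ts"]), m["port"], decode_correctable(int(m["status"], 16)))
        for m in AER_RE.finditer(log_text)
    ]


# Two lines copied from the dmesg excerpt above, used as sample input.
SAMPLE = (
    "[154565.721845] NVRM: Xid (PCI:0000:41:00): 79, pid='<unknown>', "
    "name=<unknown>, GPU has fallen off the bus.\n"
    "[154568.709979] pcieport 0000:40:01.1: AER: aer_status: 0x00000040, "
    "aer_mask: 0x00000000\n"
)

if __name__ == "__main__":
    for event in find_xid_events(SAMPLE):
        print("Xid event:", event)
    for event in find_aer_events(SAMPLE):
        print("AER event:", event)
```

On a live system the input text could come from saved dmesg output instead of SAMPLE. Comparing the timestamps of Xid 79 events with those of AER events on the upstream root port (0000:40:01.1 here) may help show whether the link errors consistently accompany the drop.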

Environment

TensorRT Version: NA
GPU Type: NVIDIA GeForce RTX 4090
Nvidia Driver Version: 545.29.06

CUDA Version:
CUDNN Version:
Operating System + Version: Linux system76-pc 6.6.10-76060610-generic #202401051437~1704728131~22.04~24d69e2~dev-Ubuntu SMP PREEMPT_DY x86_64 x86_64 x86_64 GNU/Linux

Relevant Files

Output of nvidia-bug-report.sh

Hello, I am also having the same issue with a 3060 on my Ubuntu 22.04 desktop.
The event happens randomly, at least once a day, usually while I am just browsing the internet. Ironically, it has never happened while the GPU is under load, such as during a game or while running Stable Diffusion.

[58482.425209] NVRM: GPU at PCI:0000:01:00: GPU-443f2425-b23c-51aa-c307-a690275009c6
[58482.425240] NVRM: Xid (PCI:0000:01:00): 79, pid='<unknown>', name=<unknown>, GPU has fallen off the bus.
[58482.425249] NVRM: GPU 0000:01:00.0: GPU has fallen off the bus.
[58482.425253] NVRM: A GPU crash dump has been created. If possible, please run
               NVRM: nvidia-bug-report.sh as root to collect this data before
               NVRM: the NVIDIA kernel module is unloaded.

I have tried:

  • Modifying the GRUB_CMDLINE_LINUX_DEFAULT variable to disable power-management options
  • Replacing the riser cable
  • Upgrading the PSU (850W)

Specs

  • NVIDIA GeForce RTX 3060
  • Driver Version: 535.171.04
  • CUDA Version: 12.2
  • Linux desktop 6.5.0-35-generic #35~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Tue May 7 09:00:52 UTC 2 x86_64 x86_64 x86_64 GNU/Linux

I had trouble with the file uploader here, so I uploaded the file to Pixeldrain.

Updating my BIOS seems to have fixed the issue for me. I haven’t had the “GPU has fallen off the bus” error in a few days.
My motherboard is an Asus ROG STRIX B760-I GAMING WIFI. The BIOS version originally installed was from July 2023. I updated to the latest build (May 2024) using the flasher tool built into the BIOS.
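
For anyone who wants to confirm which BIOS build is installed before and after flashing, here is a minimal sketch that reads the DMI fields Linux exposes under /sys/class/dmi/id. The function name read_bios_info is just illustrative, and the sketch assumes the kernel on your system populates these files; any field that cannot be read is reported as None.

```python
from pathlib import Path


def read_bios_info(dmi_dir="/sys/class/dmi/id"):
    """Read BIOS and board identification strings from the DMI sysfs directory."""
    base = Path(dmi_dir)
    info = {}
    for field in ("bios_vendor", "bios_version", "bios_date", "board_name"):
        try:
            info[field] = (base / field).read_text().strip()
        except OSError:
            # File missing or unreadable on this system.
            info[field] = None
    return info


if __name__ == "__main__":
    for key, value in read_bios_info().items():
        print(f"{key}: {value}")
```

Running it once before the update and once after should show the bios_version and bios_date values change if the flash took effect.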

Hope that helps someone else in the future.