GPU (4090) falls off the bus, Linux desktop

ganatraad · March 21, 2024, 8:44pm

Description

First and foremost, I appreciate any help or guidance that anyone can provide. I am pretty much at my wits’ end.

I have a dual-GPU desktop (System76) with two RTX 4090s. The GPU in the top slot randomly falls off the bus. I haven’t been able to correlate it with any load: it can happen when all I am doing is reading email, or when I am running a machine learning program. Sometimes the system goes a few days without any problems; at the moment it has fallen off 3 times in the past 24 hours. The only way out after that is to power the machine off with the power button, because rebooting from the GUI or the command line hangs.

The system is fairly new (about 6 months old). It has been shipped back to the manufacturer multiple times and returned each time with the comment that they could not replicate the problem and hence could not repair it.

I am linking the output of nvidia-bug-report.sh to this topic in case it is helpful. The link is: Output of nvidia-bug-report.sh

The relevant section of dmesg is as follows:

[154565.721840] NVRM: GPU at PCI:0000:41:00: GPU-9acb9e54-2e15-8cdf-829f-c07a633c9f96
[154565.721845] NVRM: Xid (PCI:0000:41:00): 79, pid='<unknown>', name=<unknown>, GPU has fallen off the bus.
[154565.721847] NVRM: GPU 0000:41:00.0: GPU has fallen off the bus.
[154568.708290] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 512
[154568.708298] {1}[Hardware Error]: It has been corrected by h/w and requires no further action
[154568.708301] {1}[Hardware Error]: event severity: corrected
[154568.708304] {1}[Hardware Error]:  Error 0, type: corrected
[154568.708306] {1}[Hardware Error]:  fru_text: PcieError
[154568.708309] {1}[Hardware Error]:   section_type: PCIe error
[154568.708312] {1}[Hardware Error]:   port_type: 4, root port
[154568.708314] {1}[Hardware Error]:   version: 0.2
[154568.708317] {1}[Hardware Error]:   command: 0x0407, status: 0x0010
[154568.708320] {1}[Hardware Error]:   device_id: 0000:40:01.1
[154568.708324] {1}[Hardware Error]:   slot: 0
[154568.708326] {1}[Hardware Error]:   secondary_bus: 0x41
[154568.708328] {1}[Hardware Error]:   vendor_id: 0x1022, device_id: 0x1483
[154568.708331] {1}[Hardware Error]:   class_code: 060400
[154568.708334] {1}[Hardware Error]:   bridge: secondary_status: 0x2000, control: 0x0012
[154568.709979] pcieport 0000:40:01.1: AER: aer_status: 0x00000040, aer_mask: 0x00000000
[154568.709987] pcieport 0000:40:01.1:    [ 6] BadTLP                
[154568.709993] pcieport 0000:40:01.1: AER: aer_layer=Data Link Layer, aer_agent=Receiver ID
[155251.000557] snd_hda_codec_hdmi hdaudioC2D0: HDMI: invalid ELD buf size -1
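
For anyone who wants to count these events across a longer log, here is a minimal Python sketch that pulls the Xid lines and the AER status bitmask out of saved dmesg text. The function names (find_xid_events, aer_status_bits) and the regular expressions are my own illustration, written against the exact line shapes shown above; other driver or kernel versions may format these lines differently.

```python
import re

# Matches NVRM Xid lines in the shape shown in the log above, e.g.
# [154565.721845] NVRM: Xid (PCI:0000:41:00): 79, pid=..., GPU has fallen off the bus.
XID_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+NVRM: Xid \(PCI:(?P<bdf>[0-9a-fA-F:.]+)\): (?P<xid>\d+),"
)

# Matches the AER summary line printed for the PCIe root port.
AER_RE = re.compile(
    r"pcieport (?P<port>[0-9a-fA-F:.]+): AER: aer_status: 0x(?P<status>[0-9a-fA-F]+)"
)


def find_xid_events(log_text):
    """Return a list of (timestamp, pci_address, xid_code) tuples."""
    return [
        (float(m["ts"]), m["bdf"], int(m["xid"]))
        for m in XID_RE.finditer(log_text)
    ]


def aer_status_bits(log_text):
    """Return a list of (root_port, [set bit positions]) from aer_status values."""
    results = []
    for m in AER_RE.finditer(log_text):
        status = int(m["status"], 16)
        bits = [bit for bit in range(32) if (status >> bit) & 1]
        results.append((m["port"], bits))
    return results


# Three lines copied from the dmesg excerpt above.
SAMPLE = """\
[154565.721845] NVRM: Xid (PCI:0000:41:00): 79, pid='<unknown>', name=<unknown>, GPU has fallen off the bus.
[154568.709979] pcieport 0000:40:01.1: AER: aer_status: 0x00000040, aer_mask: 0x00000000
[154568.709987] pcieport 0000:40:01.1:    [ 6] BadTLP
"""

if __name__ == "__main__":
    print(find_xid_events(SAMPLE))
    print(aer_status_bits(SAMPLE))
```

On the sample, the only set bit in aer_status is bit 6, which is the same bit the kernel labels BadTLP on the following line. To run it on a full log, save the output of dmesg to a file and pass the file contents to both functions.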

(base) aganatra@system76-pc:~/Documents/system-logs$ nvidia-smi -i 0
Unable to determine the device handle for GPU0000:41:00.0: Unknown Error

Environment

TensorRT Version: NA
GPU Type: NVIDIA GeForce RTX 4090
Nvidia Driver Version: 545.29.06

CUDA Version:
CUDNN Version:
Operating System + Version: Linux system76-pc 6.6.10-76060610-generic #202401051437~1704728131~22.04~24d69e2~dev-Ubuntu SMP PREEMPT_DY x86_64 x86_64 x86_64 GNU/Linux

Relevant Files

Output of nvidia-bug-report.sh

5hacf6up3 · May 22, 2024, 7:25am

Hello, I am also having this same issue with a 3060 on my Ubuntu 22.04 Desktop.
The event happens randomly, at least once a day, usually while I am just browsing the internet. Ironically, it has never happened while the GPU is under load, such as during a game or while running Stable Diffusion.

[58482.425209] NVRM: GPU at PCI:0000:01:00: GPU-443f2425-b23c-51aa-c307-a690275009c6
[58482.425240] NVRM: Xid (PCI:0000:01:00): 79, pid='<unknown>', name=<unknown>, GPU has fallen off the bus.
[58482.425249] NVRM: GPU 0000:01:00.0: GPU has fallen off the bus.
[58482.425253] NVRM: A GPU crash dump has been created. If possible, please run
               NVRM: nvidia-bug-report.sh as root to collect this data before
               NVRM: the NVIDIA kernel module is unloaded.

I have tried:

Modifying the GRUB_CMDLINE_LINUX_DEFAULT variable to disable power management features
Replacing the riser cable
Upgrading the PSU (850W)
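
To confirm that a GRUB change actually reached the running kernel, the parameters can be read back from /proc/cmdline. The sketch below is only an illustration: the helper names (parse_cmdline, power_params) are hypothetical, and pcie_aspm and pcie_port_pm are just examples of real PCIe power management parameters people commonly try, not necessarily the ones changed here. It also splits on whitespace, so quoted values containing spaces are not handled.

```python
def parse_cmdline(cmdline):
    """Split a kernel command line into a {name: value} dict.

    Flags without '=' (for example 'quiet') map to None.
    """
    params = {}
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        params[key] = value if sep else None
    return params


# Example parameter names only (assumption); edit this to match what you set in GRUB.
WATCHED = ("pcie_aspm", "pcie_port_pm")


def power_params(cmdline, watched=WATCHED):
    """Return only the watched power management parameters that are present."""
    parsed = parse_cmdline(cmdline)
    return {name: parsed[name] for name in watched if name in parsed}


if __name__ == "__main__":
    # /proc/cmdline holds the command line the running kernel was booted with.
    with open("/proc/cmdline") as fh:
        print(power_params(fh.read()))
```

An empty result means none of the watched parameters are active, which usually points to update-grub not having been run or to the wrong variable having been edited.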

Specs

NVIDIA GeForce RTX 3060
Driver Version: 535.171.04
CUDA Version: 12.2
Linux desktop 6.5.0-35-generic #35~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Tue May 7 09:00:52 UTC 2 x86_64 x86_64 x86_64 GNU/Linux

Had trouble with the file uploader here, so I uploaded it to pixeldrain
https://pixeldrain.com/u/PNXoXbqu

5hacf6up3 · June 19, 2024, 7:37pm

Updating my BIOS seems to have fixed the issue for me. I haven’t had the “GPU has fallen off the bus” error in a few days.
My motherboard is an Asus ROG STRIX B760-I GAMING WIFI. The BIOS version that was originally installed was from July 2023. I updated to their latest build (May 2024) using the flasher tool built into the BIOS setup.

Hope that helps someone else in the future.
