RTX4090 - GPU fans to max and "GPU has fallen off the bus"

I have a System76 Mira R3 machine with an RTX4090 GPU. It has been working great for about 3 weeks since I got it, but all of a sudden a couple of days ago it started having problems. The GPU fan will spin up to what sounds like max speed and the GPU will stop responding. nvidia-smi just gives an error “Unable to determine the device handle for GPU0000:01:00.0: Unknown Error” and I see this in the system log:

[Fri Jul  7 14:06:55 2023] NVRM: GPU at PCI:0000:01:00: GPU-a8a22861-ba33-c1b6-e2f7-ec993989ad48
[Fri Jul  7 14:06:55 2023] NVRM: Xid (PCI:0000:01:00): 79, pid='<unknown>', name=<unknown>, GPU has fallen off the bus.
[Fri Jul  7 14:06:55 2023] NVRM: GPU 0000:01:00.0: GPU has fallen off the bus.
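
In case it helps anyone else catch this, here is roughly how I watch for these events; it’s just a sketch assuming a standard systemd setup, and the PCI address in the log above is specific to my machine:

# Follow kernel messages live and flag NVRM/Xid lines
sudo journalctl -k -f | grep -E 'NVRM|Xid'

# Or search the current boot's kernel log after the fact
sudo journalctl -k -b | grep -i 'fallen off the bus'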

Everything else keeps working while this is going on, but when I try to shut down, the system hangs.

I’m attaching the log output from nvidia-bug-report.sh.

This is a Pop!_OS 22.04 Linux system that I bought about a month ago. I am using the 4090 for machine learning model training. There are no monitors connected to the 4090. I also have a GT 1030 in the system driving a second monitor, and it’s having no problems. I haven’t been able to track down anything that triggers it: it has happened while I’m training a model, while I’m just reading email, and even when the system is sitting idle. Is there anything I can do to fix this, or is the GPU itself bad and in need of replacement?

nvidia-bug-report.log.gz (254.7 KB)

Hi, I have the exact same issue (but with a 5070 Ti). Did you manage to find the cause?

Hi @kincaid.dave and @regunakyle ,
Thanks for reporting this issue.
Could you please attach a bug report generated with the latest r580 driver?

Hi @vanditd, this has also been happening to me for the past month, but with dual RTX 5060 Ti 16 GB GPUs. Everything was working perfectly prior to mid-October, and even rolling back drivers and kernels hasn’t resolved it.

Sometimes it happens immediately upon loading the NVIDIA drivers (by running nvidia-smi) from a cold boot; other times it happens when initiating a CUDA workload. It has never happened to the GPU in the first PCIe slot; it only seems to happen to the GPU in the second slot. I’ve bought five RTX 5060 Ti 16 GB GPUs (1x Gigabyte, 4x PNY), and regardless of which card is in the second slot, it’s always the one in the second slot that falls off the bus. When any of the GPUs is installed by itself, it works as expected. I’ve re-seated the GPU many, many times; same behavior.

On some boots (very, very rare) the second GPU doesn’t even show up in the lspci listing.
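
For reference, this is roughly how I check whether the second card enumerated at all and what its link state looks like; the bus address below is just a placeholder, since it will differ per system:

# List every NVIDIA device the kernel can see (10de is NVIDIA's PCI vendor ID)
lspci -d 10de:

# Inspect the link width/speed of a specific device; replace 02:00.0 with the second slot's address
sudo lspci -vvv -s 02:00.0 | grep -i lnk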

I have replaced the original 750 W PSU with an 850 W PSU and installed the latest BIOS for the motherboard. The reason I have five GPUs is that I have two identical dual-GPU builds for high availability; if one needs to go down, the VMs/containers can be migrated to the backup. I’m experiencing the same behavior on both systems.

I’ve tried disabling ASPM in the BIOS and re-enabling it. I’ve turned auto-negotiation of the PCIe generation on and off. CSM is disabled. I’ve tried kernel command-line parameters to no avail. I tried the new CUDA_DISABLE_PERF_BOOST environment variable, again to no avail. I can’t configure power usage via nvidia-smi because running that tool is most often what triggers the GPU falling off the bus.
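
For completeness, this is the sort of power-limit command I would run if the card stayed on the bus long enough to accept it; the GPU index and the 150 W cap are just example placeholders:

# Query the supported power-limit range first
nvidia-smi -i 1 -q -d POWER

# Then cap the board power (must be within the supported range; needs root)
sudo nvidia-smi -i 1 -pl 150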

I’ve tried the NVIDIA drivers from 580.82.09 up to the latest 580.105.08; identical behavior. I’ve tried this on Debian 12 and 13 systems (specifically Proxmox 8.4 and 9.0) running Linux kernels 6.8.12 and 6.14.11, respectively.

I’m not sure how valuable the bug report will be, as the problematic GPU is not reachable for communication, but I’ve attached it regardless.

nvidia-bug-report.log.gz (457.6 KB)