"GPU has fallen off the bus" while idle, only occurs when all displays powered off

Updating with what I’ve learned after a week of testing, for the benefit of anyone else who might be having issues:

  • 470.103.01 behaves the same as 470.94 for me under 5.15.x kernels. However, when I build and load 470.103.01 under either 5.16.8 or 5.16.9, I immediately get the “GPU has fallen off the bus” error and lose signal to my monitors the moment the driver loads. This is in text mode (EFI framebuffer console), while the NVIDIA-Linux-x86_64-470.103.01.run script is executing, immediately after it has successfully built the modules (sketched below). The behavior was 100% repeatable across multiple reboots on both kernels, and persisted across the slight variations in kernel build options and command-line parameters described below. The 470.94 modules can be installed under 5.16.x on the same machines, but are still at risk of “GPU fallen off the bus” after extended idle periods.
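
For reference, this is roughly how I reproduce the immediate failure; the PCI address and exact Xid line will vary by machine, and the log line below is paraphrased from memory rather than copied verbatim:

    # From a text console (no X running), build and load the driver;
    # the failure hits immediately after the modules build successfully.
    sh NVIDIA-Linux-x86_64-470.103.01.run

    # Approximate kernel log at the moment of failure:
    #   NVRM: Xid (PCI:0000:01:00): 79, GPU has fallen off the bus.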

The fact that the timing of the failure under 5.16.x/470.103.x is so consistent (after the card has been working fine up to that point), and that it doesn’t occur (immediately) under other kernel/driver combinations, has me fully convinced that this is a driver issue rather than a flaky card, and likely a regression from some recent line of development in the drivers and/or kernels. After testing I’m back on a 5.15.18/470.94 combination with over three days of uptime and no problems, although I have not yet turned the monitors off.
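
For anyone else trying to catch the failure in the act, watching the kernel log for the error string is straightforward; a minimal sketch, assuming a systemd journal (plain dmesg --follow works too):

    # Follow the kernel log and flag the error the moment it appears
    journalctl -k -f | grep -i --line-buffered "fallen off the bus"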

Things I tried that did not help:

  • nvidia-smi -pm 1 (still got “fallen off the bus” errors afterward)
  • Kernel command-line option pcie_aspm=off (still got “fallen off the bus” errors in this configuration; the command-line variants are sketched after this list)
  • rcutree.rcu_idle_gp_delay=1 (still got “fallen off the bus” errors)
  • acpi_osi=! acpi_osi='Linux' (hangs during boot)
  • Building the kernel with or without CONFIG_HOTPLUG_PCI, CONFIG_HOTPLUG_PCI_ACPI, and CONFIG_HOTPLUG_PCI_PCIE (experienced “GPU fallen off the bus” regardless).
  • Building with or without CONFIG_ACPI_FPDT (same).
  • Building with or without CONFIG_PARAVIRT, CONFIG_HYPERV, and CONFIG_XEN (these change some code paths inside the NVIDIA driver, but produced no apparent change in behavior).
  • Building with or without CONFIG_PREEMPT (related to the RT preemption work that has been landing in 5.15 and later kernels; no reason to think this was an issue, but I thought I’d rule it out).
  • Trying to build without CONFIG_X86_PAT (the NVIDIA driver fails to compile against recent kernels in this mode, even though the apparent point of nv_pat.c is to support this case; see separate bug report).
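
For concreteness, the command-line variants above were tested roughly like this; the file path and regeneration command assume GRUB, so adapt to your bootloader and distro:

    # /etc/default/grub -- options tried, none of which helped; the
    # acpi_osi pair hung during boot on my machine, so it did not stay.
    GRUB_CMDLINE_LINUX_DEFAULT="pcie_aspm=off rcutree.rcu_idle_gp_delay=1"

    # Then regenerate the config and reboot:
    grub-mkconfig -o /boot/grub/grub.cfg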

Things I still have not proven or disproven, but might be helping:

  • Upgrading from 470.94 to 470.103.01 on the 5.15.18 kernel.
  • Swapping which monitor is primary and which is secondary in X (KDE settings, Display and Monitor section) to match which one is marked primary in the Display Configuration section of nvidia-settings.
  • Extending my “[monitors] switch off after” timer from 5 minutes (the default) to 30 in KDE settings (Power Management section; “Screen Energy Saving” is still enabled).
  • Changing the CONFIG_PCIEASPM policy selection: all of my previous failures (and, under older NVIDIA drivers, successes) were with this set to “default”. The 5.15.18 kernel that’s running now was built with “performance” selected, which might keep the GPU out of a deep sleep state that the NVIDIA driver can’t bring it back from, if that is indeed the problem. I want to leave it running a while longer to prove hardware stability before I try turning the monitors off to see whether anything has actually changed. However, I got the immediate “fallen off the bus” errors when loading 470.103.01 under 5.16.x kernels regardless of whether they were built with the “default” or “performance” policy.
  • Building the kernel with CONFIG_PCIE_EDR (Error Disconnect Recover): same story as CONFIG_PCIEASPM_*. I had never used this before, but the 5.15.18 kernel running now has it enabled (config fragment below), although it didn’t help on 5.16.x.
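
The relevant fragment of the 5.15.18 .config that’s running now looks roughly like this; the option names are from the mainline kernel, so verify them against your own tree:

    # PCIe ASPM policy: "performance" instead of the "default" I used before
    CONFIG_PCIEASPM=y
    CONFIG_PCIEASPM_PERFORMANCE=y
    # CONFIG_PCIEASPM_DEFAULT is not set

    # Error Disconnect Recover, newly enabled on this kernel
    CONFIG_PCIE_EDR=y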

Anyway, I’m fairly sure that what I’m seeing is ultimately a driver regression, or a failure to keep up with PCIe/ASPM changes in the kernel, and I hope some of the notes above are useful to others hitting the same problem. If anyone from the NVIDIA driver team actually reads this forum, I think it’s worth looking into why the newest “stable” driver can load under 5.15.x but not 5.16.x; that is probably a great clue to the root problem.
