"GPU has fallen off the bus" while idle, only occurs when all displays powered off

I’ve had this error sporadically under the 470.94 driver. In contrast to previous reports of this error happening while the GPU is heavily loaded, possibly due to power supply problems, I’ve only encountered it while the machine was idle for 6+ hours, and only while both attached monitors were manually switched off. I have not yet had this fault while actively using the system, or with the monitors in an automatic sleep mode, so I wonder whether it might be a DDC issue (e.g. driver crapping itself when the DDC is down for long periods). I just downloaded 470.103 and will see whether that changes anything, although I don’t see any related bugfixes in the release notes.

When the error occurs, the displays stay blank until reboot, and the kernel message log shows the following:

[657106.751771] NVRM: GPU at PCI:0000:28:00: GPU-a60c1060-54c9-b922-7733-c717fdc14af5
[657106.751779] NVRM: Xid (PCI:0000:28:00): 79, pid=1907, GPU has fallen off the bus.
[657106.751786] NVRM: GPU 0000:28:00.0: GPU has fallen off the bus.
[657106.751802] NVRM: GPU 0000:28:00.0: GPU serial number is .
[657106.751822] NVRM: A GPU crash dump has been created. If possible, please run
NVRM: nvidia-bug-report.sh as root to collect this data before
NVRM: the NVIDIA kernel module is unloaded.
[657444.372680] GpuWatchdog[6269]: segfault at 0 ip 00007fc8ad762b5f sp 00007fc8a47bd460 error 6 in libcef.so[7fc8a9528000+6f56000]

You can see from the timestamps that I had plenty of stable uptime before the fault. I can log into the machine via SSH and manually unload the nvidia kernel modules, but when I attempt another modprobe nvidia I get a long stream of

[662558.412133] nvidia-modeset: ERROR: GPU:0: Failed to query display engine channel state: 0x0000987d:0:0:0x0000000f

(repeated many times) and eventually a kernel WARNING in nvidia_dev_put():

[663149.792091] [drm] [nvidia-drm] [GPU ID 0x00002800] Unloading driver
[663149.793553] ------------[ cut here ]------------
[663149.793555] WARNING: CPU: 13 PID: 1227 at /tmp/selfgz7248/NVIDIA-Linux-x86_64-470.94/kernel/nvidia/nv.c:4757 nvidia_dev_put+0x92/0xa0 [nvidia]
[663149.793974] CPU: 13 PID: 1227 Comm: rmmod Tainted: P O 5.15.18 #4
[663149.793980] RIP: 0010:nvidia_dev_put+0x92/0xa0 [nvidia]
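For reference, the unload/reload dance I do over SSH looks roughly like this. This is a sketch, not a supported procedure: the module list and unload order are what I see on my 470.x install (dependents first), and reloading simply reverses that order. Run as root with X stopped.

```shell
#!/bin/sh
# Sketch of the NVIDIA module unload/reload sequence described above.
# Module names/order are from my 470.x system; adjust for yours.

UNLOAD_ORDER="nvidia_drm nvidia_modeset nvidia_uvm nvidia"

# Reverse a whitespace-separated word list (POSIX sh, no arrays needed).
reverse_words() {
    out=""
    for w in $1; do out="$w $out"; done
    printf '%s\n' "${out% }"
}

reload_nvidia() {
    # Dependents first on the way down...
    for m in $UNLOAD_ORDER; do modprobe -r "$m"; done
    # ...and the reverse order on the way back up.
    for m in $(reverse_words "$UNLOAD_ORDER"); do modprobe "$m"; done
}

# Usage (as root, X stopped):  reload_nvidia; dmesg | tail
```

In my case the reload is where the 0x0000987d error stream above starts.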

Full nvidia-bug-report.sh logs attached. You’ll note in the dmesg log that I had booted the kernel with the command-line option rcutree.rcu_idle_gp_delay=1, which was suggested in another forum as a folk remedy for these errors but obviously doesn’t help. My system hangs on boot when I add acpi_osi=! acpi_osi='Linux', another frequently suggested folk remedy.

I don’t recall ever seeing these errors when I was running the 460.x series drivers, although those had other stability problems that 470.x generally seems to improve upon.
nvidia-bug-report.log.gz (1.1 MB)

Updating with what I’ve learned after a week of testing, for the benefit of anyone else who might be having issues:

  • 470.103.01 works the same as 470.94 for me under 5.15.x kernels. However, when I try to build and load the 470.103.01 driver under either 5.16.8 or 5.16.9, I immediately get the “GPU fallen off the bus” error and loss of signal to my monitors at the moment the driver loads. This is obviously under text mode (EFI framebuffer console) and happens while executing the NVIDIA-Linux-x86_64-470.103.01.run script, immediately after successfully building the modules. This behavior was 100% repeatable through multiple reboots on both kernels, and persisted across slight variations in kernel build and command-line parameters as described below. The 470.94 modules can be installed under 5.16.x on the same machines, but are still at risk of “GPU fallen off the bus” after extended idle periods.

The fact that the timing of the failure under 5.16.x/470.103.x is so consistent (after the card has been working fine up to that point), and that it doesn’t occur (immediately) under different kernel or driver combinations, has me fully convinced that this is indeed a driver issue and not just a flaky card, and likely a regression caused by some thread of development in recent drivers and/or kernels. After testing I’m back on a 5.15.18/470.94 combination with over three days of uptime with no problems, although I have not yet turned off the monitors.

Things I tried that did not help:

  • nvidia-smi -pm 1 (still got ‘Fallen off the bus’ errors afterward)
  • Kernel command-line option pcie_aspm=off (still got ‘Fallen off the bus’ errors in this configuration)
  • rcutree.rcu_idle_gp_delay=1 (still got ‘Fallen off the bus’ errors)
  • acpi_osi=! acpi_osi='Linux' (hangs during boot)
  • Building the kernel with or without CONFIG_HOTPLUG_PCI, CONFIG_HOTPLUG_PCI_ACPI, and CONFIG_HOTPLUG_PCI_PCIE (experienced “GPU fallen off the bus” regardless).
  • Building with or without CONFIG_ACPI_FPDT (same).
  • Building with or without CONFIG_PARAVIRT, CONFIG_HYPERV, and CONFIG_XEN (changes some code paths inside the NVIDIA driver, but no apparent changes in behavior).
  • Building with or without CONFIG_PREEMPT (related to the RT preemption work that became more mainstream in 5.15 and subsequent kernels; no reason to think this was the issue, but I wanted to rule it out).
  • Building without CONFIG_X86_PAT (the NVIDIA driver fails to compile against recent kernels in this mode, even though the apparent point of nv_pat.c is to support this case; see separate bug report).
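For anyone repeating these experiments: here's a minimal sketch of how I check which of these options a given kernel was built with. It reads /proc/config.gz when the running kernel exposes it and falls back to /boot/config-$(uname -r); paths vary by distro.

```shell
#!/bin/sh
# Print the state of the kernel config options mentioned above.

OPTS="CONFIG_HOTPLUG_PCI CONFIG_HOTPLUG_PCI_ACPI CONFIG_HOTPLUG_PCI_PCIE \
CONFIG_ACPI_FPDT CONFIG_PREEMPT CONFIG_X86_PAT CONFIG_PCIEASPM CONFIG_PCIE_EDR"

# show_opts: read a kernel config on stdin, print each option's line
# (e.g. "CONFIG_PCIEASPM=y") or a "not set" marker.
show_opts() {
    cfg=$(cat)
    for o in $OPTS; do
        line=$(printf '%s\n' "$cfg" | grep "^$o=")
        printf '%s\n' "${line:-# $o is not set}"
    done
}

if [ -r /proc/config.gz ]; then
    zcat /proc/config.gz | show_opts
elif [ -r "/boot/config-$(uname -r)" ]; then
    show_opts < "/boot/config-$(uname -r)"
else
    echo "no kernel config found" >&2
fi
```

The anchored "^$o=" pattern keeps CONFIG_PCIEASPM from also matching CONFIG_PCIEASPM_DEFAULT and friends.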

Things I still have not proven or disproven, but might be helping:

  • Upgrading from 470.94 to 470.103.01 on 5.15.18 kernel.
  • Swapped which monitor is primary and which is secondary in X (Display and Monitor section of KDE settings) to match which is marked as primary in the Display Configuration section of nvidia-settings.
  • Extended my “[monitors] switch off after” timer from 5 (the default) to 30 minutes in KDE settings (Power Management section; “Screen Energy Saving” is still enabled).
  • Changing the CONFIG_PCIEASPM mode selection: all of my previous failures (and under older NVidia drivers, successes) were with this set to “default”. The 5.15.18 kernel that’s running now was built with “performance” selected, which might prevent the GPU from going into a deep sleep state from which the NVidia driver can’t figure out how to communicate with it, if that’s indeed the problem. I want to leave it running a while longer to prove hardware stability before I try turning monitors off to see whether anything has actually changed. However, I was getting the immediate “fallen off the bus” errors when trying to load the 470.103.01 driver under 5.16.x kernels regardless of whether they were built with “default” or “performance” behavior.
  • Building the kernel with CONFIG_PCIE_EDR (Error Disconnect Recover): same story as CONFIG_PCIEASPM; I never used this before but have a 5.15.18 kernel with it running now, although it didn’t help on 5.16.x.
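As a side note for the ASPM experiments above: on a running kernel, the active ASPM policy (the runtime counterpart of the CONFIG_PCIEASPM_* build choice) is reported in /sys/module/pcie_aspm/parameters/policy, with the active entry in square brackets. A small sketch to extract it:

```shell
#!/bin/sh
# The kernel reports the active ASPM policy in
# /sys/module/pcie_aspm/parameters/policy as a line like:
#   default [performance] powersave powersupersave
# where the bracketed word is the policy in effect.

active_aspm_policy() {
    # Extract the word between the square brackets.
    sed 's/.*\[\(.*\)\].*/\1/' "$1"
}

# Usage:
#   active_aspm_policy /sys/module/pcie_aspm/parameters/policy
```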

Anyway, I’m pretty sure what I’m seeing is ultimately a driver regression or failure to keep up with PCIe/ASPM changes in the kernel, and I hope some of the above notes may be useful to others hitting the same problem. If anyone from the NVidia driver team actually reads this forum, I definitely think it’s worth looking into why the newest “stable” driver can load under 5.15.x but not 5.16.x, which is probably a great clue to the root problem.


Just to close this one out: these problems have gone away after upgrading to driver 470.103.01 and adding the kernel boot options pci=check_enable_amd_mmconf and idle=nomwait. I don’t know which of those three changes made the biggest difference, but I’ve gone over a month without seeing the “GPU has fallen off the bus” errors, even with monitors switched off for more than 24 hours at a time. Hope this helps someone else!
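For anyone wanting to replicate this, the boot options went in via the usual GRUB route. This fragment assumes a GRUB-based distro and the GRUB_CMDLINE_LINUX_DEFAULT variable; the existing contents of that line (elided here) should be kept, and grub.cfg regenerated afterward.

```shell
# /etc/default/grub (fragment) -- append the two options to the existing
# line, then regenerate grub.cfg (e.g. `update-grub` or
# `grub-mkconfig -o /boot/grub/grub.cfg`, depending on distro).
GRUB_CMDLINE_LINUX_DEFAULT="... pci=check_enable_amd_mmconf idle=nomwait"
```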