"GPU has fallen off the bus" while idle, only occurs when all displays powered off

Updating with what I’ve learned after a week of testing, for the benefit of anyone else who might be having issues:

  • 470.103.01 behaves the same as 470.94 for me under 5.15.x kernels. However, when I build and load 470.103.01 under either 5.16.8 or 5.16.9, I immediately get the “GPU has fallen off the bus” error and lose signal to my monitors the moment the driver loads. This is in text mode (EFI framebuffer console), while the NVIDIA-Linux-x86_64-470.103.01.run script is executing, immediately after it has successfully built the modules (sketched below). The behavior was 100% repeatable across multiple reboots on both kernels, and persisted across the slight variations in kernel build options and command-line parameters described below. The 470.94 modules can be installed under 5.16.x on the same machines, but are still at risk of “GPU fallen off the bus” after extended idle periods.
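
For reference, this is roughly how I reproduce the immediate failure; the PCI address and exact Xid line will vary by machine, and the log line below is paraphrased from memory rather than copied verbatim:

    # From a text console (no X running), build and load the driver;
    # the failure hits immediately after the modules build successfully.
    sh NVIDIA-Linux-x86_64-470.103.01.run

    # Approximate kernel log at the moment of failure:
    #   NVRM: Xid (PCI:0000:01:00): 79, GPU has fallen off the bus.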

The fact that the timing of the failure under 5.16.x/470.103.x is so consistent (after the card has been working fine up to that point), and that it doesn’t occur (immediately) under other kernel/driver combinations, has me fully convinced that this is a driver issue rather than a flaky card, and likely a regression from some recent line of development in the drivers and/or kernels. After testing I’m back on a 5.15.18/470.94 combination with over three days of uptime and no problems, although I have not yet turned the monitors off.
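
For anyone else trying to catch the failure in the act, watching the kernel log for the error string is straightforward; a minimal sketch, assuming a systemd journal (plain dmesg --follow works too):

    # Follow the kernel log and flag the error the moment it appears
    journalctl -k -f | grep -i --line-buffered "fallen off the bus"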

Things I tried that did not help:

  • nvidia-smi -pm 1 (still got “fallen off the bus” errors afterward)
  • Kernel command-line option pcie_aspm=off (still got “fallen off the bus” errors in this configuration; the command-line variants are sketched after this list)
  • rcutree.rcu_idle_gp_delay=1 (still got “fallen off the bus” errors)
  • acpi_osi=! acpi_osi='Linux' (hangs during boot)
  • Building the kernel with or without CONFIG_HOTPLUG_PCI, CONFIG_HOTPLUG_PCI_ACPI, and CONFIG_HOTPLUG_PCI_PCIE (experienced “GPU fallen off the bus” regardless).
  • Building with or without CONFIG_ACPI_FPDT (same).
  • Building with or without CONFIG_PARAVIRT, CONFIG_HYPERV, and CONFIG_XEN (these change some code paths inside the NVIDIA driver, but produced no apparent change in behavior).
  • Building with or without CONFIG_PREEMPT (related to the RT preemption work that has been landing in 5.15 and later kernels; no reason to think this was an issue, but I thought I’d rule it out).
  • Trying to build without CONFIG_X86_PAT (the NVIDIA driver fails to compile against recent kernels in this mode, even though the apparent point of nv_pat.c is to support this case; see separate bug report).
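
For concreteness, the command-line variants above were tested roughly like this; the file path and regeneration command assume GRUB, so adapt to your bootloader and distro:

    # /etc/default/grub -- options tried, none of which helped; the
    # acpi_osi pair hung during boot on my machine, so it did not stay.
    GRUB_CMDLINE_LINUX_DEFAULT="pcie_aspm=off rcutree.rcu_idle_gp_delay=1"

    # Then regenerate the config and reboot:
    grub-mkconfig -o /boot/grub/grub.cfg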

Things I still have not proven or disproven, but might be helping:

  • Upgrading from 470.94 to 470.103.01 on the 5.15.18 kernel.
  • Swapping which monitor is primary and which is secondary in X (KDE settings, Display and Monitor section) to match which one is marked primary in the Display Configuration section of nvidia-settings.
  • Extending my “[monitors] switch off after” timer from 5 minutes (the default) to 30 in KDE settings (Power Management section; “Screen Energy Saving” is still enabled).
  • Changing the CONFIG_PCIEASPM policy selection: all of my previous failures (and, under older NVIDIA drivers, successes) were with this set to “default”. The 5.15.18 kernel that’s running now was built with “performance” selected, which might keep the GPU out of a deep sleep state that the NVIDIA driver can’t bring it back from, if that is indeed the problem. I want to leave it running a while longer to prove hardware stability before I try turning the monitors off to see whether anything has actually changed. However, I got the immediate “fallen off the bus” errors when loading 470.103.01 under 5.16.x kernels regardless of whether they were built with the “default” or “performance” policy.
  • Building the kernel with CONFIG_PCIE_EDR (Error Disconnect Recover): same story as CONFIG_PCIEASPM_*. I had never used this before, but the 5.15.18 kernel running now has it enabled (config fragment below), although it didn’t help on 5.16.x.
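
The relevant fragment of the 5.15.18 .config that’s running now looks roughly like this; the option names are from the mainline kernel, so verify them against your own tree:

    # PCIe ASPM policy: "performance" instead of the "default" I used before
    CONFIG_PCIEASPM=y
    CONFIG_PCIEASPM_PERFORMANCE=y
    # CONFIG_PCIEASPM_DEFAULT is not set

    # Error Disconnect Recover, newly enabled on this kernel
    CONFIG_PCIE_EDR=y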

Anyway, I’m fairly sure that what I’m seeing is ultimately a driver regression, or a failure to keep up with PCIe/ASPM changes in the kernel, and I hope some of the notes above are useful to others hitting the same problem. If anyone from the NVIDIA driver team actually reads this forum, I think it’s worth looking into why the newest “stable” driver can load under 5.15.x but not 5.16.x; that is probably a great clue to the root problem.
