Nvidia driver Xid 79 GPU crash while idling if ASPM L0s is enabled in UEFI BIOS (GPU has fallen off the bus)

I have an i5-13600K and an RTX 3060 12 GB running on an MSI PRO Z790-P WIFI DDR4 motherboard with Ubuntu 22.04 LTS and the linux-lowlatency-hwe-22.04 kernel package. It appears that PCIe power saving features cause an Xid 79 error in the Nvidia kernel driver, which crashes the GPU with a "GPU has fallen off the bus" message.

I’m using the nvidia-driver-550 package which is supposed to match the latest official recommended Nvidia driver version 550:

$ apt policy nvidia-driver-550
nvidia-driver-550:
  Installed: 550.127.05-0ubuntu0~gpu22.04.1
  Candidate: 550.127.05-0ubuntu0~gpu22.04.1
  Version table:
 *** 550.127.05-0ubuntu0~gpu22.04.1 500
        500 https://ppa.launchpadcontent.net/graphics-drivers/ppa/ubuntu jammy/main amd64 Packages
        100 /var/lib/dpkg/status

The crash appears to be somewhat random, so it’s probably some kind of race condition, which obviously makes it hard to diagnose accurately. It seems to trigger most easily while rendering some kind of video in Google Chrome (e.g. YouTube or video conferencing) on an otherwise idle system.

I’m using the lowlatency Linux kernel (PREEMPT_DYNAMIC), which may trigger race conditions more easily than the generic kernel, but correctly implemented kernel drivers will not race in any situation.

Here are the steps to reproduce it using the MSI UEFI BIOS:

  • Settings – Advanced – PCIe Sub-system Settings – Native ASPM: Disable
  • Settings – Advanced – PCIe Sub-system Settings – PCIe ASPM Settings – PEG 0 ASPM: L0s
  • Settings – Advanced – PCIe Sub-system Settings – PCIe ASPM Settings – PEG 1 ASPM: L0s
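To confirm from the running system which ASPM state the link actually ended up in, something like the following can help. This is only a minimal sketch: the helper name aspm_state is made up for illustration, and it assumes the usual "LnkCtl: ASPM ...;" line format printed by lspci -vv and the GPU address 01:00.0 from the log below.

```shell
#!/bin/sh
# Hypothetical helper: read `lspci -vv` output on stdin and print the ASPM
# state from the LnkCtl line, e.g. "L0s Enabled", "L1 Enabled" or "Disabled".
# The LnkCap and LnkCtl2 lines are deliberately not matched.
aspm_state() {
    sed -n 's/^[[:space:]]*LnkCtl:[[:space:]]*ASPM \([^;]*\);.*/\1/p'
}

# Usage on a live system (root is needed for lspci to show capabilities):
#   sudo lspci -vv -s 01:00.0 | aspm_state
# The kernel's own ASPM policy can be read separately with:
#   cat /sys/module/pcie_aspm/parameters/policy
```

With "Native ASPM: Disable" the firmware keeps control, so the kernel policy file may not reflect what is programmed into the link; the LnkCtl line is the more direct indicator.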

With these settings, the following kernel error occurs about once every 5 hours. The GPU is effectively disconnected from the system and the display keeps showing the last frame before the crash. I don’t know any way to recover the display other than restarting the whole system. I can blindly press Alt+SysRq+R to take the keyboard out of raw mode, Ctrl+Alt+F1 to switch to the first virtual terminal, and Ctrl+Alt+Delete to trigger a normal reboot.

The kernel error looks like this:

NVRM: GPU at PCI:0000:01:00: GPU-$UUID
NVRM: Xid (PCI:0000:01:00): 79, pid='<unknown>', name=<unknown>, GPU has fallen off the bus.
NVRM: GPU 0000:01:00.0: GPU has fallen off the bus.
NVRM: A GPU crash dump has been created. If possible, please run
NVRM: nvidia-bug-report.sh as root to collect this data before
NVRM: the NVIDIA kernel module is unloaded.
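Since the display is frozen when this happens, it is easiest to look for the error after the reboot. A minimal sketch of a filter for exactly this Xid, assuming the message format shown above; the function name xid79_lines is made up for illustration, and reading the previous boot requires a persistent systemd journal:

```shell
#!/bin/sh
# Hypothetical helper: print only the kernel log lines on stdin that
# report Xid 79 ("GPU has fallen off the bus") for any PCI address.
xid79_lines() {
    grep -E 'NVRM: Xid \(PCI:[0-9a-fA-F:.]+\): 79,'
}

# Usage after rebooting from a crash (kernel messages of the previous boot):
#   journalctl -k -b -1 | xid79_lines
```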

It appears that a similar issue was reported on an ASUS board last year but the latest Nvidia driver is still buggy:

If Nvidia cannot fix this issue in the driver, then at the very minimum the error message for Xid 79 should say “Make sure the system doesn’t have UEFI BIOS controlled ASPM power saving enabled. The only supported configuration is to have OS controlled ASPM enabled or all ASPM power saving features disabled.” or something that matches the actually supported settings. In the MSI BIOS, “Native ASPM: Disable” seems to mean BIOS controlled power saving and “Native ASPM: Enable” seems to mean OS controlled power saving. MSI has documented this part poorly, so it’s hard to know exactly what the BIOS settings are trying to convey.

Here’s a bug report from nvidia-bug-report.sh after the GPU has already crashed.
nvidia-bug-report.log.gz (809.3 KB)

Looking through it, I cannot see anything worth inspecting other than the actual crash message (repeated below) and potentially the output of /usr/bin/nvidia-debugdump -D, which I cannot read.

I created the bug report by having the following one-liner already running in a root-owned terminal while I was waiting for the system to crash again. The crash happened while my screensaver (blank screen) was active, so the system was totally idle.

while true; do (NAME=$(date +%Y%m%dT%H%M%S); mkdir $NAME; cd $NAME; nvidia-bug-report.sh;); sleep 2m; done

The log also contains numerous warnings from the kernel module build, triggered by -Wmissing-prototypes. Here’s one example:

warning: no previous prototype for ‘create_static_vidmem_mapping’ [-Wmissing-prototypes]
 2313 | NV_STATUS create_static_vidmem_mapping(uvm_gpu_t *gpu)

I didn’t bother to figure out whether the correct fix for the driver source code would be to add the static keyword to these functions or to introduce the prototypes in a header file. None of these warnings appear to be important for the crash, though. Either way, Nvidia should fix the driver source to avoid them.

The crash happened at this timestamp:

Nov 23 01:55:47 desktop kernel: NVRM: GPU at PCI:0000:01:00: GPU-2afd3ceb-4743-4ab6-3a38-5e7657c828a7
Nov 23 01:55:47 desktop kernel: NVRM: Xid (PCI:0000:01:00): 79, pid='<unknown>', name=<unknown>, GPU has fallen off the bus.
Nov 23 01:55:47 desktop kernel: NVRM: GPU 0000:01:00.0: GPU has fallen off the bus.
Nov 23 01:55:47 desktop kernel: NVRM: A GPU crash dump has been created. If possible, please run
                                NVRM: nvidia-bug-report.sh as root to collect this data before
                                NVRM: the NVIDIA kernel module is unloaded.

(Yes, my system hostname is creatively named desktop.) I cannot see any other log messages near the crash that would suggest that anything else but the Nvidia driver had any problems with the PCIe bus.

Update: here’s another bug report from earlier the same night, when the system was still running correctly. Perhaps this helps if you are able to compare the binary data from nvidia-debugdump -D:
nvidia-bug-report.log.gz (936.1 KB)

I’m currently running with the following MSI BIOS settings and the GPU hasn’t (yet?) crashed again:

  • Settings – Advanced – PCIe Sub-system Settings – Native ASPM: Enable
  • Settings – Advanced – PCIe Sub-system Settings – PCIe ASPM Settings – PEG 0 ASPM: Disabled
  • Settings – Advanced – PCIe Sub-system Settings – PCIe ASPM Settings – PEG 1 ASPM: Disabled

Comparing those bug reports, the most interesting part for me is the PCIe state info, where the GPU ends up with a null address in the Capabilities: MSI: Enable field (left side is the state before the crash and right side is the state after the crash):
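A similar before/after comparison can be captured directly with lspci. This is a rough sketch only: the helper name msi_capability and the output file names are made up for illustration, it assumes lspci -vv prints the MSI "Address:" on the line right after the capability header, and it assumes lspci can still read the device at all after the crash.

```shell
#!/bin/sh
# Hypothetical helper: read `lspci -vv` output on stdin and print the MSI
# capability header plus the following line, which normally holds the
# MSI "Address:" and "Data:" fields. MSI-X capabilities are not matched.
# Note: grep -A is a GNU/BSD extension, not strict POSIX.
msi_capability() {
    grep -A1 'Capabilities: \[[0-9a-f]*\] MSI:'
}

# Usage: take one snapshot while the system is healthy and one after the
# crash, then compare the two files.
#   sudo lspci -vv -s 01:00.0 | msi_capability > msi-before.txt
#   sudo lspci -vv -s 01:00.0 | msi_capability > msi-after.txt
#   diff msi-before.txt msi-after.txt
```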

After running my system for an extended time with the L0s power saving mode disabled for my Nvidia GPU, it has been perfectly stable. So I would say that ASPM L1 mode is perfectly okay but the L0s power saving mode does not work correctly.