Nvidia driver Xid 79 GPU crash while idling if ASPM L0s is enabled in UEFI BIOS (GPU has fallen off the bus)

mikko.rantalainen · November 23, 2024, 11:53am

I have an i5-13600K and an RTX 3060 12 GB running on an MSI PRO Z790-P WIFI DDR4 motherboard with Ubuntu 22.04 LTS and the linux-lowlatency-hwe-22.04 kernel package. It appears that power saving features trigger an Xid 79 error in the Nvidia kernel driver, which crashes the GPU with the message “GPU has fallen off the bus”.

I’m using the nvidia-driver-550 package which is supposed to match the latest official recommended Nvidia driver version 550:

$ apt policy nvidia-driver-550
nvidia-driver-550:
  Installed: 550.127.05-0ubuntu0~gpu22.04.1
  Candidate: 550.127.05-0ubuntu0~gpu22.04.1
  Version table:
 *** 550.127.05-0ubuntu0~gpu22.04.1 500
        500 https://ppa.launchpadcontent.net/graphics-drivers/ppa/ubuntu jammy/main amd64 Packages
        100 /var/lib/dpkg/status

The crash appears to be somewhat random, so it’s probably some kind of race condition, which obviously makes it hard to diagnose accurately. It seems to trigger most easily while rendering some kind of video in Google Chrome (e.g. YouTube or video conferencing) on an otherwise idle system.

I’m using the lowlatency Linux kernel (PREEMPT_DYNAMIC), which may trigger race conditions more easily than the generic kernel, but correctly implemented kernel drivers should not race in any situation.

Here are the steps to reproduce it using the MSI UEFI BIOS:

Settings – Advanced – PCIe Sub-system Settings – Native ASPM: Disable
Settings – Advanced – PCIe Sub-system Settings – PCIe ASPM Settings – PEG 0 ASPM: L0s
Settings – Advanced – PCIe Sub-system Settings – PCIe ASPM Settings – PEG 1 ASPM: L0s

This causes the following kernel error about once every 5 hours. The GPU is effectively disconnected from the system and the display keeps showing the last frame from before the crash. I don’t know of any way to recover the display other than restarting the whole system. I can blindly press Alt+SysRq+R to switch the keyboard back to normal mode, Ctrl+Alt+F1 to switch to the first virtual terminal and Ctrl+Alt+Delete to trigger a normal reboot.

The kernel error looks like this:

NVRM: GPU at PCI:0000:01:00: GPU-$UUID
NVRM: Xid (PCI:0000:01:00): 79, pid='<unknown>', name=<unknown>, GPU has fallen off the bus.
NVRM: GPU 0000:01:00.0: GPU has fallen off the bus.
NVRM: A GPU crash dump has been created. If possible, please run
NVRM: nvidia-bug-report.sh as root to collect this data before
NVRM: the NVIDIA kernel module is unloaded.
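
For anyone who wants to watch for this failure automatically, the Xid number can be pulled out of the log line shown above. The following is a minimal sketch, not something taken from the driver documentation: the function names xid_codes and has_xid79 are made up for illustration, and the pattern assumes the message format stays exactly as quoted above.

```shell
# Print the Xid number from every "NVRM: Xid (...)" line read on stdin.
# Lines that do not match the quoted format are ignored.
xid_codes() {
    sed -n 's/.*NVRM: Xid (PCI:[^)]*): \([0-9][0-9]*\),.*/\1/p'
}

# Succeed (exit status 0) if stdin contains at least one Xid 79 report.
has_xid79() {
    xid_codes | grep -qx '79'
}
```

For example, `journalctl -k -b | has_xid79 && echo "Xid 79 seen in this boot"` would check the kernel log of the current boot.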

It appears that a similar issue was reported on an ASUS board last year, but the latest Nvidia driver is still buggy:

If Nvidia cannot fix this issue in the driver, at the very minimum the error message for Xid 79 should say “Make sure the system doesn’t have UEFI BIOS controlled ASPM power saving enabled. The only supported configuration is to have OS controlled ASPM enabled or all ASPM power saving features disabled.” or something else that matches the actually supported settings. In the MSI BIOS, “Native ASPM: Disable” seems to mean BIOS controlled power saving and “Native ASPM: Enable” seems to mean OS controlled power saving. MSI has documented this part poorly, so it’s hard to know exactly what the BIOS settings are trying to convey.
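
When ASPM is OS controlled, the Linux kernel exposes its ASPM policy in /sys/module/pcie_aspm/parameters/policy, with the active entry shown in square brackets. A minimal sketch for reading it follows. The helper name active_aspm_policy is hypothetical, and the file is only present on kernels built with PCIe ASPM support; how it behaves when the BIOS keeps control of ASPM is not covered here.

```shell
# Print the active policy from text such as
# "[default] performance powersave powersupersave" read on stdin.
# The kernel marks the active policy with square brackets.
active_aspm_policy() {
    sed -n 's/.*\[\([a-z]*\)].*/\1/p'
}
```

On a running system this could be used as `active_aspm_policy < /sys/module/pcie_aspm/parameters/policy`.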

mikko.rantalainen · November 23, 2024, 12:05pm

Here’s a bug report from nvidia-bug-report.sh after the GPU has already crashed.
nvidia-bug-report.log.gz (809.3 KB)

Looking through it, I cannot see anything worth inspecting further other than the actual crash message (repeated below) and potentially the output of /usr/bin/nvidia-debugdump -D, which I cannot read.

I created the bug report by keeping the following one-liner running in a root terminal while I was waiting for the system to crash again. The crash happened while my screensaver (blank screen) was active, so the system was totally idle.

while true; do (NAME=$(date +%Y%m%dT%H%M%S); mkdir "$NAME" && cd "$NAME" && nvidia-bug-report.sh); sleep 2m; done

The log contains numerous warnings from the kernel module build, triggered by -Wmissing-prototypes. Here’s one example:

warning: no previous prototype for ‘create_static_vidmem_mapping’ [-Wmissing-prototypes]
 2313 | NV_STATUS create_static_vidmem_mapping(uvm_gpu_t *gpu)

I didn’t bother to figure out whether the correct fix for the driver source code would be to add the static keyword to these functions or to introduce the prototypes in a header file. None of these warnings appear to be important for the crash, though. Either way, Nvidia should fix the driver source to avoid them.

The crash happened at this timestamp:

Nov 23 01:55:47 desktop kernel: NVRM: GPU at PCI:0000:01:00: GPU-2afd3ceb-4743-4ab6-3a38-5e7657c828a7
Nov 23 01:55:47 desktop kernel: NVRM: Xid (PCI:0000:01:00): 79, pid='<unknown>', name=<unknown>, GPU has fallen off the bus.
Nov 23 01:55:47 desktop kernel: NVRM: GPU 0000:01:00.0: GPU has fallen off the bus.
Nov 23 01:55:47 desktop kernel: NVRM: A GPU crash dump has been created. If possible, please run
                                NVRM: nvidia-bug-report.sh as root to collect this data before
                                NVRM: the NVIDIA kernel module is unloaded.

(Yes, my system hostname is creatively named desktop.) I cannot see any other log messages near the crash that would suggest that anything other than the Nvidia driver had problems with the PCIe bus.

Update: here’s another bug report from earlier the same night, when the system was still running correctly. Perhaps this helps if you are able to compare the data in the binary nvidia-debugdump -D part:
nvidia-bug-report.log.gz (936.1 KB)

mikko.rantalainen · November 23, 2024, 12:07pm

I’m currently running with the following MSI BIOS settings and the GPU hasn’t (yet?) crashed again:

Settings – Advanced – PCIe Sub-system Settings – Native ASPM: Enable
Settings – Advanced – PCIe Sub-system Settings – PCIe ASPM Settings – PEG 0 ASPM: Disabled
Settings – Advanced – PCIe Sub-system Settings – PCIe ASPM Settings – PEG 1 ASPM: Disabled
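
Independently of how the BIOS labels its options, the ASPM state that is actually in effect on the GPU link can be read from the LnkCtl line of lspci -vv. A minimal sketch, assuming the GPU sits at 01:00.0 as in the Xid log and that lspci prints the link control line in its usual "LnkCtl: ASPM ...;" form. The helper name aspm_state is hypothetical.

```shell
# Extract the ASPM state from `lspci -vv` output read on stdin.
# It looks at the first LnkCtl line, for example:
#   LnkCtl: ASPM L0s Enabled; RCB 64 bytes, Disabled- CommClk+
# and prints the part between "ASPM " and the first semicolon.
aspm_state() {
    sed -n 's/^[[:space:]]*LnkCtl:[[:space:]]*ASPM \([^;]*\);.*/\1/p' | head -n 1
}
```

On the affected machine this could be run as `sudo lspci -vv -s 01:00.0 | aspm_state`. Root is typically needed for lspci to show the full capability list.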

mikko.rantalainen · November 23, 2024, 6:10pm

Comparing those bug reports, the most interesting part for me is the PCIe state info, where the GPU ends up with a null address in the Capabilities: MSI: Enable field (the left side is the state before the crash and the right side is the state after the crash):
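
One way to pull that field out of two saved lspci -vv dumps for such a comparison is sketched below. This is only an illustration: the helper name msi_address is hypothetical, and it assumes the usual lspci layout in which the "MSI: Enable" capability line is followed by an "Address: ... Data: ..." line.

```shell
# Print the MSI address from `lspci -vv` output read on stdin.
# After a "MSI: Enable" capability line, the next "Address:" line
# carries the address as its second whitespace-separated field.
# MSI-X capability lines do not match the "MSI: Enable" pattern.
msi_address() {
    awk '/MSI: Enable/ { want = 1; next }
         want && /Address:/ { print $2; want = 0 }'
}
```

For example, saving `sudo lspci -vv -s 01:00.0` to a file while the system is healthy and again after a crash, then running `msi_address` on each file, would show whether the address has changed to all zeros.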

mikko.rantalainen · December 22, 2024, 12:14pm

After running my system for an extended time with the L0s power saving mode disabled for my Nvidia GPU, it has been perfectly stable. So I would say that ASPM L1 mode is perfectly okay but the L0s power saving mode does not work correctly.

masteraster · April 29, 2025, 7:49am

For future readers, here is another post that reveals a similar root cause on an Intel NUC9.
GPU has fallen off the bus… Requires your serious attention
