I have an i5-13600K and an RTX 3060 12 GB on an MSI PRO Z790-P WIFI DDR4 motherboard, running Ubuntu 22.04 LTS with the linux-lowlatency-hwe-22.04 kernel package. It appears that PCIe ASPM power-saving features cause an Xid 79 error in the Nvidia kernel driver, which crashes the GPU with the message “GPU has fallen off the bus”.
I’m using the nvidia-driver-550 package, which is supposed to match the latest officially recommended Nvidia driver version, 550:
$ apt policy nvidia-driver-550
nvidia-driver-550:
  Installed: 550.127.05-0ubuntu0~gpu22.04.1
  Candidate: 550.127.05-0ubuntu0~gpu22.04.1
  Version table:
 *** 550.127.05-0ubuntu0~gpu22.04.1 500
        500 https://ppa.launchpadcontent.net/graphics-drivers/ppa/ubuntu jammy/main amd64 Packages
        100 /var/lib/dpkg/status
The crash appears to be somewhat random, so it’s probably some kind of race condition, which makes it hard to diagnose accurately. It seems to trigger most easily while Google Chrome is rendering video (e.g. YouTube or video conferencing) on an otherwise idle system.
I’m using the lowlatency Linux kernel (PREEMPT_DYNAMIC), which may trigger race conditions more easily than the generic kernel, but correctly implemented kernel drivers should not race in any situation.
Here are the steps to reproduce it using the MSI UEFI BIOS:
- Settings – Advanced – PCIe Sub-system Settings – Native ASPM: Disable
- Settings – Advanced – PCIe Sub-system Settings – PCIe ASPM Settings – PEG 0 ASPM: L0s
- Settings – Advanced – PCIe Sub-system Settings – PCIe ASPM Settings – PEG 1 ASPM: L0s
With these settings, the following kernel error occurs about once every 5 hours. The GPU is effectively disconnected from the system and the display keeps showing the last frame from before the crash. I don’t know any way to recover the display other than restarting the whole system. I can blindly press Alt+SysRq+R to return the keyboard to normal mode, Ctrl+Alt+F1 to switch to the first virtual terminal, and Ctrl+Alt+Delete to trigger a normal reboot.
The kernel error looks like this:
NVRM: GPU at PCI:0000:01:00: GPU-$UUID
NVRM: Xid (PCI:0000:01:00): 79, pid='<unknown>', name=<unknown>, GPU has fallen off the bus.
NVRM: GPU 0000:01:00.0: GPU has fallen off the bus.
NVRM: A GPU crash dump has been created. If possible, please run
NVRM: nvidia-bug-report.sh as root to collect this data before
NVRM: the NVIDIA kernel module is unloaded.
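For anyone trying to confirm the same failure, the Xid code can be pulled out of the kernel log with a small filter. This is only a minimal sketch: xid_codes is a hypothetical helper name, and it assumes the log lines have the same “NVRM: Xid (...): <code>,” shape as the excerpt above.

```shell
# Print the numeric Xid code from every "NVRM: Xid" line read on stdin.
# Lines without an Xid code (e.g. the plain "GPU has fallen off the bus"
# line) produce no output.
xid_codes() {
    sed -n 's/.*NVRM: Xid ([^)]*): \([0-9][0-9]*\),.*/\1/p'
}

# Usage on a live system (reading the kernel log may require root):
#   journalctl -k --no-pager | xid_codes | sort | uniq -c
```

Counting the codes per boot makes it easier to tell whether a BIOS change actually removed the Xid 79 events or only made them rarer.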
It appears that a similar issue was reported on an ASUS board last year, but the latest Nvidia driver is still buggy:
If Nvidia cannot fix this issue in the driver, then at the very minimum the error message for Xid 79 should say “Make sure the system doesn’t have UEFI BIOS controlled ASPM power saving enabled. The only supported configuration is to have OS controlled ASPM enabled or all ASPM power saving features disabled.” or something that matches the actually supported settings. In the MSI BIOS, “Native ASPM: Disable” seems to mean BIOS-controlled power saving and “Native ASPM: Enable” seems to mean OS-controlled power saving. MSI has documented this poorly, so it’s hard to know exactly what the BIOS settings are meant to convey.
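Because the BIOS wording is ambiguous, it may help to check from the running system what ASPM state the GPU link actually ended up in. The following is a minimal sketch, not a verified diagnostic: aspm_state is a hypothetical helper name, 01:00.0 is the GPU address taken from the kernel log above, and the filter assumes the usual “LnkCtl: ASPM ...;” line printed by lspci -vv.

```shell
# Extract the ASPM state from the "LnkCtl:" lines of `lspci -vv` output
# read on stdin, e.g. "L0s Enabled", "L1 Enabled" or "Disabled".
aspm_state() {
    sed -n 's/.*LnkCtl:[[:space:]]*ASPM \([^;]*\);.*/\1/p'
}

# Usage on a live system:
#   # Kernel-side ASPM policy; the active one is shown in [brackets].
#   cat /sys/module/pcie_aspm/parameters/policy
#   # Link state of the GPU (root is needed to read the capability list).
#   sudo lspci -vv -s 01:00.0 | aspm_state
```

If the link reports L0s as enabled even though the OS is not supposed to be managing ASPM, that would be consistent with the BIOS-controlled configuration described above.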