PCIe errors after enabling ASPM power saving (GPU has fallen off the bus)

michael_nvidia · June 21, 2023, 7:21pm

Hey!

I recently noticed that the ASUS PRIME mainboards ship with PCI ASPM (Active State Power Management) disabled by default.

I turned on PCI ASPM in my BIOS settings (setting value: L0sL1) and indeed my PC uses less power.
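
One way to confirm that the BIOS setting actually reached the GPU link is to look at the `LnkCtl` line in `lspci -vv` output. A minimal sketch, assuming the line has the shape `LnkCtl: ASPM L0s L1 Enabled; ...` as printed by pciutils; the helper name `aspm_state` and the sample strings are illustrative and not taken from this thread:

```python
import re

# Matches the ASPM field of an "lspci -vv" LnkCtl line, for example:
#   LnkCtl: ASPM L0s L1 Enabled; RCB 64 bytes, Disabled- CommClk+
LNKCTL_RE = re.compile(r"LnkCtl:\s*ASPM\s+([^;]+);")


def aspm_state(lspci_text):
    """Return the ASPM state reported in lspci -vv output, or None if absent."""
    match = LNKCTL_RE.search(lspci_text)
    return match.group(1).strip() if match else None


# Illustrative sample text; on a real machine, feed in the output of
# "lspci -vv -s 01:00.0" (run as root so the capabilities are readable).
sample = "\t\tLnkCtl:\tASPM L0s L1 Enabled; RCB 64 bytes, Disabled- CommClk+"
print(aspm_state(sample))
```

If the function returns `Disabled` for the GPU while the BIOS option is on, the kernel or firmware has probably overridden the setting for that link.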

However, sometimes the picture freezes and I have to reboot my machine. I get the following error messages in syslog:

Jun 14 20:49:15 midna kernel: pcieport 0000:00:01.0: AER: Multiple Corrected error received: 0000:00:01.0
Jun 14 20:49:15 midna kernel: pcieport 0000:00:01.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Transmitter ID)
Jun 14 20:49:15 midna kernel: pcieport 0000:00:01.0:   device [8086:460d] error status/mask=00001000/00002000
Jun 14 20:49:15 midna kernel: pcieport 0000:00:01.0:    [12] Timeout               
Jun 14 20:49:15 midna kernel: pcieport 0000:00:01.0: AER:   Error of this Agent is reported first
Jun 14 20:49:15 midna kernel: nvidia 0000:01:00.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Transmitter ID)
Jun 14 20:49:15 midna kernel: nvidia 0000:01:00.0:   device [10de:2486] error status/mask=00001000/0000a000
Jun 14 20:49:15 midna kernel: nvidia 0000:01:00.0:    [12] Timeout               
Jun 14 20:49:15 midna kernel: snd_hda_intel 0000:01:00.1: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Transmitter ID)
Jun 14 20:49:15 midna kernel: snd_hda_intel 0000:01:00.1:   device [10de:228b] error status/mask=00001000/0000a000
Jun 14 20:49:15 midna kernel: snd_hda_intel 0000:01:00.1:    [12] Timeout               
Jun 14 20:52:08 midna kernel: pcieport 0000:00:01.0: AER: Uncorrected (Non-Fatal) error received: 0000:00:01.0
Jun 14 20:52:08 midna kernel: pcieport 0000:00:01.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID)
Jun 14 20:52:08 midna kernel: pcieport 0000:00:01.0:   device [8086:460d] error status/mask=00100000/00010000
Jun 14 20:52:08 midna kernel: pcieport 0000:00:01.0:    [20] UnsupReq               (First)
Jun 14 20:52:08 midna kernel: pcieport 0000:00:01.0: AER:   TLP Header: 34000000 01000010 00000000 00000000
Jun 14 20:52:08 midna kernel: nvidia 0000:01:00.0: AER: can't recover (no error_detected callback)
Jun 14 20:52:08 midna kernel: snd_hda_intel 0000:01:00.1: AER: can't recover (no error_detected callback)
Jun 14 20:52:08 midna kernel: pcieport 0000:00:01.0: AER: device recovery failed
Jun 14 20:52:23 midna kernel: NVRM: GPU at PCI:0000:01:00: GPU-13311476-4aa3-3cdf-28f1-5ffe801de085
Jun 14 20:52:23 midna kernel: NVRM: Xid (PCI:0000:01:00): 79, pid='<unknown>', name=<unknown>, GPU has fallen off the bus.
Jun 14 20:52:23 midna kernel: NVRM: GPU 0000:01:00.0: GPU has fallen off the bus.
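
For anyone comparing logs like the above across machines, the `error status/mask=` lines can be decoded mechanically. A minimal sketch, assuming the kernel prints the line in exactly the format shown in this excerpt; the function names are made up for illustration, and the bit meanings are only what the log itself prints (`[12] Timeout`, `[20] UnsupReq`):

```python
import re

# Matches AER lines such as:
#   pcieport 0000:00:01.0:   device [8086:460d] error status/mask=00001000/00002000
STATUS_RE = re.compile(
    r"(\S+) ([0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]):\s+"
    r"device \[([0-9a-f]{4}):([0-9a-f]{4})\] "
    r"error status/mask=([0-9a-f]{8})/([0-9a-f]{8})"
)


def set_bits(value):
    """Return the positions of all bits set in a 32-bit register value."""
    return [bit for bit in range(32) if (value >> bit) & 1]


def parse_aer_status(line):
    """Decode one 'error status/mask' log line into a dict, or return None."""
    match = STATUS_RE.search(line)
    if match is None:
        return None
    driver, bdf, vendor, device, status, mask = match.groups()
    status_value = int(status, 16)
    mask_value = int(mask, 16)
    return {
        "driver": driver,
        "bdf": bdf,
        "vendor_device": f"{vendor}:{device}",
        "status_bits": set_bits(status_value),
        "mask_bits": set_bits(mask_value),
        # Bits that are raised in the status register and not masked.
        "unmasked_bits": set_bits(status_value & ~mask_value),
    }


line = ("Jun 14 20:49:15 midna kernel: pcieport 0000:00:01.0:   "
        "device [8086:460d] error status/mask=00001000/00002000")
print(parse_aer_status(line))
```

Run against the excerpt, this shows bit 12 raised at 20:49:15 on the root port, the GPU and its audio function alike, and bit 20 raised on the root port at 20:52:08, matching the bracketed indices the kernel prints on the following lines.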

I searched the forums for the error message, but all I could find were reports where the power supply or cabling turned out to be the issue.

I have two different machines that use the same mainboard model (PRIME Z690-A) but are otherwise different: different PSU, different CPU, different GPU (“Gigabyte GeForce RTX 3060 Ti Vision OC” in one, “MSI GeForce RTX 3060 Ti GAMING X TRIO” in the other).

I observe the issue in both machines once I turn on PCI ASPM.

Could you take a look and see if there are any known power-saving issues with NVIDIA cards on Linux? Any ideas what I could try?

Thanks

michael_nvidia · June 22, 2023, 5:31pm

In the meantime, I tried booting with pci=nomsi pci=noaer, but these kernel parameters did not help. I’m still getting an error:

Jun 22 18:03:39 midna kernel: NVRM: GPU at PCI:0000:01:00: GPU-13311476-4aa3-3cdf-28f1-5ffe801de085
Jun 22 18:03:39 midna kernel: NVRM: Xid (PCI:0000:01:00): 79, pid='<unknown>', name=<unknown>, GPU has fallen off the bus.
Jun 22 18:03:39 midna kernel: NVRM: GPU 0000:01:00.0: GPU has fallen off the bus.
Jun 22 18:03:39 midna kernel: NVRM: A GPU crash dump has been created. If possible, please run
                              NVRM: nvidia-bug-report.sh as root to collect this data before
                              NVRM: the NVIDIA kernel module is unloaded.
Jun 22 18:04:20 midna kernel: nvidia-modeset: ERROR: GPU:0: Failed to query display engine channel state: 0x0000c67e:6:0:0x0000000f
Jun 22 18:04:20 midna kernel: nvidia-modeset: ERROR: GPU:0: Failed to query display engine channel state: 0x0000c67e:4:0:0x0000000f

Edit: I should add that I’m on Linux 6.2.10, using NVIDIA driver 530.41.03.
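
When testing kernel parameters such as the `pci=nomsi pci=noaer` pair mentioned above, it can help to verify that the running kernel actually received them. A minimal sketch that extracts every `pci=` option from a kernel command line; the helper name `pci_options` is hypothetical, and on a live system the input would come from `/proc/cmdline`:

```python
def pci_options(cmdline):
    """Collect every option passed via pci=... on a kernel command line."""
    options = []
    for token in cmdline.split():
        if token.startswith("pci="):
            # A single pci= argument may carry several comma-separated options.
            options.extend(o for o in token[len("pci="):].split(",") if o)
    return options


# Illustrative command line; on a real machine use:
#   with open("/proc/cmdline") as f:
#       print(pci_options(f.read()))
example = "BOOT_IMAGE=/vmlinuz root=/dev/sda1 pci=nomsi pci=noaer quiet"
print(pci_options(example))
```

If the list comes back empty, the bootloader configuration was not applied and the test run says nothing about whether the parameters help.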

mikko.rantalainen · November 23, 2024, 12:24pm

I think I’ve hit the same issue with an MSI motherboard:

My current understanding is that at least BIOS-controlled PCIe ASPM is broken with all NVIDIA cards. I haven’t yet tested whether OS-controlled ASPM works.
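
For testing the OS-controlled case, the kernel exposes its ASPM policy in `/sys/module/pcie_aspm/parameters/policy`, with the active policy shown in square brackets. A minimal sketch that reads the active policy out of that text; the helper name `active_aspm_policy` and the sample string are illustrative:

```python
def active_aspm_policy(policy_text):
    """Return the bracketed (active) entry from the pcie_aspm policy file."""
    for word in policy_text.split():
        if word.startswith("[") and word.endswith("]"):
            return word[1:-1]
    return None


# Illustrative contents; on a real machine use:
#   with open("/sys/module/pcie_aspm/parameters/policy") as f:
#       print(active_aspm_policy(f.read()))
sample = "default performance [powersave] powersupersave\n"
print(active_aspm_policy(sample))
```

Recording this value alongside the BIOS setting for each test run makes it easier to tell which of the two was in control when a crash happened.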
