PCIe errors after enabling ASPM power saving (GPU has fallen off the bus)

Hey!

I recently noticed that ASUS PRIME mainboards ship with PCIe ASPM (Active State Power Management) disabled by default.

I turned on PCIe ASPM in my BIOS settings (setting value: L0sL1), and my PC does indeed use less power.
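
To confirm that the BIOS setting actually reached the link, the ASPM state can be read from the LnkCtl line of `lspci -vv`. The following is a minimal sketch, not a definitive tool: the helper names `parse_aspm_state` and `link_aspm_state` are made up for illustration, and it assumes `lspci` (pciutils) is installed and run with enough privileges to show the PCIe capability block (typically root).

```python
import re
import subprocess


def parse_aspm_state(lspci_text):
    """Return the ASPM state from the LnkCtl line of `lspci -vv` output.

    Returns e.g. "L0s L1 Enabled" or "Disabled", or None if no LnkCtl
    line is present (for example when lspci was run without root).
    """
    match = re.search(r"LnkCtl:\s*ASPM ([^;]+);", lspci_text)
    return match.group(1).strip() if match else None


def link_aspm_state(bdf):
    """Run `lspci -vv -s <bdf>` and parse the ASPM state (hypothetical helper)."""
    result = subprocess.run(
        ["lspci", "-vv", "-s", bdf],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_aspm_state(result.stdout)
```

On the machine from this post, one would call `link_aspm_state("00:01.0")` for the root port and `link_aspm_state("01:00.0")` for the GPU and compare the results before and after changing the BIOS option.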

However, sometimes the picture freezes and I have to reboot my machine. I get the following error messages in syslog:

Jun 14 20:49:15 midna kernel: pcieport 0000:00:01.0: AER: Multiple Corrected error received: 0000:00:01.0
Jun 14 20:49:15 midna kernel: pcieport 0000:00:01.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Transmitter ID)
Jun 14 20:49:15 midna kernel: pcieport 0000:00:01.0:   device [8086:460d] error status/mask=00001000/00002000
Jun 14 20:49:15 midna kernel: pcieport 0000:00:01.0:    [12] Timeout               
Jun 14 20:49:15 midna kernel: pcieport 0000:00:01.0: AER:   Error of this Agent is reported first
Jun 14 20:49:15 midna kernel: nvidia 0000:01:00.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Transmitter ID)
Jun 14 20:49:15 midna kernel: nvidia 0000:01:00.0:   device [10de:2486] error status/mask=00001000/0000a000
Jun 14 20:49:15 midna kernel: nvidia 0000:01:00.0:    [12] Timeout               
Jun 14 20:49:15 midna kernel: snd_hda_intel 0000:01:00.1: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Transmitter ID)
Jun 14 20:49:15 midna kernel: snd_hda_intel 0000:01:00.1:   device [10de:228b] error status/mask=00001000/0000a000
Jun 14 20:49:15 midna kernel: snd_hda_intel 0000:01:00.1:    [12] Timeout               
Jun 14 20:52:08 midna kernel: pcieport 0000:00:01.0: AER: Uncorrected (Non-Fatal) error received: 0000:00:01.0
Jun 14 20:52:08 midna kernel: pcieport 0000:00:01.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID)
Jun 14 20:52:08 midna kernel: pcieport 0000:00:01.0:   device [8086:460d] error status/mask=00100000/00010000
Jun 14 20:52:08 midna kernel: pcieport 0000:00:01.0:    [20] UnsupReq               (First)
Jun 14 20:52:08 midna kernel: pcieport 0000:00:01.0: AER:   TLP Header: 34000000 01000010 00000000 00000000
Jun 14 20:52:08 midna kernel: nvidia 0000:01:00.0: AER: can't recover (no error_detected callback)
Jun 14 20:52:08 midna kernel: snd_hda_intel 0000:01:00.1: AER: can't recover (no error_detected callback)
Jun 14 20:52:08 midna kernel: pcieport 0000:00:01.0: AER: device recovery failed
Jun 14 20:52:23 midna kernel: NVRM: GPU at PCI:0000:01:00: GPU-13311476-4aa3-3cdf-28f1-5ffe801de085
Jun 14 20:52:23 midna kernel: NVRM: Xid (PCI:0000:01:00): 79, pid='<unknown>', name=<unknown>, GPU has fallen off the bus.
Jun 14 20:52:23 midna kernel: NVRM: GPU 0000:01:00.0: GPU has fallen off the bus.
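
For anyone reading the `error status/mask` values in the log above, here is a small sketch that decodes them into bit names. The tables follow the bit layout of the PCIe AER Correctable and Uncorrectable Error Status registers as I understand it; the function name `decode_aer` is only illustrative, and the corrected and uncorrected values need different tables.

```python
# Bit positions in the AER Correctable Error Status/Mask registers.
CORRECTABLE_BITS = {
    0: "Receiver Error",
    6: "Bad TLP",
    7: "Bad DLLP",
    8: "Replay Number Rollover",
    12: "Replay Timer Timeout",
    13: "Advisory Non-Fatal Error",
    14: "Corrected Internal Error",
    15: "Header Log Overflow",
}

# Bit positions in the AER Uncorrectable Error Status/Mask registers.
UNCORRECTABLE_BITS = {
    4: "Data Link Protocol Error",
    5: "Surprise Down",
    12: "Poisoned TLP",
    13: "Flow Control Protocol Error",
    14: "Completion Timeout",
    15: "Completer Abort",
    16: "Unexpected Completion",
    17: "Receiver Overflow",
    18: "Malformed TLP",
    19: "ECRC Error",
    20: "Unsupported Request",
    21: "ACS Violation",
}


def decode_aer(value_hex, table):
    """Return the names of all bits set in a hex status or mask value."""
    value = int(value_hex, 16)
    return [
        table.get(bit, f"bit {bit} (not in table)")
        for bit in range(32)
        if value & (1 << bit)
    ]


# Values taken from the log: corrected status, then uncorrected status.
print(decode_aer("00001000", CORRECTABLE_BITS))
print(decode_aer("00100000", UNCORRECTABLE_BITS))
```

Read this way, the corrected errors at 20:49:15 are replay timer timeouts (the `[12] Timeout` lines), and the uncorrected error at 20:52:08 is an Unsupported Request (`[20] UnsupReq`).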

I searched the forums for the error message, but all I could find were reports where the power supply or cabling turned out to be the issue.

I have two different machines that use the same mainboard model (PRIME Z690-A) but are otherwise different: different PSU, different CPU, different GPU (“Gigabyte GeForce RTX 3060 Ti Vision OC” in one, “MSI GeForce RTX 3060 Ti GAMING X TRIO” in the other).

I observe the issue on both machines once I turn on PCIe ASPM.

Could you take a look and see whether there are any known issues with ASPM power saving on NVIDIA cards under Linux? Any ideas what I could try?

Thanks

In the meantime, I tried booting with pci=nomsi pci=noaer, but these kernel parameters did not help. I’m still getting an error:

Jun 22 18:03:39 midna kernel: NVRM: GPU at PCI:0000:01:00: GPU-13311476-4aa3-3cdf-28f1-5ffe801de085
Jun 22 18:03:39 midna kernel: NVRM: Xid (PCI:0000:01:00): 79, pid='<unknown>', name=<unknown>, GPU has fallen off the bus.
Jun 22 18:03:39 midna kernel: NVRM: GPU 0000:01:00.0: GPU has fallen off the bus.
Jun 22 18:03:39 midna kernel: NVRM: A GPU crash dump has been created. If possible, please run
                              NVRM: nvidia-bug-report.sh as root to collect this data before
                              NVRM: the NVIDIA kernel module is unloaded.
Jun 22 18:04:20 midna kernel: nvidia-modeset: ERROR: GPU:0: Failed to query display engine channel state: 0x0000c67e:6:0:0x0000000f
Jun 22 18:04:20 midna kernel: nvidia-modeset: ERROR: GPU:0: Failed to query display engine channel state: 0x0000c67e:4:0:0x0000000f
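
When testing kernel parameters like these, it can help to verify that they really made it onto the running kernel's command line. A minimal sketch, assuming the usual `/proc/cmdline` interface; the helper names are made up for illustration. It also accepts the comma-separated form (`pci=nomsi,noaer`). As far as I understand, `pci=noaer` only turns off AER handling in the kernel and does not change how the link behaves, which would fit the Xid 79 still showing up in the log above.

```python
def pci_options(cmdline):
    """Collect all options passed via pci=... on a kernel command line."""
    options = set()
    for token in cmdline.split():
        if token.startswith("pci="):
            # pci= accepts a comma-separated list and may appear more than once.
            options.update(token[len("pci="):].split(","))
    return options


def read_cmdline(path="/proc/cmdline"):
    """Read the command line of the running kernel (Linux only)."""
    with open(path) as handle:
        return handle.read()
```

On the affected machine, `pci_options(read_cmdline())` should contain both `nomsi` and `noaer` if the boot loader configuration was applied.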

Edit: I should add that I’m on Linux 6.2.10, using the NVIDIA driver 530.41.03.

I think I’ve hit the same issue with an MSI motherboard:

My current understanding is that at least BIOS-controlled PCIe ASPM is broken with all NVIDIA cards. I haven’t yet tested whether OS-controlled ASPM would work.
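
For testing the OS-controlled case, the kernel exposes its ASPM policy in sysfs, with the active policy shown in square brackets. Below is a minimal sketch for reading it. It assumes a kernel built with ASPM support so that the file exists, and the function names are illustrative only. Whether the kernel is allowed to manage ASPM at all depends on what the firmware hands over, so treat this as a starting point rather than a definitive check.

```python
import re

# Present on kernels built with PCIe ASPM support.
POLICY_PATH = "/sys/module/pcie_aspm/parameters/policy"


def active_policy(text):
    """Return the bracketed (active) entry from the policy file contents."""
    match = re.search(r"\[(\w+)\]", text)
    return match.group(1) if match else None


def read_active_policy(path=POLICY_PATH):
    """Read the active ASPM policy of the running kernel."""
    with open(path) as handle:
        return active_policy(handle.read())
```

Writing one of the listed policy names (for example `powersave`) to the same file as root switches the policy. If the kernel refuses to take control, the `pcie_aspm=force` kernel parameter exists for that situation, though I have not verified that it helps here.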