Kernel: pcieport 0000:00:01.0: AER: device recovery failed, with RTX 4090 only

Since I received two RTX 4090s, both make my system freeze completely between 1 and 3 minutes after each boot, no matter what I’m doing (idling on the desktop or trying a game).

Quick summary:

Here is my everyday setup without any freeze:

System:    Kernel: 5.15.0-52-generic x86_64 bits: 64 compiler: N/A Desktop: Cinnamon 5.2.7 
           wm: muffin dm: LightDM Distro: Linux Mint 20.3 Una base: Ubuntu 20.04 focal 
Machine:   Type: Desktop System: ASUS product: N/A v: N/A serial: <filter> 
           Mobo: ASUSTeK model: ROG MAXIMUS Z690 HERO v: Rev 1.xx serial: <filter> 
           UEFI: American Megatrends v: 1304 date: 03/07/2022 
Battery:   Device-1: hid-80:4a:14:6d:aa:d5-battery model: Magic Keyboard with Numeric Keypad 
           serial: N/A charge: N/A status: Discharging 
CPU:       Topology: 12-Core model: 12th Gen Intel Core i9-12900K bits: 64 type: MT MCP arch: N/A 
           L2 cache: 30.0 MiB 
           flags: avx avx2 lm nx pae sse sse2 sse3 sse4_1 sse4_2 ssse3 vmx bogomips: 152985 
           Speed: 700 MHz min/max: 800/5200 MHz Core speeds (MHz): 1: 700 2: 701 3: 700 4: 700 
           5: 786 6: 800 7: 700 8: 701 9: 800 10: 800 11: 800 12: 800 13: 700 14: 800 15: 800 
           16: 800 17: 800 18: 700 19: 700 20: 700 21: 700 22: 700 23: 798 24: 800 
Graphics:  Device-1: NVIDIA vendor: ASUSTeK driver: nvidia v: 520.56.06 bus ID: 01:00.0 
           chip ID: 10de:2206 
           Display: x11 server: X.Org 1.20.13 driver: nvidia tty: N/A 
           OpenGL: renderer: NVIDIA GeForce RTX 3080/PCIe/SSE2 v: 4.6.0 NVIDIA 520.56.06 
           direct render: Yes

sudo lspci:

00:00.0 Host bridge: Intel Corporation Device 4660 (rev 02)
00:01.0 PCI bridge: Intel Corporation Device 460d (rev 02)
00:06.0 PCI bridge: Intel Corporation Device 464d (rev 02)
00:0a.0 Signal processing controller: Intel Corporation Device 467d (rev 01)
00:0e.0 RAID bus controller: Intel Corporation Volume Management Device NVMe RAID Controller
00:14.0 USB controller: Intel Corporation Device 7ae0 (rev 11)
00:14.2 RAM memory: Intel Corporation Device 7aa7 (rev 11)
00:14.3 Network controller: Intel Corporation Device 7af0 (rev 11)
00:15.0 Serial bus controller [0c80]: Intel Corporation Device 7acc (rev 11)
00:15.1 Serial bus controller [0c80]: Intel Corporation Device 7acd (rev 11)
00:15.2 Serial bus controller [0c80]: Intel Corporation Device 7ace (rev 11)
00:16.0 Communication controller: Intel Corporation Device 7ae8 (rev 11)
00:17.0 SATA controller: Intel Corporation Device 7ae2 (rev 11)
00:1b.0 PCI bridge: Intel Corporation Device 7ac0 (rev 11)
00:1c.0 PCI bridge: Intel Corporation Device 7ab8 (rev 11)
00:1c.1 PCI bridge: Intel Corporation Device 7ab9 (rev 11)
00:1c.3 PCI bridge: Intel Corporation Device 7abb (rev 11)
00:1c.4 PCI bridge: Intel Corporation Device 7abc (rev 11)
00:1d.0 PCI bridge: Intel Corporation Device 7ab0 (rev 11)
00:1f.0 ISA bridge: Intel Corporation Device 7a84 (rev 11)
00:1f.3 Audio device: Intel Corporation Device 7ad0 (rev 11)
00:1f.4 SMBus: Intel Corporation Device 7aa3 (rev 11)
00:1f.5 Serial bus controller [0c80]: Intel Corporation Device 7aa4 (rev 11)
01:00.0 VGA compatible controller: NVIDIA Corporation Device 2206 (rev a1)
01:00.1 Audio device: NVIDIA Corporation Device 1aef (rev a1)
02:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd Device a80a
05:00.0 SATA controller: ASMedia Technology Inc. ASM1062 Serial ATA Controller (rev 02)
06:00.0 Ethernet controller: Intel Corporation Device 15f3 (rev 03)
07:00.0 PCI bridge: Intel Corporation Device 1136 (rev 02)
08:00.0 PCI bridge: Intel Corporation Device 1136 (rev 02)
08:01.0 PCI bridge: Intel Corporation Device 1136 (rev 02)
08:02.0 PCI bridge: Intel Corporation Device 1136 (rev 02)
08:03.0 PCI bridge: Intel Corporation Device 1136 (rev 02)
09:00.0 USB controller: Intel Corporation Device 1137
3d:00.0 USB controller: Intel Corporation Device 1138

sudo lspci -tv

-[0000:00]-+-00.0  Intel Corporation Device 4660
           +-01.0-[01]--+-00.0  NVIDIA Corporation Device 2206
           |            \-00.1  NVIDIA Corporation Device 1aef
           +-06.0-[02]----00.0  Samsung Electronics Co Ltd Device a80a
           +-0a.0  Intel Corporation Device 467d
           +-0e.0  Intel Corporation Volume Management Device NVMe RAID Controller
           +-14.0  Intel Corporation Device 7ae0
           +-14.2  Intel Corporation Device 7aa7
           +-14.3  Intel Corporation Device 7af0
           +-15.0  Intel Corporation Device 7acc
           +-15.1  Intel Corporation Device 7acd
           +-15.2  Intel Corporation Device 7ace
           +-16.0  Intel Corporation Device 7ae8
           +-17.0  Intel Corporation Device 7ae2
           +-1b.0-[03]--
           +-1c.0-[04]--
           +-1c.1-[05]----00.0  ASMedia Technology Inc. ASM1062 Serial ATA Controller
           +-1c.3-[06]----00.0  Intel Corporation Device 15f3
           +-1c.4-[07-70]----00.0-[08-70]--+-00.0-[09]----00.0  Intel Corporation Device 1137
           |                               +-01.0-[0a-3c]--
           |                               +-02.0-[3d]----00.0  Intel Corporation Device 1138
           |                               \-03.0-[3e-70]--
           +-1d.0-[71]--
           +-1f.0  Intel Corporation Device 7a84
           +-1f.3  Intel Corporation Device 7ad0
           +-1f.4  Intel Corporation Device 7aa3
           \-1f.5  Intel Corporation Device 7aa4

As soon as I swap my 3080 for one of the RTX 4090s, the freeze occurs straight away. Here is the log:

Oct 18 20:26:07 linuxmint kernel: NVRM: GPU at PCI:0000:01:00: GPU-bebd8b3c-a7ea-a03e-ee01-4c6d74304ee3
Oct 18 20:26:07 linuxmint kernel: NVRM: Xid (PCI:0000:01:00): 79, pid='<unknown>', name=<unknown>, GPU has fallen off the bus.
Oct 18 20:26:07 linuxmint kernel: NVRM: GPU 0000:01:00.0: GPU has fallen off the bus.
Oct 18 20:26:07 linuxmint kernel: pcieport 0000:00:01.0: AER: Uncorrected (Non-Fatal) error received: 0000:00:01.0
Oct 18 20:26:07 linuxmint kernel: pcieport 0000:00:01.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID)
Oct 18 20:26:07 linuxmint kernel: pcieport 0000:00:01.0:   device [8086:460d] error status/mask=00100000/00010000
Oct 18 20:26:07 linuxmint kernel: pcieport 0000:00:01.0:    [20] UnsupReq               (First)
Oct 18 20:26:07 linuxmint kernel: pcieport 0000:00:01.0: AER:   TLP Header: 34000000 01000010 00000000 00000000
Oct 18 20:26:07 linuxmint kernel: nvidia 0000:01:00.0: AER: can't recover (no error_detected callback)
Oct 18 20:26:07 linuxmint kernel: snd_hda_intel 0000:01:00.1: AER: can't recover (no error_detected callback)
Oct 18 20:26:07 linuxmint kernel: pcieport 0000:00:01.0: AER: device recovery failed

I tried pci=check_enable_amd_mmconf and idle=nomwait as kernel parameters, as well as the others described in "GPU has fallen off the bus" while idle, only occurs when all displays powered off - #2 by BCS, and also pci=noaer as a GRUB boot option.

The PSU is a 1000 W Gigabyte UD1000GM PG5 with a PCIe Gen 5.0 (1 x 16 pins) port (and there is no issue for hours with the same setup under Windows).

I also tried uninstalling the driver (sudo apt remove --purge nvidia-*) and reinstalling it, but the issue is the same. Sometimes I have time to launch a game, and it runs on ultra settings at a full 180 Hz.

I tried a fresh Linux Mint 21 installation with the 520 drivers and Steam: same issue, freeze.
Same GPU on Windows 11: no issues.

The thermals are also good: https://i.imgur.com/DA4ER2O.png

I’m torn between a kernel bug and a driver bug. What can I try to find out what the problem is?

I recommend posting on the forums instead of DM so you get more people seeing the issue, and so others can find your description and any eventual solutions in the future.

My system has an AMD processor and chipset and you’re on Intel, so I’m not surprised that the check_enable_amd_mmconf option that worked for me didn’t for you. I think the underlying issue is the same, though: the Nvidia driver just doesn’t try very hard to keep going after minor errors or protocol mismatches, even though the PCI subsystem classifies them as non-fatal. The PCIe specification is very carefully designed to be robust and fault-tolerant, but Nvidia’s drivers are not.

The best clue you have is the “error status/mask=00100000/00010000” in the kernel AER message. The meanings of these registers are explained on pp. 497-499 of https://www.cl.cam.ac.uk/~djm202/pdf/specifications/pcie/PCI_Express_Base_Rev_2.0_20Dec06a.pdf ; status 00100000 (bit 20 set) means the card thinks it received an unsupported request from the bus controller, and mask 00010000 (bit 16 set) means the card is set to (mostly) ignore unexpected command completion, but to fully report all other uncorrectable errors. If you’re comfortable with the setpci utility, you could try setting bit 20 in the uncorrectable error mask register (i.e. set its value to 0x00110000), and/or clearing the “unsupported request reporting bit” (bit 3, 0x8) from the device control register. That doesn’t address whatever the underlying issue is, but it might stop the Nvidia driver from just assuming the GPU has “fallen off the bus” when it sends a request the card can’t support.
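To make the bit arithmetic above concrete, here is a minimal Python sketch that decodes the status/mask pair from the log and computes the mask value with bit 20 added. The bit names follow the AER Uncorrectable Error Status/Mask register layout. The setpci command strings it prints are an assumption about how one might apply the change to the port named in the log (0000:00:01.0): the ECAP_AER and CAP_EXP register names depend on your pciutils version, so check them with "setpci --dumpregs" before running anything. The script only prints the commands and does not touch hardware.

```python
# Bit positions in the AER Uncorrectable Error Status / Mask registers.
UNCORRECTABLE_BITS = {
    4: "Data Link Protocol Error",
    5: "Surprise Down",
    12: "Poisoned TLP",
    13: "Flow Control Protocol Error",
    14: "Completion Timeout",
    15: "Completer Abort",
    16: "Unexpected Completion",
    17: "Receiver Overflow",
    18: "Malformed TLP",
    19: "ECRC Error",
    20: "Unsupported Request",
    21: "ACS Violation",
}

UNSUPPORTED_REQUEST_BIT = 20


def decode(value):
    """Return the names of all known bits set in a status or mask value."""
    return [
        name
        for bit, name in sorted(UNCORRECTABLE_BITS.items())
        if (value >> bit) & 1
    ]


# Value copied from the kernel message: "error status/mask=00100000/00010000"
status, mask = (int(field, 16) for field in "00100000/00010000".split("/"))
new_mask = mask | (1 << UNSUPPORTED_REQUEST_BIT)

print("status:", decode(status))
print("mask:  ", decode(mask))
print("mask with bit 20 added:", f"{new_mask:#010x}")

# Hypothetical commands, printed only. Assumes the AER mask register sits at
# offset 0x08 of the AER capability and Device Control at offset 0x08 of the
# PCI Express capability; the device address is the port from the log.
DEVICE = "00:01.0"
print(f"setpci -s {DEVICE} ECAP_AER+0x08.L            # read current mask")
print(f"setpci -s {DEVICE} ECAP_AER+0x08.L={new_mask:#010x} # write new mask")
print(f"setpci -s {DEVICE} CAP_EXP+0x08.W             # read Device Control")
```

Changes made with setpci do not survive a reboot, so a bad value can be undone by restarting.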

If you search through the specification for the phrase “unsupported request” you will find all of the cases that trigger that error: it could be invoking a command or function that the card doesn’t support, it could be that your motherboard and card are handshaking on a version of the PCIe protocol that doesn’t support all of the features the driver wants to use (especially if this is an older motherboard), it could be that the CPU is not waiting long enough after certain commands before sending the next command, or it could be that the CPU is directing the card to read from or write to memory that the card doesn’t think it has access to. That last cause is probably most likely, as the memory region negotiation between card, chipset, firmware, and OS can be complex and error-prone. I’d comb through the output of “dmesg | grep -i -e pci -e acpi -e nvidia -e gpu” to see if there were any problems with the memory window assignments. It’s probably also worth a look through your BIOS/UEFI options to see whether you can simplify any of the memory mapping, e.g. by disabling BAR resizing. You can try booting with pci=nocrs or pci=noacpi, although that may break other device drivers.
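When the filtered dmesg output is long, it can help to count AER reports per device before reading line by line. The following is a rough Python sketch of that idea: it applies the same keyword filter as the grep command above and tallies "error received" lines by device and severity. The function names and the SAMPLE lines are illustrative only (the sample lines are copied from logs in this thread), and the regular expression assumes the message format shown there.

```python
import re
from collections import Counter

# Same keywords as: dmesg | grep -i -e pci -e acpi -e nvidia -e gpu
KEYWORDS = ("pci", "acpi", "nvidia", "gpu")

# Matches e.g. "0000:00:1c.4: AER: Uncorrected (Non-Fatal) error received"
AER_RECEIVED = re.compile(
    r"(?P<dev>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]): AER: "
    r"(?P<severity>Corrected|Uncorrected \([^)]+\)) error received"
)


def relevant(lines, keywords=KEYWORDS):
    """Keep lines containing any keyword, case-insensitively."""
    return [ln for ln in lines if any(k in ln.lower() for k in keywords)]


def count_aer(lines):
    """Count AER 'error received' reports per (device, severity)."""
    counts = Counter()
    for ln in lines:
        match = AER_RECEIVED.search(ln)
        if match:
            counts[(match["dev"], match["severity"])] += 1
    return counts


SAMPLE = [
    "pcieport 0000:00:1c.4: AER: Uncorrected (Non-Fatal) error received: 0000:00:1c.4",
    "pcieport 0000:00:1c.4: AER: device recovery failed",
    "pcieport 0000:00:1c.4: AER: Corrected error received: 0000:00:1c.4",
    "pcieport 0000:00:01.0: AER: Uncorrected (Non-Fatal) error received: 0000:00:01.0",
    "usb 1-2: new high-speed USB device number 3 using xhci_hcd",
]

for (dev, severity), n in sorted(count_aer(relevant(SAMPLE)).items()):
    print(f"{dev}  {severity}: {n}")
```

To use it on a real log, replace SAMPLE with the lines of your saved dmesg output.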

Thank you for your answer @BCS. The post is on the forum, maybe you received a DM because I linked your post in mine.

I’ll have a look at the setpci tool tomorrow. I don’t know how it works, but I’ll try it with the settings you suggested. I really hope a workaround can be found; having a 4090 sitting unused next to me is kind of sad.

I’ll also paste the dmesg log, and as my motherboard is really recent (ROG MAXIMUS Z690 HERO), I’ll check the BIOS for any kind of “memory” settings.

Please run nvidia-bug-report.sh as root and attach the resulting nvidia-bug-report.log.gz file to your post.

Sure. Here it is
nvidia-bug-report.log.gz (457.5 KB)

That looks like a lot of mainboard issues. Thunderbolt controller freaking out, ACPI errors a-go-go, resource allocation quirks. Furthermore, 20.04 based Mint isn’t really a good choice for it, needs newer kernel and firmware.
Please

Afterwards, please create a new nvidia-bug-report.log

I don’t have great news. While I was in Windows 11 to download the BIOS, it crashed there too, with a black screen and a reboot (if you need a memory dump, let me know). Anyway:

  • Bios updated 1304 > 2103
  • Kernel Liquorix installed: 5.19.0-16.2-liquorix-amd64
  • linux-firmware_20220329 updated

The system crashed the same way. Same issue with the other 4090 GPU. What else can it be? The power supply of the motherboard or GPU? But I just bought a new 1000 W unit, and there is no issue with the 3080.

Log file attached
nvidia-bug-report.log.gz (407.6 KB)

I’d rather say bad quality mainboard/bios. There’s also an issue with USB. I guess the 4090 just triggers it.
Firmware is fine now.
Please check if you can disable the TB controller in bios, maybe that helps getting it stable.

Which memory type are you using, DDR4 or 5? The board only reports “OUT OF SPEC” for it at 4000MT/s.

Also, did you always use the same slot?

All USB disconnected, and TB deactivated in the BIOS. I’m using Corsair DOMINATOR PLATINUM RGB DDR5 32 GB RAM.

I also wanted to try the other slot. The second one wasn’t accessible due to the size of the 4090, but because I had another 4090, a watercooled one, I could try it.

Same freezes in all cases. What really surprises me is that everything works great with the 3080. Just swapping the cards makes my whole system crash. I also checked that everything was properly powered, and there seems to be no issue with that (GPU-Z tells me everything is fine).

I don’t really know what to try next, as the issue also occurs under an up-to-date Windows 11.

Since it also occurs with Windows, I guess you should contact Asus support about compatibility of their board with the 4090.

@BCS With pci=noacpi, I can’t boot. pci=nocrs gives me the same error as before:

Oct 20 20:34:07 linuxmint kernel: NVRM: GPU at PCI:0000:01:00: GPU-bebd8b3c-a7ea-a03e-ee01-4c6d74304ee3
Oct 20 20:34:07 linuxmint kernel: NVRM: Xid (PCI:0000:01:00): 79, pid='<unknown>', name=<unknown>, GPU has fallen off the bus.
Oct 20 20:34:07 linuxmint kernel: NVRM: GPU 0000:01:00.0: GPU has fallen off the bus.
Oct 20 20:34:07 linuxmint kernel: pcieport 0000:00:01.0: AER: Uncorrected (Non-Fatal) error received: 0000:00:01.0
Oct 20 20:34:07 linuxmint kernel: pcieport 0000:00:01.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID)
Oct 20 20:34:07 linuxmint kernel: pcieport 0000:00:01.0:   device [8086:460d] error status/mask=00100000/00010000
Oct 20 20:34:07 linuxmint kernel: pcieport 0000:00:01.0:    [20] UnsupReq               (First)
Oct 20 20:34:07 linuxmint kernel: pcieport 0000:00:01.0: AER:   TLP Header: 34000000 01000010 00000000 00000000
Oct 20 20:34:07 linuxmint kernel: nvidia 0000:01:00.0: AER: can't recover (no error_detected callback)
Oct 20 20:34:07 linuxmint kernel: snd_hda_intel 0000:01:00.1: AER: can't recover (no error_detected callback)
Oct 20 20:34:07 linuxmint kernel: pcieport 0000:00:01.0: AER: device recovery failed

About the dmesg log:

[   18.312381] pcieport 0000:00:1c.4: AER: Uncorrected (Non-Fatal) error received: 0000:00:1c.4
[   18.312389] pcieport 0000:00:1c.4: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID)
[   18.312390] pcieport 0000:00:1c.4:   device [8086:7abc] error status/mask=00100000/00004000
[   18.312391] pcieport 0000:00:1c.4:    [20] UnsupReq               (First)
[   18.312391] pcieport 0000:00:1c.4: AER:   TLP Header: 34000000 07000052 00000000 00000000
[   18.312406] pcieport 0000:00:1c.4: AER: device recovery failed
[   18.312501] pcieport 0000:00:1c.4: AER: Corrected error received: 0000:00:1c.4
[   18.312506] pcieport 0000:00:1c.4: PCIe Bus Error: severity=Corrected, type=Transaction Layer, (Receiver ID)
[   18.312506] pcieport 0000:00:1c.4:   device [8086:7abc] error status/mask=00008000/00002000
[   18.312507] pcieport 0000:00:1c.4:    [15] HeaderOF              
[   18.312511] pcieport 0000:00:1c.4: AER: Uncorrected (Non-Fatal) error received: 0000:00:1c.4
[   18.312520] pcieport 0000:00:1c.4: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID)
[   18.312520] pcieport 0000:00:1c.4:   device [8086:7abc] error status/mask=00100000/00004000
[   18.312521] pcieport 0000:00:1c.4:    [20] UnsupReq               (First)
[   18.312522] pcieport 0000:00:1c.4: AER:   TLP Header: 34000000 07000052 00000000 00000000
[   18.312536] pcieport 0000:00:1c.4: AER: device recovery failed
[   18.312635] pcieport 0000:00:1c.4: AER: Corrected error received: 0000:00:1c.4
[   18.312640] pcieport 0000:00:1c.4: PCIe Bus Error: severity=Corrected, type=Transaction Layer, (Receiver ID)
[   18.312641] pcieport 0000:00:1c.4:   device [8086:7abc] error status/mask=00008000/00002000
[   18.312641] pcieport 0000:00:1c.4:    [15] HeaderOF              
[   18.312646] pcieport 0000:00:1c.4: AER: Uncorrected (Non-Fatal) error received: 0000:00:1c.4
[   18.312654] pcieport 0000:00:1c.4: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID)
[   18.312655] pcieport 0000:00:1c.4:   device [8086:7abc] error status/mask=00100000/00004000
[   18.312655] pcieport 0000:00:1c.4:    [20] UnsupReq               (First)
[   18.312656] pcieport 0000:00:1c.4: AER:   TLP Header: 34000000 07000052 00000000 00000000
[   18.312673] pcieport 0000:00:1c.4: AER: device recovery failed
[   18.312761] pcieport 0000:00:1c.4: AER: Corrected error received: 0000:00:1c.4
[   18.312766] pcieport 0000:00:1c.4: PCIe Bus Error: severity=Corrected, type=Transaction Layer, (Receiver ID)
[   18.312766] pcieport 0000:00:1c.4:   device [8086:7abc] error status/mask=00008000/00002000
[   18.312767] pcieport 0000:00:1c.4:    [15] HeaderOF              
[   18.312772] pcieport 0000:00:1c.4: AER: Uncorrected (Non-Fatal) error received: 0000:00:1c.4
[   18.312780] pcieport 0000:00:1c.4: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID)
[   18.312781] pcieport 0000:00:1c.4:   device [8086:7abc] error status/mask=00100000/00004000
[   18.312781] pcieport 0000:00:1c.4:    [20] UnsupReq               (First)
[   18.312782] pcieport 0000:00:1c.4: AER:   TLP Header: 34000000 07000052 00000000 00000000
[   18.312797] pcieport 0000:00:1c.4: AER: device recovery failed
[   18.312876] pcieport 0000:00:1c.4: AER: Corrected error received: 0000:00:1c.4
[   18.312881] pcieport 0000:00:1c.4: PCIe Bus Error: severity=Corrected, type=Transaction Layer, (Receiver ID)
[   18.312881] pcieport 0000:00:1c.4:   device [8086:7abc] error status/mask=00008000/00002000
[   18.312882] pcieport 0000:00:1c.4:    [15] HeaderOF              
[   18.312886] pcieport 0000:00:1c.4: AER: Uncorrected (Non-Fatal) error received: 0000:00:1c.4
[   18.312894] pcieport 0000:00:1c.4: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID)
[   18.312895] pcieport 0000:00:1c.4:   device [8086:7abc] error status/mask=00100000/00004000
[   18.312896] pcieport 0000:00:1c.4:    [20] UnsupReq               (First)
[   18.312896] pcieport 0000:00:1c.4: AER:   TLP Header: 34000000 07000052 00000000 00000000
[   18.312911] pcieport 0000:00:1c.4: AER: device recovery failed
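The "TLP Header" lines in the logs above can be unpacked to see which request each port rejected. The Python sketch below assumes the standard header layout: Fmt and Type in the top byte of the first dword and, for message TLPs, the requester ID and message code in the second dword. The small table of message code names is partial and worth confirming against the PCIe specification revision your hardware implements. It also assumes a non-ARI requester ID (bus:device.function). This only identifies the offending request and does not by itself explain why the port rejects it.

```python
# Message routing, taken from the low three bits of the Type field.
ROUTING = {
    0b000: "routed to Root Complex",
    0b001: "routed by address",
    0b010: "routed by ID",
    0b011: "broadcast from Root Complex",
    0b100: "local, terminate at receiver",
    0b101: "gathered and routed to Root Complex",
}

# Partial list of message codes (assumption: verify against your spec revision).
MESSAGE_CODES = {
    0x10: "LTR (Latency Tolerance Reporting)",
    0x18: "PM_PME",
    0x30: "ERR_COR",
    0x31: "ERR_NONFATAL",
    0x33: "ERR_FATAL",
    0x50: "Set_Slot_Power_Limit",
    0x52: "PTM Request",
}


def decode_tlp_header(text):
    """Decode the four hex dwords printed after 'TLP Header:'."""
    dw0, dw1, _dw2, _dw3 = (int(word, 16) for word in text.split())
    fmt = dw0 >> 29                 # bits 31:29 of dword 0
    tlp_type = (dw0 >> 24) & 0x1F   # bits 28:24 of dword 0
    info = {"fmt": fmt, "type": tlp_type, "is_message": False}
    if tlp_type >> 3 == 0b10:       # Type 10rrr marks a message TLP
        requester = dw1 >> 16
        bus = requester >> 8
        dev = (requester >> 3) & 0x1F
        fn = requester & 0x7
        code = dw1 & 0xFF
        info.update(
            is_message=True,
            routing=ROUTING.get(tlp_type & 0x7, "reserved"),
            requester=f"{bus:02x}:{dev:02x}.{fn}",
            message_code=code,
            message_name=MESSAGE_CODES.get(code, "unknown"),
        )
    return info


# Headers copied from the two ports that appear in the logs.
for header in ("34000000 01000010 00000000 00000000",
               "34000000 07000052 00000000 00000000"):
    print(header, "->", decode_tlp_header(header))
```

Under these assumptions, the header logged at 00:01.0 decodes as an LTR message from requester 01:00.0 (the GPU), and the one logged at 00:1c.4 as a PTM Request from requester 07:00.0.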