Ubuntu upgrade from 19.04 > 19.10 & enabling Nvidia 435 floods "PCIE Bus Error: Severity=Corrected"

Upgraded my Dell XPS15 9560 to Ubuntu 19.10 and enabled driver 435. Now, when selecting Nvidia “performance mode” or “on-demand”, syslog gets flooded with

PCIE Bus Error: Severity=Corrected after booting into Ubuntu

I tried to get rid of it by adding the boot parameter ‘pci=nomsi’ to grub:

GRUB_CMDLINE_LINUX_DEFAULT="quiet splash pci=nomsi"

But that shot up my idle power from 3.8W to 8W, so double. It appears ‘tick_sched_timer’ started to work like crazy.
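For anyone following along: on Ubuntu, a change to GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub only takes effect after `sudo update-grub` and a reboot. A minimal sketch to confirm that a parameter actually reached the running kernel; `has_kernel_param` is a made-up helper name, and all it does is read /proc/cmdline:

```shell
# Hypothetical helper: succeed if the given parameter is on the kernel command line.
# The optional second argument only exists so the check can be pointed at another file.
has_kernel_param() {
  param=$1
  cmdline_file=${2:-/proc/cmdline}
  # Put one parameter per line, then require an exact, literal match.
  tr ' ' '\n' < "$cmdline_file" | grep -qxF -- "$param"
}

# Example:
#   has_kernel_param pci=nomsi && echo "pci=nomsi is active"
```

The same check works for any of the other parameters discussed in this thread.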

So…, basically, Nvidia became useless and unusable with 435 on Ubuntu latest… :(

Any ideas?

Rather try
pci=noaer

Oops… unless I am seriously mistaken, I consider this a bad suggestion…

Switch off serious error reporting because we don’t want to see the errors, instead of solving the hardware-related IRQ issues that cause the errors in the first place?

Or am I missing something?

Those are low-level bus errors; you can’t really fix them, besides maybe upgrading the BIOS. I also wouldn’t know what they have to do with IRQs. So turning off AER at least prevents log spam.

These bus errors did not appear in previous Nvidia drivers, just in the new 435 …

To make things clearer: nomsi turns off both MSI and AER, meaning it’s kind of a placebo. It doesn’t fix those errors, it just turns off the reporting of them as well.

Are you sure about that? As far as I can tell, nomsi turns off message signaled interrupts, hence the power peaking, but still allows error reporting for other PCI-related issues.

Anyway, back to the issue itself: the good news is that it seems to be an Nvidia driver issue and nothing “deeper”… The bad news is that we can’t use Nvidia anymore in Optimus laptops; I will try to downgrade the driver…

http://linux-kernel.2935.n7.nabble.com/PATCH-PCI-Disable-AER-with-pci-nomsi-td495976.html

OK, so AER is automatically off when MSI is off…

Going back to the issue: are you saying that the 435 driver is behaving properly, the errors are nothing to care about, and we just should not log them? To me that’s counterintuitive, but I am far from a kernel programmer ))

Since it’s Severity=Corrected I’d just turn off AER. Of course that doesn’t mean everything is fine; those errors always point towards quality problems with the mainboard or, more precisely, with add-ons like WiFi cards (M.2/mini-PCIe) or SD card readers. You didn’t post the full error message, which would also show which device reports that bus error, but I would bet that it’s not the Nvidia GPU. Installing the Nvidia driver could trigger this because it puts load on the PCIe bus, but it’s very unlikely to be the cause of it. More likely it’s changes in the kernel.
To be ultimately sure you’ll have to revert to an earlier driver, of course.

Ah… thanks for the great clarification. You might have a point, as I also recently replaced the default Killer WiFi card with an Intel 9260 to get Bluetooth 5… did not think of that at all…

Full error I am getting:

Nov  4 10:42:14 tom-XPS-15-9560 kernel: [   88.314649] pcieport 0000:00:1d.0: AER: Corrected error received: 0000:04:00.0
Nov  4 10:42:14 tom-XPS-15-9560 kernel: [   88.314659] nvme 0000:04:00.0: AER: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
Nov  4 10:42:14 tom-XPS-15-9560 kernel: [   88.314666] nvme 0000:04:00.0: AER:   device [1c5c:1283] error status/mask=00000001/0000e000
Nov  4 10:42:14 tom-XPS-15-9560 kernel: [   88.314671] nvme 0000:04:00.0: AER:    [ 0] RxErr
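With the log being flooded, it can help to see which device the corrected errors are actually attributed to. A rough sketch, assuming the message layout shown above; `aer_summary` is a made-up name and the function just counts the "PCIe Bus Error" lines per driver and PCI address:

```shell
# Hypothetical helper: read kernel log lines on stdin and count corrected
# AER errors per "driver address" pair, most frequent first.
aer_summary() {
  grep -iE 'AER: PCIe Bus Error: severity=Corrected' \
    | sed -E 's/.*\] ([^ ]+ [0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]): AER:.*/\1/' \
    | sort | uniq -c | sort -rn
}

# Example (on Ubuntu the kernel messages also land in /var/log/syslog):
#   aer_summary < /var/log/syslog
```

If a single address dominates the output, that is the device to look at first.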

The errors keep appearing, even if I switch to airplane mode, so the card should be off (?)

Device 1c5c:1283 is your nvme drive. Don’t know if reseating this would help.
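The same lookup can be repeated for any AER message. A small sketch that pulls the vendor:device pair out of the `device [xxxx:xxxx]` line so it can be handed to `lspci -nn -d`; the function name `aer_device_ids` is made up for illustration:

```shell
# Hypothetical helper: read kernel log lines on stdin and print each distinct
# vendor:device ID that appears in an AER "device [vvvv:dddd]" line.
aer_device_ids() {
  grep -oE 'device \[[0-9a-f]{4}:[0-9a-f]{4}\]' \
    | grep -oE '[0-9a-f]{4}:[0-9a-f]{4}' \
    | sort -u
}

# Example:
#   aer_device_ids < /var/log/syslog
#   lspci -nn -d 1c5c:1283    # show the device that carries this ID
```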

Ah, thank you.
Your tip sent me to Google, where I found https://wiki.archlinux.org/index.php/Dell_XPS_15_9560 which miraculously covers my exact laptop model… ;-)

There they suggest:

pci=nommconf

or

pcie_aspm=off

but the last one might up the power again

Any clue why this is related to Nvidia? Or are we just talking about “imbalanced behavior”?

nommconf also just disables aer, another placebo. Disabling aspm can indeed fix those errors but it is disabled on most notebooks anyway. Run
sudo dmesg |grep -i aspm
to see whether or not it is available.

sudo dmesg |grep -i aspm does not give me any reply…

(Just upgraded the BIOS to the latest version, 1.16; no difference)

Maybe the log spam pushed out early boot messages. Just try using pcie_aspm=off and check if it changes anything.
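If the ring buffer has already wrapped, two other places may still show the ASPM state. Assuming a systemd journal, `journalctl -k -b | grep -i aspm` searches the kernel messages of the current boot. And on kernels built with ASPM support, the active policy should be readable from sysfs. A minimal sketch for the latter; `aspm_policy` is a made-up name and the sysfs path is an assumption about your kernel build:

```shell
# Hypothetical helper: print the active ASPM policy, i.e. the entry in [brackets].
# The file typically looks like: [default] performance powersave powersupersave
aspm_policy() {
  policy_file=${1:-/sys/module/pcie_aspm/parameters/policy}
  sed -nE 's/.*\[([a-z]+)\].*/\1/p' "$policy_file"
}

# Example:
#   aspm_policy
```

For a per-device view, `sudo lspci -vv | grep -i aspm` lists the ASPM capability and link control state of each PCIe device.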

  • ‘pcie_aspm=off’ works, no more errors… but after running a few tests, idle power seems about 25% higher, so not really an option for me
  • For now I am settling on the “pci=noaer” placebo / bandaid … :(
  • Also went back to Nvidia 435; the issues on 430 were the same

Somehow it seems that the Nvidia chip and the NVME are not all behaving nicely on the PCI bus in my XPS15?

The weird thing is that I never saw this on Ubuntu 19.04, so I am still baffled. I hope others experience the same issue so we can dig deeper

@generix: So far thank you for your help, much appreciated !

19.04 had kernel 5.0 and 19.10 has kernel 5.3, so I’d rather suspect some power management changes in between, which are triggering this on your NVMe drive.