Nvidia 410.72 random segfaults

Recently on newer Nvidia drivers, I’ve been seeing segfaults/system hangs and PCIe error spam in dmesg. There doesn’t seem to be any real trigger for them - they just kind of happen. At first I thought it was a kernel bug introduced in 4.18, but now with 4.19 I’m not so sure…

Attached is the dmesg log.

dmesg.txt (95.9 KB)

It’s truncated, please attach as file. Hovering the mouse over an existing post of yours will reveal a paperclip icon.
https://devtalk.nvidia.com/default/topic/1043347/announcements/attaching-files-to-forum-topics-posts/

Didn’t even notice.

Added.

The PCI bridge 03.1, which the Nvidia card is connected to, is reporting errors; that doesn’t look good.
Apart from that, the 410 driver doesn’t seem fit for the 4.19 kernel yet:
https://devtalk.nvidia.com/default/topic/1043938/linux/nvidia-geforce-4xx-series-drivers-segfault-with-kernel-4-19-x/

Addendum: the AER errors show up between the PCI root and the bridge (00.0->03.1), which means they are on the mainboard.
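
To see which device is reporting the AER errors, a dmesg dump can be tallied by PCI address. This is only a rough sketch: the function name `count_aer_lines` and the sample lines are made up for illustration (they are not taken from the attached dmesg.txt), and it assumes the AER messages contain "AER" or "PCIe Bus Error" followed by an address in domain:bus:device.function form.

```python
import re
from collections import Counter

# PCI address in domain:bus:device.function form, e.g. 0000:00:03.1
PCI_ADDR = re.compile(r"\b[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]\b")


def count_aer_lines(dmesg_text):
    """Count AER-related log lines, keyed by the first PCI address on each line."""
    counts = Counter()
    for line in dmesg_text.splitlines():
        # Assumption: AER messages mention "AER" or "PCIe Bus Error".
        if "AER" not in line and "PCIe Bus Error" not in line:
            continue
        match = PCI_ADDR.search(line)
        if match:
            counts[match.group(0)] += 1
    return counts


# Illustrative lines only, NOT taken from the attached dmesg.txt.
sample = """\
[  100.000000] pcieport 0000:00:03.1: AER: Corrected error received: 0000:00:00.0
[  100.000100] pcieport 0000:00:03.1: PCIe Bus Error: severity=Corrected, type=Data Link Layer
[  101.000000] usb 1-1: new high-speed USB device number 2 using xhci_hcd
"""

for address, hits in count_aer_lines(sample).most_common():
    print(address, hits)
```

If nearly all hits land on the bridge address rather than on the GPU's own address, that would fit the reading that the GPU is affected but not the source.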

It’s the version Arch is shipping. This happened with older drivers and 4.18 kernel as well.

Checked if it works in another slot?

Addendum: the AER errors show up between the PCI root and the bridge (00.0->03.1), which means they are on the mainboard.

What do you mean by this? So it’s not a GPU error then?

Checked if it works in another slot?

I don’t know how to check since it’s seemingly random. I’d be limited to 8x speed too.

No, the GPU is not involved, just affected. This looks like errors on the Promontory bridge. While this is a standard flaw on the TR X399 chipset, I’ve never seen it on a Ryzen X370 chipset, so it looks like a HW failure of the mainboard to me. Maybe try the X399 workaround: if possible, set the Promontory bridge to Gen2 in the BIOS and add the kernel parameter pci=noaer.
It might also be a system memory failure. Don’t use memtest; just remove all memory modules but one, check, then swap.
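
After adding the parameter and rebooting, it is worth confirming the kernel actually booted with it. A minimal sketch, assuming a Linux system where /proc/cmdline is readable; the helper names `has_kernel_param` and `read_cmdline` are hypothetical. Keep in mind that pci=noaer turns off AER reporting, so it quiets the log spam rather than repairing whatever causes the errors.

```python
def has_kernel_param(cmdline, param):
    """Return True if `param` appears as a whole token on the kernel command line."""
    return param in cmdline.split()


def read_cmdline(path="/proc/cmdline"):
    """Read the command line the running kernel was booted with (Linux only)."""
    with open(path) as f:
        return f.read().strip()


if __name__ == "__main__":
    try:
        active = has_kernel_param(read_cmdline(), "pci=noaer")
        print("pci=noaer active:", active)
    except OSError:
        # /proc/cmdline is missing on non-Linux systems.
        print("could not read /proc/cmdline")
```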

You know, now that you mention it… when the memory clock speed was set too high or wasn’t stable, it was always the GPU that started going wonky in Windows. I didn’t even think about it until now because Windows worked fine. I’ve also been experiencing model flickering in Metro Redux in Linux, which may be related. I’m using 32GB at 2800MHz (Corsair Vengeance 3200MHz).

Well, guess I’ll be decreasing my memory speed to see if it helps any…

I’ve decreased the memory clock speeds and it’s still segfaulting. Is there any way I can get in contact with Nvidia besides the forums?

linux-bugs[at]nvidia.com