Random Xid 79 Crashing

Every day at random intervals, my 5070 Ti will drop the video signal and blast its fans. The only relevant log message I can find is “Xid 79: The GPU has fallen off the bus”. It does not depend on load or temperature.

I do not think it is a hardware issue since they are all brand new components in a custom PC, and I have reseated the GPU a couple times.

I have been trying to figure this out for weeks, but I am not familiar with really low-level debugging. Can someone help me determine whether this is a hardware problem, a driver bug, a kernel misconfiguration, or something else? If it is a hardware problem, that will be really bad for me, so I want to know for sure.

nvidia-bug-report.log.gz (439.7 KB)

I have attached the nvidia-bug-report dump, but here is my system at a glance:

  • Gentoo Linux 6.12.58 running Wayland with nvidia-open 580.95.05 drivers

  • RTX 5070 Ti (ZOTAC Solid SFF OC)

  • Corsair 750W Platinum PSU

  • ASRock B650I Lightning WIFI Motherboard

  • AMD Ryzen 7 7800X3D CPU

  • Also I have already tried setting pcie_aspm=off in my kernel cmdline. I can’t disable nvidia.NVreg_EnableGpuFirmware since the 50 series requires the open drivers, which require GSP.

From the bug report, pcie_aspm=off is in your kernel cmdline but the lspci data shows ASPM L1 is still enabled on both sides of the link:

  • Root port (0000:00:01.1): LnkCtl: ASPM L1 Enabled

  • GPU endpoint (0000:01:00.0): LnkCtl: ASPM L1 Enabled

The kernel parameter did not disable ASPM on this platform. pcie_aspm=off only tells the kernel to leave the ASPM configuration alone, so whatever the firmware enabled at boot stays in effect, and on AMD B650 boards the firmware often enables L1 on its own.
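
As a rough way to confirm what the kernel was actually told, the sketch below pulls the pcie_aspm value out of the command line and reads the active ASPM policy. The helper names cmdline_param and active_policy are made up for illustration. The file /sys/module/pcie_aspm/parameters/policy is only present on kernels built with ASPM support, so a missing file is reported as "unknown" rather than treated as an error.

```shell
#!/bin/sh
# Hypothetical helpers, not part of any NVIDIA or kernel tooling.

# Print the value of a key=value kernel parameter.
# $1 = key, $2 = cmdline string (defaults to the running kernel's cmdline).
cmdline_param() {
    key=$1
    cmdline=${2:-$(cat /proc/cmdline)}
    for word in $cmdline; do
        case $word in
            "$key"=*) printf '%s\n' "${word#*=}" ;;
        esac
    done
}

# Print the policy shown in [brackets] in the sysfs policy file.
# $1 = path to the policy file (defaults to the real sysfs location).
active_policy() {
    file=${1:-/sys/module/pcie_aspm/parameters/policy}
    [ -r "$file" ] || { echo unknown; return 0; }
    sed -n 's/.*\[\([^]]*\)\].*/\1/p' "$file"
}
```

Typical use would be cmdline_param pcie_aspm followed by active_policy. If the first prints "off", the kernel is leaving the firmware's ASPM settings untouched.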

The crash happened at 02:09 overnight while the system was idle. That lines up with ASPM L1 — the GPU goes into a low power state when nothing is using it, and if it fails to wake back up, the driver loses contact and logs Xid 79.

Disable ASPM in BIOS directly. On B650 boards it’s usually under AMD PBS or AMD Overclocking → NBIO → PCIe. Then verify with:

sudo lspci -vvv -s 00:01.1 | grep -i "LnkCtl"
sudo lspci -vvv -s 01:00.0 | grep -i "LnkCtl"

Both should show ASPM Disabled.
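
To avoid eyeballing the output, the check can be wrapped in a small helper. This is only a sketch: aspm_state and check_aspm are made-up names, and the sed pattern assumes the LnkCtl layout that lspci prints in this thread ("LnkCtl: ASPM <state>; ...").

```shell
#!/bin/sh
# Read `lspci -vvv` output on stdin and print the ASPM field of the LnkCtl
# line, e.g. "L1 Enabled" or "Disabled". Prints nothing if no LnkCtl line
# is found.
aspm_state() {
    sed -n 's/^[[:space:]]*LnkCtl:[[:space:]]*ASPM \([^;]*\);.*/\1/p'
}

# Report the ASPM state for each PCI address given as an argument.
check_aspm() {
    for dev in "$@"; do
        state=$(lspci -vvv -s "$dev" 2>/dev/null | aspm_state)
        echo "$dev: ${state:-unknown}"
    done
}
```

From a root shell (lspci needs root to read the link capability registers), check_aspm 00:01.1 01:00.0 prints one line per device.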

To verify PCIe link health independently:

git clone https://github.com/parallelArchitect/gpu-pcie-path-validator.git


If the link tests clean with ASPM actually disabled and the crashes stop, that confirms the root cause.
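
Independently of that tool, the kernel exposes the negotiated link speed and width in sysfs, which gives a quick first look. The sketch below uses a made-up helper name, link_summary; the four attribute files it reads are standard PCI sysfs entries. A current speed below the maximum is not by itself a fault, since GPUs commonly drop the link speed when idle, but a current width narrower than the maximum would be worth a closer look.

```shell
#!/bin/sh
# Print current vs. maximum PCIe link speed and width for one device.
# $1 = full PCI address, e.g. 0000:01:00.0
# $2 = sysfs base directory (defaults to the real one; overridable for testing)
link_summary() {
    dev=$1
    base=${2:-/sys/bus/pci/devices}
    for attr in current_link_speed max_link_speed current_link_width max_link_width; do
        file=$base/$dev/$attr
        if [ -r "$file" ]; then
            printf '%s=%s\n' "$attr" "$(cat "$file")"
        else
            printf '%s=unknown\n' "$attr"
        fi
    done
}
```

For the GPU in this thread the call would be link_summary 0000:01:00.0, ideally run once at idle and once under load to compare.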

I can confirm that this is the issue, as I have had “Prefer Maximum Performance” set in my nvidia-settings, which has drastically reduced crashes to maybe once every other week.

However, I can’t seem to find the option to disable ASPM in my BIOS. The closest I could find was AMD CBS > PROM21 Chipset Common Options > PCI Express Power Management, which I disabled, but

sudo lspci -vvv -s 00:01.1 | grep -i "LnkCtl"
sudo lspci -vvv -s 01:00.0 | grep -i "LnkCtl"

still show

LnkCtl:	ASPM L1 Enabled; RCB 64 bytes, LnkDisable- CommClk+

So if anyone happens to know the correct BIOS setting for my motherboard, please let me know.
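
If no BIOS option turns up, one possible avenue is the kernel's per-device ASPM control in sysfs (the link/l1_aspm attribute). This is a sketch under assumptions, not a confirmed fix for this board: disable_l1 is a made-up helper name, the attribute is only exposed when the kernel itself is managing ASPM (which, as far as I understand, means booting without pcie_aspm=off), and the firmware may still refuse to hand over control, in which case the file will simply be missing.

```shell
#!/bin/sh
# Try to turn off ASPM L1 for one device via the kernel's sysfs knob.
# $1 = full PCI address, e.g. 0000:01:00.0
# $2 = sysfs base directory (defaults to the real one; overridable for testing)
# Returns 1 if the knob is not exposed or not writable for this device.
disable_l1() {
    dev=$1
    base=${2:-/sys/bus/pci/devices}
    knob=$base/$dev/link/l1_aspm
    if [ ! -w "$knob" ]; then
        echo "no writable l1_aspm knob for $dev" >&2
        return 1
    fi
    echo 0 > "$knob"
    # Read the value back so the result is visible.
    echo "$dev: l1_aspm=$(cat "$knob")"
}
```

Run as root, disable_l1 0000:01:00.0 would be the call for the GPU here. The setting does not survive a reboot, and the lspci LnkCtl check is still the way to confirm it took effect.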

After moving my computer around some, this issue returned even with “Prefer Maximum Performance” enabled, so I have concluded that this is in fact a hardware issue. I RMA’d my GPU, but the replacement is doing the exact same thing, so I suppose this is no longer an NVIDIA issue.