XID 79, GPU has fallen off the bus - Happens on idle, only

Hello,

I am running Ubuntu 22.04.5 with NVIDIA kernel module 560.35.03.
The CPU is an AMD EPYC 7702, and the motherboard is an ASRock Rack ROMED8-2T.
The graphics card has a 1500W 80+ Platinum PSU available to it. The motherboard and other system components are on a separate power supply.

I have reseated the PCIe connections. I have swapped the power cables for the GPU. I have tried this GPU in another PCIe slot that is confirmed working properly. The power supply connected to the GPU has another GPU also connected that has reported no errors.

Swapping this GPU into another PCIe slot with a working GPU results in the same error except the bus number is changed to reflect the new PCIe slot it is in.

I ran gpu_burn for 1hr with max settings. Thermals never climbed above 80C and power maxed out at 450W. There were no errors during this time.

After idling for some time, in the span of 2-3 hours, the GPU will again fall off the bus.

I am open to trying any suggestions. I suspect a hardware issue: after swapping out every other variable, the GPU itself is the one constant.
nvidia-bug-report.log (6.9 MB)
Here is the bug report log when the crash happened randomly while idling.

And here is the bug report log from when gpu_burn was run with no errors, before the GPU fell off the bus later by itself.
nvidia-bug-report.log.gz (2.4 MB)

The changelogs for my motherboard's BIOS updates also do not indicate any potential fix for this error, and I have been advised against updating the BIOS haphazardly.


Hello,

I had "XID 79, GPU has fallen off the bus" every time I used my laptop GPU and it went idle.
On Windows, I solved the GPU crashes by turning off ASPM with the "Link State Power Management" setting.
On Linux, I tried to disable ASPM with pcie_aspm=off as a kernel parameter, but it didn't solve the problem.
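
A possible reason the kernel parameter did not help: as I understand the kernel parameter documentation, pcie_aspm=off tells the kernel not to touch the ASPM configuration at all, so whatever the firmware set up stays in place. Here is a minimal sketch for checking what the running system is doing. The function names are my own, and the policy file path is the usual location of the pcie_aspm module parameter, which I am assuming exists on your kernel.

```shell
#!/usr/bin/env bash

# Print the active ASPM policy, i.e. the bracketed entry in the policy file.
# Usage: active_aspm_policy [POLICY_FILE]
# The default path is an assumption about where the kernel exposes the policy.
active_aspm_policy() {
    local policy_file="${1:-/sys/module/pcie_aspm/parameters/policy}"
    sed -n 's/.*\[\([^]]*\)\].*/\1/p' "$policy_file"
}

# Succeed if the given kernel command line contains pcie_aspm=off.
# Usage: cmdline_disables_aspm "$(cat /proc/cmdline)"
cmdline_disables_aspm() {
    case " $1 " in
        *" pcie_aspm=off "*) return 0 ;;
        *) return 1 ;;
    esac
}
```

Typical use would be `cmdline_disables_aspm "$(cat /proc/cmdline)" && echo "parameter is set"`, followed by `active_aspm_policy` to see which policy is selected.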

Edit: A better solution is described below

I solved the issue by following instructions to disable ASPM for a targeted PCI device. I found the tutorial on this website:

For example, this is the command that solved the problem on my hardware:
setpci -s 00:01.0 0x50.B=0x40
The command depends on the root complex and the control register of the GPU.
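
To adapt this to other hardware without hard-coding the register offset, setpci can address the PCI Express capability by name: the Link Control register sits at offset 0x10 of that capability, and its two lowest bits select the ASPM mode. The sketch below assumes that layout and uses a helper name of my own. It computes the byte to write back with only the ASPM bits cleared. The value 0x40 in my command is consistent with this (bit 6 kept, bits 1:0 cleared), assuming the capability on my root port starts at 0x40.

```shell
#!/usr/bin/env bash

# Clear the two ASPM control bits (bits 1:0) of a Link Control byte.
# Takes the byte as hex without a 0x prefix (the form setpci prints)
# and prints the new byte in the same form.
clear_aspm_bits() {
    printf '%02x\n' $(( 0x$1 & 0xfc ))
}

# Example use on real hardware (run as root; the address is only an example):
#   dev=00:01.0
#   cur=$(setpci -s "$dev" CAP_EXP+10.b)
#   setpci -s "$dev" CAP_EXP+10.b="$(clear_aspm_bits "$cur")"
```

Reading the register first and masking it avoids overwriting the other Link Control bits with a value copied from someone else's machine.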

To make it permanent, I have made a systemd service that executes the command on each boot:

[Unit]
Description=Disable ASPM for the Nvidia GPU

[Service]
Type=simple
ExecStart=/bin/bash -c 'setpci -s 00:01.0 0x50.B=0x40'

[Install]
WantedBy=multi-user.target

By the way, I'm still surprised this worked.
If you try it, let me know if it works for you.

Edit:

I've found a far less hacky solution.
It's possible to control ASPM via sysfs by writing to the files in /sys/bus/pci/devices/.../link/ as administrator.

So now, in my case the commands are:
su
echo 0 > '/sys/bus/pci/devices/0000:01:00.0/link/l1_aspm'
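
On my machine only l1_aspm mattered, but the link/ directory can contain several ASPM state files (for example l0s_aspm or l1_1_aspm, depending on what the link supports). If you want to switch all of them off at once, here is a rough sketch. The function name and the optional second argument are my own additions, the latter only so it can be tried against a scratch directory first.

```shell
#!/usr/bin/env bash

# Write 0 to every ASPM state file under a device's link/ directory
# and print the resulting value of each file.
# Usage: disable_all_aspm DEVICE [SYSFS_DEVICES_DIR]
disable_all_aspm() {
    local dev="$1"
    local base="${2:-/sys/bus/pci/devices}"
    local f
    for f in "$base/$dev"/link/*_aspm; do
        # Skip files that do not exist or that we may not write to.
        [ -w "$f" ] || continue
        echo 0 > "$f"
        printf '%s=%s\n' "${f##*/}" "$(cat "$f")"
    done
}

# Example use on real hardware (run as root; the address is the one from my setup):
#   disable_all_aspm 0000:01:00.0
```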

And my unit file is:

[Unit]
Description=Disable ASPM for the Nvidia GPU

[Service]
Type=simple
ExecStart=/bin/bash -c "echo 0 > '/sys/bus/pci/devices/0000:01:00.0/link/l1_aspm'"

[Install]
WantedBy=multi-user.target

This did not work for me, with an eGPU connected via TB4. I suspect the culprit is the PSU of the eGPU enclosure.

Does this occur when your eGPU is idle or under load?

Both, but more often when idle. I suspect it's a loss of power due to incorrect power management. I'm trying out the new 570.124.04 driver release, and it seems to be much better.

I have the 570.124.04 driver too (the open version), and everything works surprisingly well (except for crashes when ASPM is activated).

What is the "open version"? How do I tell if I have that?

I'm not an Ubuntu user, but if it is like other distros, you now have a choice between two flavors of the proprietary NVIDIA driver: the original architecture or the new "open" one (the kernel part of the driver is open source, while the proprietary part is now firmware loaded onto the card). Not to be confused with nouveau. For some instructions, see the r/linux post "Open Source NVIDIA driver available with Ubuntu, but user action is necessary to switch from original driver to new 'open kernel' driver using the 'Additional drivers' tool".

I'm using Pop!_OS (System76); it looks as though the default install is in fact the "open" kernel package.

Can you prevent ASPM from being activated with the "open kernel" drivers?

Will your method (detailed above) work?

I do not know, sorry. There are a few things that cannot be deactivated using the open driver, such as GSP as far as I know.
Note the open driver will eventually replace the older one. For Blackwell it is even a prerequisite.
You can find more info on the open driver in the documentation: Chapter 44. Open Linux Kernel Modules

I don't think there are major differences in ASPM handling between the two driver versions, but I am not sure.

If you use the second method I found and run sudo lspci -vvv, you should see something like this for your GPU:
LnkCtl: ASPM Disabled;
instead of LnkCtl: ASPM L1;
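
Since lspci -vvv prints a lot, here is a small sketch that pulls just the ASPM field out of the LnkCtl line. The function name is my own, and the pattern assumes the "LnkCtl: ASPM ...;" layout shown above.

```shell
#!/usr/bin/env bash

# Read `lspci -vvv` output on stdin and print the ASPM field
# of every LnkCtl line (the text between "ASPM " and the first ";").
lnkctl_aspm_state() {
    sed -n 's/.*LnkCtl:[[:space:]]*ASPM \([^;]*\);.*/\1/p'
}

# Example use on real hardware (the address is the one from my setup):
#   sudo lspci -vvv -s 01:00.0 | lnkctl_aspm_state
```

If the sysfs write took effect, this should print Disabled for the GPU.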

With both methods, a GPU monitoring app like nvtop shows the idle RX and TX rates of the GPU rising:
They go from RX: 300.0 KiB/s TX: 300.0 KiB/s to RX: 500.0 KiB/s TX: 500.0 KiB/s.

Okay, yeah, I can't use nvtop in Pop!_OS because of the way they install drivers. But I'll try this approach and post back.

Gives me this:

LnkCtl:	ASPM Disabled; RCB 64 bytes, Disabled- CommClk+
			ExtSynch- ClockPM+ AutWidDis- BWInt- AutBWInt-

And looks as if the driver is always on:

$ systemctl status nvidia-persistenced.service
ā— nvidia-persistenced.service - NVIDIA Persistence Daemon
     Loaded: loaded (/lib/systemd/system/nvidia-persistenced.service; static)
     Active: active (running) since Thu 2025-03-27 07:22:58 EDT; 1min 8s ago
    Process: 1375 ExecStart=/usr/bin/nvidia-persistenced --user nvidia-persistenced --no-persistence-mode --verbose (code=exi>
   Main PID: 1379 (nvidia-persiste)
      Tasks: 1 (limit: 76504)
     Memory: 1.0M
        CPU: 2ms
     CGroup: /system.slice/nvidia-persistenced.service
             └─1379 /usr/bin/nvidia-persistenced --user nvidia-persistenced --no-persistence-mode --verbose

No change. The new driver made a huge improvement, but there are still occasional freezes.

You have successfully disabled ASPM and it did not help, so maybe your issue is not related to mine. In my case, I think "fallen off the bus" only happens after the GPU has been idle for a moment.

I'd like to point out that every time you reboot or replug the graphics card, the settings are reset and ASPM is reactivated. I mention this to make sure that ASPM wasn't reactivated during your tests.

I hope you find a fix, it took me quite a while to find one.


Each new driver improves things, most recently the 570.133.x driver, so maybe NVIDIA will fix it.