I am running Ubuntu 22.04.5 with NVIDIA kernel module 560.35.03.
The CPU is an AMD EPYC 7702, and the motherboard is an ASRock Rack ROMED8-2T.
The graphics card has a 1500W 80+ Platinum PSU available to it. The motherboard and other system components are on a separate power supply.
I have reseated the PCIe connections. I have swapped the power cables for the GPU. I have tried this GPU in another PCIe slot that is confirmed working properly. The power supply connected to the GPU has another GPU also connected that has reported no errors.
Swapping this GPU into another PCIe slot with a working GPU results in the same error except the bus number is changed to reflect the new PCIe slot it is in.
I ran gpu_burn for 1hr with max settings. Thermals never climbed above 80C and power maxed out at 450W. There were no errors during this time.
After idling for some time, in the span of 2-3 hours, the GPU will again fall off the bus.
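To pin down exactly when the card drops off, the kernel log can be filtered for the Xid 79 message. This is only a small sketch: the function name `xid79_lines` is made up for illustration, and the pattern assumes the usual `NVRM: Xid (PCI:...): 79,` wording of the driver's log line.

```shell
#!/bin/sh
# Print only the kernel log lines that report Xid 79
# ("GPU has fallen off the bus"). Log text is read from stdin, so the
# same function works with dmesg, journalctl -k, or a saved log file.
# Assumption: the driver logs the event as "NVRM: Xid (PCI:<addr>): 79, ...".
xid79_lines() {
    grep -E 'NVRM: Xid \(PCI:[0-9a-fA-F:.]+\): 79,'
}

# Example (needs permission to read the kernel log):
#   journalctl -k -o short-iso | xid79_lines
```

Comparing the timestamps of the matching lines with the idle periods should show whether the drop really only happens at idle.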
I am open to trying any suggestions. I suspect a hardware issue: after swapping out every other variable, the GPU itself remains the only constant. nvidia-bug-report.log (6.9 MB)
Here is the bug report log when the crash happened randomly while idling.
Here is also the bug report log from the run where GPU burn completed with no errors, before the GPU fell off the bus later by itself. nvidia-bug-report.log.gz (2.4 MB)
The changelogs for my motherboard's BIOS updates do not mention any fix for this error, and I have been advised against updating the BIOS haphazardly.
I had "XID 79, GPU has fallen off the bus" every time my laptop GPU was in use and then went idle.
On Windows, I solved the GPU crashes by turning off ASPM with the setting "Link State Power Management".
On Linux, I tried to disable ASPM with pcie_aspm=off as a kernel parameter, but it didn't solve the problem.
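One possible reason the parameter did not help: as far as I understand the kernel parameter documentation, pcie_aspm=off tells the kernel not to touch the ASPM configuration at all, so whatever the firmware enabled stays enabled. The sketch below only checks what the kernel was booted with and which policy is active. The two function names are made up for illustration, and the policy file is assumed to mark the active entry with square brackets.

```shell
#!/bin/sh
# Succeeds if the given kernel command line contains pcie_aspm=off.
cmdline_has_aspm_off() {
    case " $1 " in
        *" pcie_aspm=off "*) return 0 ;;
        *) return 1 ;;
    esac
}

# Print the active ASPM policy, i.e. the word in square brackets, from
# the contents of /sys/module/pcie_aspm/parameters/policy.
active_aspm_policy() {
    printf '%s\n' "$1" | sed -n 's/.*\[\([^]]*\)\].*/\1/p'
}

# Example:
#   cmdline_has_aspm_off "$(cat /proc/cmdline)" && echo "pcie_aspm=off is set"
#   active_aspm_policy "$(cat /sys/module/pcie_aspm/parameters/policy)"
```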
Edit: A better solution is described below
I solved the issue by following instructions to disable ASPM for a targeted PCI device. I found the tutorial on this website:
For example, this is the command that solved the problem on my hardware: setpci -s 00:01.0 0x50.B=0x40
The command depends on the root complex and on the control register of the GPU.
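For anyone adapting this to other hardware, here is a rough sketch of where the two hardware-specific parts come from. It assumes the standard PCIe layout, where the Link Control register sits at offset 0x10 of the PCI Express capability and its bits 1:0 are the ASPM control field. On my machine that would put the capability at 0x40, which is why the register is 0x50. The helper names `parent_bridge` and `clear_aspm_bits` are made up for illustration.

```shell
#!/bin/sh
# Given the resolved sysfs path of a PCI device, print the address of the
# bridge directly above it (the port whose link settings apply).
# If the device sits directly on the root bus, this prints the root bus
# directory name (e.g. pci0000:00) instead of a bridge address.
parent_bridge() {
    basename "$(dirname "$1")"
}

# Clear bits 1:0 (ASPM Control) of a Link Control low byte and keep the
# other bits, e.g. bit 6 (Common Clock Configuration).
clear_aspm_bits() {
    printf '0x%02x\n' $(( $1 & 0xfc ))
}

# Example (as root; device addresses are from my machine):
#   parent_bridge "$(readlink -f /sys/bus/pci/devices/0000:01:00.0)"
#   setpci -s 00:01.0 CAP_EXP+10.b       # read the current low byte
#   setpci -s 00:01.0 CAP_EXP+10.b=0:3   # value:mask form, clears only bits 1:0
```

Reading the register first and writing back the value from `clear_aspm_bits` avoids overwriting bits that happen to be set differently on other hardware.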
To make it permanent, I have made a systemd service that executes the command on each boot:
[Unit]
Description=Disable ASPM for the Nvidia GPU
[Service]
Type=simple
ExecStart=/bin/bash -c 'setpci -s 00:01.0 0x50.B=0x40'
[Install]
WantedBy=multi-user.target
By the way, I'm still surprised this worked.
If you try it, let me know if it works for you.
Edit:
I've found a far less hacky solution.
It's possible to control ASPM via sysfs by writing to the files in /sys/bus/pci/devices/.../link/ as administrator.
So now, in my case, the command (run as root, for example from a su shell) is: echo 0 > '/sys/bus/pci/devices/0000:01:00.0/link/l1_aspm'
And my unit file is:
[Unit]
Description=Disable ASPM for the Nvidia GPU
[Service]
Type=simple
ExecStart=/bin/bash -c "echo 0 > '/sys/bus/pci/devices/0000:01:00.0/link/l1_aspm'"
[Install]
WantedBy=multi-user.target
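The sysfs write can also be wrapped in a small function that fails loudly when the attribute is missing, for instance on a kernel that does not expose it or when the device address differs. This is just a sketch: `disable_l1_aspm` is a made-up name, and the optional second argument (an alternative sysfs root) exists only so the function can be tried against a scratch directory.

```shell
#!/bin/sh
# Usage: disable_l1_aspm <domain:bus:dev.fn> [sysfs_root]
# Writes 0 to the device's link/l1_aspm attribute and prints the value
# read back. Returns non-zero if the attribute is missing or not writable.
disable_l1_aspm() {
    bdf=$1
    root=${2:-/sys}   # hypothetical override, only useful for dry runs
    attr="$root/bus/pci/devices/$bdf/link/l1_aspm"
    if [ ! -w "$attr" ]; then
        echo "cannot write $attr (missing attribute, or not root?)" >&2
        return 1
    fi
    echo 0 > "$attr" && cat "$attr"
}

# Example (as root):
#   disable_l1_aspm 0000:01:00.0
```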
Both; but more often when idle - I suspect it's a loss of power due to incorrect power management. Trying out the new 570.125.124.04 driver release - it seems to be much better.
I do not know, sorry. There are a few things that cannot be deactivated using the open driver, such as GSP as far as I know.
Note the open driver will eventually replace the older one. For Blackwell it is even a prerequisite.
You can find more info on the open driver in the documentation: Chapter 44. Open Linux Kernel Modules
I don't think there are major differences in ASPM handling between the two driver versions, but I am not sure.
If you use the second method I found and run sudo lspci -vvv, you should see something like this for your GPU: LnkCtl: ASPM Disabled;
instead of LnkCtl: ASPM L1;
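To check this without scrolling through the whole lspci dump, the LnkCtl line can be filtered out. A minimal sketch, assuming the line has the form `LnkCtl: ASPM <state>; ...`; the function name `aspm_state` is made up for illustration.

```shell
#!/bin/sh
# Read `lspci -vvv` output on stdin and print the ASPM state of every
# LnkCtl line, i.e. the text between "ASPM " and the first semicolon.
aspm_state() {
    sed -n 's/^[[:space:]]*LnkCtl:[[:space:]]*ASPM \([^;]*\);.*/\1/p'
}

# Example (01:00.0 is the GPU address on my machine):
#   sudo lspci -vvv -s 01:00.0 | aspm_state
```

Running it again after a reboot, or after some idle time, is a quick way to confirm the setting is still in place.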
With both methods, a GPU monitoring app like nvtop shows the idle RX and TX rates of the GPU rising:
They go from RX: 300.0 KiB/s TX: 300.0 KiB/s to RX: 500.0 KiB/s TX: 500.0 KiB/s
You have successfully disabled ASPM and it did not help, so maybe your issue is not related to mine. I think in my case, "fallen off the bus" only happens after the GPU has been idle for a moment.
I'd like to point out that every time you reboot or replug the graphics card, the settings are reset and ASPM is reactivated. I mention this to make sure that ASPM wasn't reactivated during your tests.
I hope you find a fix, it took me quite a while to find one.