My system crashes with Xid 61 after a few days. If I set nvidia-smi -lgc 1000,2145, I also get an additional Xid 38 after the Xid 61. When the two happen together, X stays frozen and I cannot use any window. If only Xid 61 happens, I can still use some windows and close the programs using the GPU, but there is no way to reset the GPU, since nvidia-smi says it is in use (by X); if I close X, the monitor goes off and the GPU does not come back, so I am forced to reboot. I have seen the errors at intervals of 4, 5, 7, 8, and up to 10 days. Leaving a terminal open with the following command:

nvidia-smi --query-gpu=timestamp,name,pci.bus_id,driver_version,pstate,pcie.link.gen.current,temperature.gpu,clocks.gr,clocks.mem,power.draw --format=csv -l 60

shows me that the card never goes to P8; the last states logged are P5 with PCIe gen 2. I will now raise the lgc range to 1300,2145 and decrease the interval of the nvidia-smi query command to 10 seconds to see how it reacts.
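To check a log like this after a crash, a short script can list which P-state / PCIe generation pairs were ever recorded and show the last sample. This is only a sketch: it assumes the output of the command above was saved to a file, that the columns arrive in the order of the --query-gpu list, and the function names are made up for illustration.

```python
import csv
import io

# Column order copied from the --query-gpu list in the command above.
FIELDS = ["timestamp", "name", "pci.bus_id", "driver_version", "pstate",
          "pcie.link.gen.current", "temperature.gpu", "clocks.gr",
          "clocks.mem", "power.draw"]


def parse_log(text):
    """Parse nvidia-smi CSV output into a list of dicts, skipping the header row."""
    rows = []
    reader = csv.reader(io.StringIO(text), skipinitialspace=True)
    for row in reader:
        # Ignore blank lines, the header, and anything with an unexpected column count.
        if len(row) != len(FIELDS) or row[0] == "timestamp":
            continue
        rows.append(dict(zip(FIELDS, (cell.strip() for cell in row))))
    return rows


def summarize(rows):
    """Return the set of (pstate, PCIe gen) pairs seen and the last sample (or None)."""
    seen = {(r["pstate"], r["pcie.link.gen.current"]) for r in rows}
    return seen, (rows[-1] if rows else None)
```

For example, summarize(parse_log(open("gpu-log.csv").read())), where gpu-log.csv is a hypothetical file name. If ("P8", "1") never shows up in the first value, the card never reached P8 / gen 1 while the log was running.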

Here is my system information:
Motherboard: TUF GAMING X570-PLUS
CPU: AMD Ryzen 7 3700X
Kernel/OS: Linux 5.6.16/Gentoo
Drivers: nvidia-drivers-440.82-r3

I’m another affected user.

OS: Kubuntu 20.04 x86_64 5.4.0-40-generic
GFX: NVIDIA GeForce 2070 Super (MSI Gaming X Trio)
Driver: 440.100 CUDA Version: 10.2
Motherboard: ASRock Taichi TRX40
Processor: AMD 3960X
Monitor: Acer CB280HK (Displayport 4K 60hz)

Symptoms are the usual:

  • happens spontaneously, sometimes days/weeks without problems
  • after bug hits I can still ssh in without problem
  • usually one process is pegged at 100%
  • nvidia-smi and any other GPU related process hangs if started
  • only cold reboot fixes the problem
  • Xid 61, sometimes followed by an Xid 8

I can provide dmesg, syslog and kernel logs for at least 11 occasions. As for the frequency of the bug: this month alone it hit on July 7, 8, 9, 15, 20, 21, 22 and twice on the 23rd.

Since today I’m on the “nvidia-settings -a [gpu:0]/GPUPowerMizerMode=1” regime and hoping for the best, though I’m not thrilled about power consumption in this mode.

Gigantic thanks to @Uli1234, you are the man!

I have a Ryzen 9 3900x on Asus Pro WS x570-ACE. I have flashed the BIOS to the latest version (2103, released on 29/06/2020 by Asus) and even without changing the base clock, it may solve the problem (4th day w/o crash). [edit] I’m on Ubuntu 20.04 with everything very up to date, and NVIDIA-440.100 for my GTX 1650 Super (from Gigabyte).

Nvidia-settings reports “current pcie link speed”, jumping from 2.5GT/s to 8.0GT/s.

ps: for me, if I didn’t use video (youtube, VLC, etc.) it didn’t crash (rapidly?). It almost always crashed while playing a video.

I can’t update my previous post, so I will post the update here:

I worked 10 hours with my computer using the solution proposed by @Uli1234 during Wednesday, Thursday and Friday and I got no errors. I think this temporary solution works for my current setup.

If you’re still experiencing hangs, even after locking your GPU’s clock frequencies, keep an eye out for audio (maybe others) drivers/modules/etc. that may also mess with the power state of the GPU.

In my case, I was getting Xid 61 after other seemingly unrelated reports from “snd_hda_intel” which attempts to auto discover and configure audio sources (https://docs.slackware.com/howtos:hardware:audio_and_snd-hda-intel). But they would always occur together, so after about the 3rd reboot, I started getting suspicious.

One of the other things snd_hda_intel apparently does is attempt to put audio devices to sleep to save power, which, given that these errors and hangs we’re experiencing are related to switching power modes, seemed like a likely culprit.

I’ve since added the file /etc/modprobe.d/audio_disable_powersave.conf that just has the body “options snd_hda_intel power_save_controller=N”. You can also run

echo N > /sys/module/snd_hda_intel/parameters/power_save_controller

as root; however, that setting will likely be reset after a reboot.
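To confirm which value is actually in effect after a reboot, the same sysfs file can be read back. A minimal sketch, assuming the kernel reports boolean module parameters as Y or N; the function name is hypothetical.

```python
from pathlib import Path

# sysfs path taken from the echo command above.
PARAM = Path("/sys/module/snd_hda_intel/parameters/power_save_controller")


def controller_power_save_enabled(param_path=PARAM):
    """Return True or False for the current setting, or None if the file is missing
    (for example because snd_hda_intel is not loaded)."""
    try:
        value = param_path.read_text().strip()
    except FileNotFoundError:
        return None
    # Assumption: an enabled boolean parameter reads back as "Y" (or "1").
    return value.upper() in ("Y", "1")
```

A result of False means the modprobe option (or the echo) took effect.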

Going on 3 days now (FINALLY) without having to reboot the machine due to Xid 61 stuff.

Thanks for that useful information. Since audio is often integrated within the graphics card it makes sense to have a look at that module too.

Hi, I am not sure if the PowerMizer setting really prevents low power modes (=switching to PCIe Gen1) or if it is just a preference. I would put more trust in locking the GPU frequency. If the issue occurs again I would try that as a next step.

Did the min freq of 1300 MHz work for you?

I still haven’t gotten the error, and my system has been up for 4 days. Will report back if anything happens.

You can give your card a frequency range that it can work within. It just shouldn’t switch down to PCIe gen1 or the P8 state.
Locking it at 1600 is, in my opinion, an option but not the best one (regarding power consumption and heat). I would try to go with 1000-1800 or 1300-2000. I think the exact range depends on the model you have. You can play around a bit with the settings.
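If you want to try several ranges, a tiny wrapper around nvidia-smi -lgc saves some typing and catches a reversed range. This is just a sketch: the function names are made up, the ranges are the ones suggested above, and applying the lock needs root.

```python
import subprocess


def lock_clocks_cmd(min_mhz, max_mhz):
    """Build the nvidia-smi command line that locks GPU clocks to min_mhz..max_mhz."""
    if not 0 < min_mhz <= max_mhz:
        raise ValueError(f"invalid clock range: {min_mhz},{max_mhz}")
    return ["nvidia-smi", "-lgc", f"{min_mhz},{max_mhz}"]


def apply_lock(min_mhz, max_mhz):
    """Run the command; raises CalledProcessError if nvidia-smi reports a failure."""
    subprocess.run(lock_clocks_cmd(min_mhz, max_mhz), check=True)
```

For example, apply_lock(1000, 1800). Which ranges a given card accepts is model dependent, so check the clocks nvidia-smi reports afterwards.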

Thank you for the reply.

Well, right now I’m already a few days in without any hiccup so far. Finally!

So if the situation remains stable for the next four weeks I’m going to try to limit the frequency directly as I’m rather wary of running in always-on performance mode. Still better than who-knows-if-your-machine-is-still-working mode.

Any information on an official fix? I’d guess that this is a rather brutal bug for customers to contend with. I wonder… have all those in this thread been suffering for months or just tried to RMA the affected cards. Before you contributed your workaround it was basically a Windows 98 experience for those affected, stability-wise. :) I know that I was growing irrationally restless during this period.

11th day without a crash (after flashing the bios to the latest version 2103).

Is it possible it was an x570 / x470 BIOS bug that threw the card into a state it could not get out of?

I got Xid 61 (followed by an Xid 38) today, after 7 days of uptime. Will try min freq of 1400 MHz.
The GPU state was like this:

It crashed with smi/irq showing up in top (they didn’t eat 100% CPU, just 60% or so in one core/thread) after 12-13 days.

Will now try with overclock

I don’t know if anyone saw my post earlier linking to a thread showing the same problem on Windows. After updating my BIOS to a version with AGESA PI the problem has gone away. It’s been a month now and I haven’t had the problem once, and I was seeing the problem once or twice a week before. I’m curious if the problem the NVIDIA driver is seeing was due to some bug in AMD’s firmware that exposed an edge case that wasn’t being handled. Having a fix from both sides would be extra good.

I never go to P8/PCIe gen1 and still get those Xid freezes regularly. I worry it’s a placebo fix. I just got a new BIOS update with the hot new PI.

sometimes a sudden reboot, Xorg.log last message:

(EE) client bug: timer event12 debounce short: scheduled expiry is in the past (-9ms), your system is too slow

and of course resume problems like windows don’t repaint after resume or even after just pre-suspend monitor turn off and on (when you move the mouse to cancel the suspension). Firefox would reset high framerate to low framerate and has to be restarted. Then compositor problems, effects are slow (frozen for a second). What a great Linux experience on AMD chipset, and ASUS mobo. I switched to RTX card after trying several models of AMD gpus as that experience was even worse. Those cards didn’t even boot.

it has not been a placebo fix for me. literally no crashes since the fix in more than 3 months (previous frequency once per week)

The evidence from other people says otherwise, looking at the replies since this workaround. I’m getting the freeze several times a day, with literally no transition of power states:

GeForce RTX 2070 SUPER, 00000000:07:00.0, 440.100, P5, 2, 54, 660 MHz, 810 MHz, 14.98 W
GeForce RTX 2070 SUPER, 00000000:07:00.0, 440.100, P5, 2, 54, 660 MHz, 810 MHz, 15.09 W
kernel: NVRM: Xid (PCI:0000:07:00): 61, pid=425, 0cec(3098) 00000000 00000000
GeForce RTX 2070 SUPER, 00000000:07:00.0, 440.100, P5, 2, 54, [Unknown Error], [Unknown Error], [Unknown Error]
GeForce RTX 2070 SUPER, 00000000:07:00.0, 440.100, P5, 2, 55, [Unknown Error], [Unknown Error], [Unknown Error]
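Xid lines like the one in the log above can be pulled out of dmesg or syslog output with a few lines of Python, which helps when comparing many incidents. A rough sketch, assuming the kernel messages follow the "NVRM: Xid (PCI:...): N" shape shown above; the function names are hypothetical.

```python
import re
from collections import Counter

# Matches e.g. "NVRM: Xid (PCI:0000:07:00): 61, pid=425, ..."
XID_RE = re.compile(r"NVRM: Xid \(PCI:([0-9A-Fa-f:.]+)\): (\d+)")


def find_xids(lines):
    """Yield (pci_address, xid_number) for every NVRM Xid line."""
    for line in lines:
        match = XID_RE.search(line)
        if match:
            yield match.group(1), int(match.group(2))


def count_xids(lines):
    """Count how often each Xid number occurs in the given log lines."""
    return Counter(xid for _, xid in find_xids(lines))
```

Feeding it the lines of a saved kernel log gives a quick count per Xid number, for instance how often Xid 61 was followed by another Xid.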

will now try 1) SMT 2) power_save_controller 3) turn off internal audio

For reference, the frequency fix works great for me. Previously, my system was freezing at least once per day. Since using the fix, it has never crashed (only that one time when I forgot to set the minimum frequency :D).
This is on Ubuntu 20.04, Ryzen9 3900x, RTX2070Super, Asus PRIME X570-P mainboard.