Random Xid 61 and Xorg lock-up

@jameskzd28

1.) Are you sure the system didn’t reboot during those 4 days? It would help to put the PowerMizer setting in a startup script

2.) When the issue appeared again after 4 days, are you sure it’s your card #0 that triggered the issue? It might be worth trying to apply the PowerMizer setting to all three cards
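For both points, a minimal startup script could look like the sketch below. It is an assumption on my part that the three cards are indexed gpu:0 through gpu:2 and that an X session is available for `nvidia-settings`; PowerMizerMode value 1 corresponds to “Prefer Maximum Performance”.

```shell
#!/bin/sh
# Hypothetical startup script: apply "Prefer Maximum Performance"
# (GpuPowerMizerMode=1) to all three cards after every boot.
# Assumes nvidia-settings can reach the running X display.
for gpu in 0 1 2; do
    nvidia-settings -a "[gpu:${gpu}]/GpuPowerMizerMode=1"
done
```

Run it from your session autostart so the setting survives reboots.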

3.) Instead of PowerMizer you could try to lock the frequencies as explained in this thread, for example to 1000-2000 MHz. In my case, with a minimum frequency of 1000 MHz the card always stays in P0.
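As a concrete sketch of the locking step (the 2000 MHz ceiling here is just the example range from above; check your card’s supported range with `nvidia-smi -q -d SUPPORTED_CLOCKS` first):

```shell
# Enable persistence mode so the lock survives the driver going idle,
# then pin the graphics clocks of GPU 0 to the 1000-2000 MHz range.
sudo nvidia-smi -i 0 -pm 1
sudo nvidia-smi -i 0 -lgc 1000,2000

# To undo the lock later:
# sudo nvidia-smi -i 0 -rgc
```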

@Uli1234 thanks for all of the work you’re doing to track down this issue

1.) I’m sure the system didn’t reboot during the 4 day period. I ran a simple script to monitor the situation and that script would not have been running if the system had been rebooted.
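The actual script isn’t shown in the thread; one hedged sketch of such a monitor, run every minute from cron, could look like this (the log path and the fields queried are my assumptions, not the poster’s script):

```shell
#!/bin/sh
# Hypothetical uptime/Xid monitor (a sketch, not the poster's script).
# Run from cron every minute; a gap in the timestamps reveals a reboot,
# and the most recent kernel "Xid" line, if any, is logged alongside
# each card's current performance state.
LOG="$HOME/xid-monitor.log"
{
    printf '%s ' "$(date '+%F %T')"
    nvidia-smi --query-gpu=index,pstate --format=csv,noheader 2>&1
    dmesg 2>/dev/null | grep -i 'NVRM: Xid' | tail -n 1
} >> "$LOG"
```

If the machine rebooted, the gap between consecutive timestamps in the log would immediately show it.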

2.) I’m 90% sure it was still card #0 but I’ll check again next time to be sure. Just yesterday after I posted I saw that card #0 went to P8 briefly a couple hours after the PowerMizer setting was applied.

3.) That’s a good suggestion, thanks.

Just to follow up, the issue reappeared just now and it was still with card #0, with no intervening reboots. Going to try the frequency locking approach now.

@Uli1234 The Xid 61 error happened to me again just now. And I’m sure that the settings were applied.

Also, I experienced a different segmentation fault error recently, like @dawdaw did.

@jameskzd28 Not good news. I would try the following now:

1.) Set persistence mode
2.) Set PowerMizerMode (Max Performance)
3.) And in addition lock the frequency to a minimum of 1000 MHz

Double layer approach. Might be worth a try…
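Sketched as commands, under the assumptions that gpu:0 is the affected card, an X session is running for `nvidia-settings`, and 2000 MHz is a plausible upper bound for the lock (adjust to your card):

```shell
# 1.) Persistence mode keeps the driver initialized even with no clients.
sudo nvidia-smi -pm 1
# 2.) Prefer Maximum Performance (PowerMizerMode=1); needs a running X session.
nvidia-settings -a '[gpu:0]/GpuPowerMizerMode=1'
# 3.) Additionally lock the graphics clocks with a 1000 MHz floor
#     (the 2000 MHz ceiling is an assumed example value).
sudo nvidia-smi -lgc 1000,2000
```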

@han310 Thank you for the feedback. Could you try to set MaxPerformanceMode as well as locking the GPU frequencies?

Just confirming this: after switching back to my GTX 1060 from my RTX 2070 Super, I’ve had no slowdowns for 36 days.

Also, FWIW, I think something is still glitchy with the GTX card, based on the fact that Chrome now freezes up every now and then and I have to restart it. I suspect that where the RTX card would cause the entire system to grind to a halt, the GTX is more stable and only the application in use is affected. Maybe something to do with the Turing architecture code?

@OldToby What’s your status? Did the issue occur again on your system?

Have this issue on both Windows and Linux. As others have pointed out, it only seems to happen when the GPU goes into a low power state. Doesn’t seem to happen if forced to stay in low power, also doesn’t happen if kept in a high power state. Most reliable way I can reproduce it is by doing stuff that causes the GPU to flip flop in and out of low power, but even then, it still seems very random.

For me, generally happens every 1-2 weeks or so.

Ryzen 7 3700X
MSI RTX 2070 Super Ventus GP

I locked the frequencies and set the PowerMizerMode to max. Will let you guys know if it occurs again.
Let’s hope nvidia is making progress towards resolving this issue.


IMO, I don’t think clocks are related here. It only happens for me when the GPU is actively doing something, which is a little different from some here. It almost always happens when in a WebEx session (Chrome using the GPU), or at least that’s when it’s the biggest problem. These days most of my apps are basically Chrome wrappers: Slack (Electron), Outlook (Chrome app), WebEx (Chrome website).

Here is a rather long thread about the same issue on Windows as some others were referring to. https://forums.tomshardware.com/threads/hang-freeze-crash-event-id-14-nvlddmkm-amd-nvidia.3594431/

@Uli1234

So after about 2-3 weeks, I do not think I had an Xid 61 lock-up. However, I have had a different issue which may or may not be related to changing the minimum clock speed. Will investigate.

Thanks for the feedback. If you have more information on your new issue, I am interested to hear about it.


For me (system specs above), I have an uptime of 32 days now, without any crashes. This is the first time this year that I’ve had an uptime of more than a week.

This is after
sudo nvidia-smi -pm ENABLED; sudo nvidia-smi -lgc 1000,2115


Just another entry in the “also experiencing this issue” category. Haven’t seen an x470 chipset mentioned, so maybe that’ll be something? (hah)

AMD 3700X
AsRock Fatal1ty x470 Motherboard (BIOS v3.3)
EVGA 2060 KO
Ubuntu 18.04, Awesome WM, NVidia driver 440

Going to try the power/clock setting solution people have posted.

Update: Had a system lockup the same day I set the power settings, although I was able to get a shell open and shut down gracefully (which I have not previously been able to do). Got the NVRM: Xid 61 log message right before it happened.

Thanks for the feedback. It seems setting the power settings is not as effective as locking the GPU frequencies. I guess it comes from the fact that MaxPerformanceMode sets the preference to maximum performance but doesn’t prevent the GPU from lowering the clock frequencies, so maybe after some idle time MaxPerformanceMode decreases them as well. Or it might be a temperature thing: when temperatures in the GPU are too high, it will throttle down.


Sorry, I meant the frequency locking when I said “power setting” (I conflated the two in my head since frequency locking impacts power). I’m hesitant to try the “Prefer performance” setting since it dramatically increases my computer’s power usage, and I don’t really consider that to be an acceptable long-term fix.

Additionally, disabling SMT did not work for me, and I still got an NVRM: Xid 61 message. Worth noting, though, is that the “lock up” this time was not particularly bad: the system only stuttered when launching a new GPU process and was otherwise quite usable. It still required a reboot to fix, though.

Right now I’m testing disabled SMT alongside disabled XMP RAM settings.

When I boot up, I often get a “PCIe error BadTLP” log message (almost exactly the same as what this person is reporting: https://unix.stackexchange.com/questions/543219/why-is-journalctl-reporting-pcie-bus-error-badtlp-and-baddllp), but with the XMP settings disabled I’m not getting it, for the first time ever. I also see a “nvidia-gpu***: i2c timeout error e0000000” message, though I was getting that all along.
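One way to scan a boot for these messages (standard `journalctl` flags; the exact message wording matched here is taken from the errors quoted above):

```shell
# Search the current boot's kernel log for the PCIe and i2c errors
# discussed above; prints "none found" if nothing matches.
journalctl -k -b 0 --no-pager 2>/dev/null \
    | grep -iE 'BadTLP|BadDLLP|i2c timeout' \
    || echo "none found"
```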

If the XMP settings change doesn’t help I may try the performance setting, however I may just buy a replacement motherboard and hope that fixes it. The cost of running the GPU at performance settings would probably pay for a new motherboard by the end of the year!

It’s a shame that neither NVidia nor AMD have been able to diagnose this issue. Makes the computer nearly unusable for serious work.

Edit: I also wanted to mention that my GPU is plugged in via a riser cable (small form factor case). It seems unlikely that that would cause any issues, but again, more information can’t hurt (and maybe other people have theirs plugged in via a riser and are noticing this issue too).