Random Xid 61 and Xorg lock-up

I just picked a medium-high clock speed that I saw in the nvidia-settings application while PowerMizer was in adaptive mode: something high enough that the memory transfer rate stays off its lowest setting.

J.

With the latest version, 460.39, from the official site, I have not had any freezes so far, even without setting the frequency manually. The mode is Adaptive. Aside from that, there was a kernel upgrade today which again gave me a black screen on reboot. Then I realized that the Nvidia kernel module has to be rebuilt for each new kernel, which led me to install DKMS and add the --dkms parameter when reinstalling the driver. No problems so far.
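The black screen after a kernel upgrade can be caught before rebooting by checking whether every installed kernel has an nvidia module built for it. A minimal sketch in Python, assuming the modules live somewhere under /lib/modules/RELEASE/ and are named nvidia.ko (possibly with a compression suffix such as .xz). The function names are made up for this example and are not part of any Nvidia or DKMS tool:

```python
import os


def find_nvidia_modules(kernel_release, modules_root="/lib/modules"):
    """Return paths of nvidia module files found for one kernel release."""
    found = []
    base = os.path.join(modules_root, kernel_release)
    # os.walk yields nothing if the directory does not exist.
    for dirpath, _dirnames, filenames in os.walk(base):
        for name in filenames:
            # Assumption: the module is nvidia.ko or a compressed nvidia.ko.*
            if name == "nvidia.ko" or name.startswith("nvidia.ko."):
                found.append(os.path.join(dirpath, name))
    return sorted(found)


def kernels_without_nvidia(modules_root="/lib/modules"):
    """Return installed kernel releases that have no nvidia module at all."""
    missing = []
    for release in sorted(os.listdir(modules_root)):
        is_dir = os.path.isdir(os.path.join(modules_root, release))
        if is_dir and not find_nvidia_modules(release, modules_root):
            missing.append(release)
    return missing
```

Calling `kernels_without_nvidia()` after a kernel upgrade and getting a non-empty list suggests the driver still needs to be rebuilt for those kernels. Where exactly the module ends up depends on the distribution and install method, which is why the sketch walks the whole tree instead of checking one fixed path.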

It is notable that by default Nvidia creates an Xorg configuration file itself for the best performance. This file does not exist in Ubuntu by default; instead, Ubuntu uses a series of configuration files in /usr/share/X11/xorg.conf.d to define the rules. Nvidia creates an xorg.conf in /etc/X11 to override any conflicting settings in that directory. Could the problem be caused by the absence of xorg.conf?

OK, reporting back. So far, I have had no problems with the manually installed official driver. It is worth noting that the Ubuntu repository has just updated its driver from 460.32 to 460.39, which may fix the problem on their end as well. That said, I do not want to go through another round of work to revert to the Ubuntu repository driver just to get the same version. If a new version is released in the future, I will notice and try it then. In the meantime, Xorg received an update alongside the driver.

It is also worth mentioning that the keyboard occasionally lags when the GPU is under load, for example in some (but not all) scenes while playing WoW. The mouse is not affected on Ubuntu, which matches what others in the same situation have reported. This is also closely related to Xorg and may involve libinput as well. I would recommend using evdev to manage the input devices, or letting Nvidia create its own Xorg file, to improve the chance of getting rid of the problem. In my experience, it showed up very often with libinput under Manjaro, but eased up after switching to evdev.

I’ve had an Xid-free life, but the happiness vanished in the last 3 months. With driver upgrades and ASUS BIOS upgrades, I’m now getting freezes in both Linux and Windows. Logs indicate nothing. Sometimes I get Xid 31, and the nvidia-smi hack doesn’t help anymore. The freezes also come more often, sometimes every 10 minutes. This happens with several Linux distributions, the 5 most recent drivers, and also Windows. Sometimes the NVIDIA card will not even initialize in POST and the ASUS mobo shows the white warning light (GPU problem). A million BIOS setting combinations didn’t help.
I’ve had enough. I won’t ever build a PC again and I just want to sell the NVIDIA card, but all other cards are out of stock due to the usual Bitcoin pandemic.

I’m in the same boat as many of the folks here: weird lockups on Linux. The weird part is that I get the lockup the first time I power up the machine after a break. Sometimes it locks up during the splash screen, sometimes in gdm, and sometimes I get to log in and it locks up shortly afterwards (in that case the first thing that happens is jerky cursor movement, then the lockup).
If needed I can dig out logs, but after a brief look (sadly with a somewhat inexperienced eye) nothing really caught my attention.
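For anyone who wants to check their own kernel log for Xid events, here is a minimal sketch in Python. It assumes the driver logs lines of the usual form `NVRM: Xid (PCI:0000:09:00): 61, pid=..., ...` and that the journal keeps messages from the previous boot (persistent journald storage). The function names are made up for this example:

```python
import re
import subprocess
from collections import Counter

# Assumed line shape: "NVRM: Xid (PCI:0000:09:00): 61, pid=1234, ..."
XID_PATTERN = re.compile(r"NVRM: Xid \((?P<device>[^)]*)\): (?P<code>\d+)")


def count_xids(log_lines):
    """Return a Counter mapping (device, Xid code) to number of occurrences."""
    counts = Counter()
    for line in log_lines:
        match = XID_PATTERN.search(line)
        if match:
            key = (match.group("device"), int(match.group("code")))
            counts[key] += 1
    return counts


def previous_boot_kernel_log():
    """Return kernel messages from the previous boot as a list of lines."""
    # -k: kernel messages only, -b -1: previous boot.
    result = subprocess.run(
        ["journalctl", "-k", "-b", "-1", "--no-pager"],
        capture_output=True, text=True, check=False,
    )
    return result.stdout.splitlines()
```

After a lockup and reset, `count_xids(previous_boot_kernel_log())` shows which Xid codes (if any) were logged before the hang. An empty result does not rule out a GPU problem, since a hard lockup can happen before anything reaches the log.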
After a reset the machine works fine (until the next poweroff/powerup cycle).
I’ve talked with some fellow IT guys and noticed a similar pattern: most of those with a Ryzen and Nvidia combo complain about this lockup. One mentioned that, to his surprise, he doesn’t get it when he launches some app through wine on boot, but I haven’t had a chance to verify whether this really helps on his machine.
I’ve tried setting the clocks on my 980, but when I try I get:

Setting locked GPU clocks is not supported for GPU 00000000:09:00.0.

I’ve set Coolbits to 28 and I see extra options in nvidia-settings, but I still get the same message when I try “nvidia-smi -lgc 1392,1392”.
Any suggestions as to what I’m doing wrong? I’d love to try the workarounds, as these lockups have become super annoying (and I had to disable suspend as well, since the machine locks up after waking).

Have you tried updating your motherboard firmware? That 100% fixed all these problems for me.

Since upgrading the mobo BIOS firmware it’s been rock solid for me also.

Yeah, I did update the BIOS on my Asrock SteelLegend B550; sadly it didn’t help.

@Przemas You have a Maxwell-generation GPU; setting clocks using -lgc is not supported on those. Also, those GPUs shouldn’t be affected by the bug in this thread. From your description (only happening on cold boot, gone on reboot after a while), this sounds more like failing capacitors on your GPU board.
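If the clock-locking workaround is scripted, the script can detect this unsupported case instead of failing silently. A minimal sketch in Python, assuming nvidia-smi prints the same “not supported” message quoted earlier in the thread; the function names and the returned status strings are made up for this example:

```python
import subprocess

# Assumption: this is the text nvidia-smi prints on GPUs without -lgc support,
# taken from the error message quoted in this thread.
UNSUPPORTED_MARKER = "Setting locked GPU clocks is not supported"


def lock_clock_command(min_mhz, max_mhz):
    """Build the nvidia-smi argv that locks GPU clocks to min_mhz..max_mhz."""
    if min_mhz > max_mhz:
        raise ValueError("min_mhz must not exceed max_mhz")
    return ["nvidia-smi", "-lgc", f"{min_mhz},{max_mhz}"]


def locking_unsupported(output):
    """True if the nvidia-smi output reports that clock locking is unsupported."""
    return UNSUPPORTED_MARKER in output


def try_lock_clocks(min_mhz, max_mhz):
    """Run nvidia-smi -lgc; return 'locked', 'unsupported' or 'failed'."""
    result = subprocess.run(
        lock_clock_command(min_mhz, max_mhz),
        capture_output=True, text=True, check=False,
    )
    if locking_unsupported(result.stdout + result.stderr):
        return "unsupported"
    return "locked" if result.returncode == 0 else "failed"
```

`try_lock_clocks(1392, 1392)` mirrors the command tried above and normally has to run as root. On GPUs where the lock does apply, `nvidia-smi -rgc` resets the clocks again.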

Ouch. Given the current GPU shortage (and thus the inability to get a new one at any sane price), this sounds pretty bad. I’ll switch to nouveau for a couple of days and see whether I get similar hangs with it as well. If not, that would make the capacitor issue less likely (at least I think so?).

With nouveau the GPU runs only at minimum clocks, so many issues don’t appear because the power draw stays at its minimum.
Put simply, nouveau runs even on many broken GPUs.

OK, so sadly the lockup does not happen with nouveau, thus I don’t have definite proof that it is a failing capacitor. Damn. More investigation coming.