Random Xid 61 and Xorg lock-up

Hello. Same issue here on an HP Omen 15-en0004AX.
Spec: Ryzen 7 4800H, 2x 512 GB NVMe drives, Nvidia GTX 1650 Ti, 16 GB RAM
OS: Arch, kernel 5.8
Nvidia driver: 450.66
Session: KDE Plasma on Xorg
None of the solutions has worked so far. I get this error both when idling and during regular use (more frequently when idling). CPU usage may suddenly go to 100 percent and the mouse keeps working for a few seconds, but none of the commands or logging mentioned here work in this state. After that the entire system crashes (no SSH, nothing) and I need to restart to recover.
The frequency hack mentioned here can delay it somewhat, but it is still very frequent for me: the longest stretch I have been able to use the desktop is about 2:30 hours.

Some logs from crash captured using kdump

There’s nothing in the release notes which could indicate that the bug has been solved but I’ll try anyways. Thank you.

Again, unlike most people here I have a motherboard based on the X570 chipset and my GPU runs via a PCI-E 3.0 interface.

Are the system crashes still being looked into? I keep getting xid 79 “GPU has fallen off the bus” now, usually after messages like:

pcieport 0000:00:03.1: AER: Multiple Corrected error received: 0000:00:00.0
pcieport 0000:00:03.1: AER: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Receiver ID)
pcieport 0000:00:03.1: AER: device [1022:1453] error status/mask=00000040/00006000
pcieport 0000:00:03.1: AER: [ 6] BadTLP

I’m on first-gen Ryzen with an X370 Taichi.
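
For anyone who wants to tally these corrected errors instead of eyeballing the log, here is a minimal sketch. It assumes the kernel prints the AER detail lines in the format quoted above (`pcieport <address>: AER: [ n] <name>`); that format differs between kernel versions, so the pattern may need adjusting. The function name `aer_error_types` is made up for illustration.

```shell
#!/bin/sh
# Sketch: count AER error types per PCIe port from kernel log lines on stdin.
# Assumes detail lines look like: "pcieport 0000:00:03.1: AER: [ 6] BadTLP".
# Prints one line per (port, error type): "<port> <type> <count>".
aer_error_types() {
    sed -n 's/.*pcieport \([0-9a-fA-F:.]*\): AER: *\[ *[0-9]*] *\([A-Za-z]*\).*/\1 \2/p' |
        sort | uniq -c | awk '{print $2, $3, $1}'
}

# Typical use (not run here):
#   journalctl -k | aer_error_types
#   dmesg | aer_error_types
```

Fed the four log lines quoted above, it prints `0000:00:03.1 BadTLP 1`, since only the last of them is a detail line.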

Hello Amrits,
Thanks for the 450.66 update. I have not encountered the xid61 error since I updated to this driver version 3-4 days ago.
I am on Mobo: ASUSTeK, model: ROG STRIX X570-F GAMING
AMD Ryzen 9 3950X
GeForce RTX 2080 SUPER
Regards

Just as an update: I have not had a single Xid 61 lockup on Ubuntu with the clock hack we settled on since I wrote about it in message 214 on June 9.
I just installed the newer driver (450.66) and took off the hack. I can confirm that the GPU clock now idles at much more sane speeds, which is nice to be honest. I will report back on whether the new driver has fixed anything.

I also noticed from here:
https://www.nvidia.com/download/driverResults.aspx/163238/en-us
that the additional information tab mentions some other workarounds for bugs I hadn’t heard of:


- Disable flipping in nvidia-settings (uncheck "Allow Flipping" in the "OpenGL Settings" panel)
- Disable UBB (run 'nvidia-xconfig --no-ubb')
- Use a composited desktop

Are these relevant at all?

Update
Changing to the xanmod kernel and updating GRUB with the kernel parameter “pci=nommconf” fixed this. I don't think xanmod itself has any effect, though.
For me the PCIe bus fails with this issue, rendering my system unusable (no SSH, display, sound, hard disk mounts, Wi-Fi, etc.). It seems that every device on the PCIe bus goes missing when this happens, but enabling “pci=nommconf” has done something which more or less fixes it. Does anybody have any idea why this setting fixes it for me?
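
To check that the parameter actually reached the running kernel after regenerating the GRUB config, a small sketch along these lines may help. The helper name `has_kernel_param` is hypothetical, and the GRUB file locations in the comments are the usual ones on Arch and are assumptions for other distros. According to the kernel parameter documentation, `pci=nommconf` disables the use of MMCONFIG for PCI configuration space access.

```shell
#!/bin/sh
# Sketch: test whether a kernel parameter appears as a whole word in a
# kernel command line string.
# Usage: has_kernel_param PARAM "CMDLINE"   (exit status 0 if present)
has_kernel_param() {
    param=$1
    cmdline=$2
    # Pad with spaces so only whole, space-separated words match.
    case " $cmdline " in
        *" $param "*) return 0 ;;
        *) return 1 ;;
    esac
}

# Typical use on a running system (not run here):
#   has_kernel_param pci=nommconf "$(cat /proc/cmdline)" && echo "active"
#
# Setting it with GRUB usually means (paths are assumptions, check your distro):
#   1. append pci=nommconf to GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub
#   2. regenerate the config: grub-mkconfig -o /boot/grub/grub.cfg
#   3. reboot
```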

Hi elialbert,
thanks for coming back and reporting! I also installed 450.66 now and disabled our clock-fix workaround. I will let it run now for some time to see if the Xid-61 occurs again.

Thanks vinuvnair and elialbert for the update.

@Uli1234
Will await your test results, thanks.

I am having the same issue on a B450 board too. I was having 6 to 8 crashes every day. No issues on the other OS. For me this:

nvidia-smi -pm ENABLED; nvidia-smi -lgc 1000,1815

made it so I did not have a crash in the last 3 days. Thank you very, very much… It was beginning to drive me crazy.

OS: Pop!_OS 20.04 LTS
MB: B450 GAMING PRO CARBON AC
CPU: R5 3600
GPU: RTX 2080 FE
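
Since the clock lock does not survive a reboot, some people put it in a small script. Below is a minimal sketch of that idea, not an official method: the function names and the `DRY_RUN` switch are made up for illustration, and the right MIN/MAX values depend on the card (1000,1815 is just what worked on the system above). It only uses the `-pm`, `-lgc` and `-rgc` options of nvidia-smi.

```shell
#!/bin/sh
# Sketch: wrap the clock-lock workaround so it can be previewed first.
# With DRY_RUN=1 the nvidia-smi commands are printed instead of executed.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# lock_gpu_clocks MIN_MHZ MAX_MHZ
# Enables persistence mode, then locks the graphics clock range.
lock_gpu_clocks() {
    for v in "$1" "$2"; do
        case "$v" in
            ''|*[!0-9]*)
                echo "usage: lock_gpu_clocks MIN_MHZ MAX_MHZ" >&2
                return 2 ;;
        esac
    done
    if [ "$1" -gt "$2" ]; then
        echo "MIN_MHZ must not exceed MAX_MHZ" >&2
        return 2
    fi
    run nvidia-smi -pm ENABLED &&
        run nvidia-smi -lgc "$1,$2"
}

# unlock_gpu_clocks: undo the lock (reset graphics clocks to default).
unlock_gpu_clocks() {
    run nvidia-smi -rgc
}

# Typical use as root (not run here):
#   lock_gpu_clocks 1000 1815
```

Running it once with `DRY_RUN=1` shows the exact commands before anything touches the GPU.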

Any confirmation on whether this “fix” was pushed to the Windows driver?

I have not had the issue myself on Windows since locking the card at Prefer Maximum Performance, but that also locks my card at maximum frequency 24/7, which also means constant heat.

I’ve updated to 450.66, and I’m still experiencing the Xid 61 error.

Hardware:

  • MB: Asus ROG STRIX X570-E Gaming (latest BIOS)
  • GPU: Asus Phoenix GeForce GTX 1650 Super
  • CPU: Ryzen 3900XT
  • OS: Fedora 32 (latest updates)

My situation is a little different than others.

After installing the proprietary driver (450.66) and with no other customization, it would predictably hang within a few minutes of starting the system. Every time. When hanging, the mouse cursor would move, but nothing else changed on the screen, Xid 61 was visible in the dmesg output, and nvidia-smi would show the ERR output.

With the nouveau driver, I didn’t have any issues (but I only used it for a few hours).

After applying the suggested fixes, nvidia-smi -pm ENABLED and nvidia-smi -lgc 1000,1740, stability improved dramatically, but it still froze again in less than 24 hours.

At this point, I’ve locked the frequency with nvidia-smi -lgc 1740,1740, and I’ve made the suggested PowerMizer “Preferred Mode” change.

I have a couple hours of uptime since my last hang, and I have my fingers crossed… (I will report back.)

(And a huge thanks to all those who’ve helped with the workarounds! Without you, I wouldn’t have been able to type this message without experiencing a crash.)

I can confirm that after upgrading to the 450.66 driver I have not experienced an Xid 61 error (uptime about 4-5 days now). I am still locking the GPU frequencies to (700,1680) to keep it out of the lowest power state (P8, I believe); if I need to reboot again soon I will leave the frequencies alone and see what happens.

Previously I would encounter 1-2 Xid 61 errors per day, requiring 1-2 reboots to free up the graphics system. Whatever the change in the driver, that seems to at least be preventing the error from occurring in my system.

System:

  • ASRock X470 Gaming-ITX/ac with 32GB of memory (no overclock or XMP)
  • Ryzen 3700X (hyperthreading enabled, no overclock)
  • EVGA RTX 2060KO
  • Ubuntu 18.04 and Ubuntu 20.04
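
To see whether a locked clock range really keeps the card out of P8, the performance state can be read back from nvidia-smi. The following is a rough sketch: the helper name `gpus_in_pstate` is hypothetical, and it assumes the `csv,noheader` output has the form `0, P8, 300 MHz` (index, state, graphics clock), which is worth confirming on your driver version.

```shell
#!/bin/sh
# Sketch: list the indices of GPUs that are currently in a given P-state.
# Reads "index, pstate, clock" CSV lines on stdin, one line per GPU.
# Usage: gpus_in_pstate P8
gpus_in_pstate() {
    awk -F', ' -v want="$1" '$2 == want { print $1 }'
}

# Typical use (not run here):
#   nvidia-smi --query-gpu=index,pstate,clocks.gr --format=csv,noheader \
#       | gpus_in_pstate P8
# No output means no GPU is sitting in P8 at that moment.
```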

Thanks @amrits, @Uli1234 for your detective work.

@amrits, if possible for you to share, I’d be very interested to hear (and others probably would be too) how this was fixed in the driver.

… and, at just over 2.3 hours of uptime, I’ve had it hang again. I had walked away from the machine (it was idle), and when I returned, I found it frozen. I was still able to connect to it via SSH, confirm the Xid 61, and see the ERR in the nvidia-smi output.

Sep 17 16:04:57 ryzen kernel: NVRM: GPU at PCI:0000:0a:00: GPU-56625a86-54d2-7b0d-a55c-ac9736570e41
Sep 17 16:04:57 ryzen kernel: NVRM: GPU Board Serial Number: 
Sep 17 16:04:57 ryzen kernel: NVRM: Xid (PCI:0000:0a:00): 61, pid=2154, 0d02(31c4) 00000000 00000000
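
When the machine is still reachable over SSH, the Xid code can be pulled out of the kernel log automatically. Here is a minimal sketch, assuming the NVRM line format shown above; `xid_events` is a made-up name.

```shell
#!/bin/sh
# Sketch: print "<pci-address> <xid-code>" for every NVRM Xid line on stdin.
# Assumes lines like:
#   kernel: NVRM: Xid (PCI:0000:0a:00): 61, pid=2154, ...
xid_events() {
    sed -n 's/.*NVRM: Xid (PCI:\([0-9a-fA-F:.]*\)): \([0-9][0-9]*\),.*/\1 \2/p'
}

# Typical use (not run here):
#   journalctl -k | xid_events
#   dmesg | xid_events | awk '{print $2}' | sort | uniq -c   # count per Xid code
```

For the log excerpt above this prints `0000:0a:00 61`.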

I’m currently investigating a theory that my Xid 61 errors are related to my motherboard’s SOC voltage setting. The voltage was automatically increased when I selected the DOCP for my 3600 MHz RAM. I’ve now overridden the SOC voltage to a lower value, and I’m seeing significantly improved stability. I’ll follow up in a few days…

Yep, that’s what I said here. The popular 3600 RAM frequency on Ryzen is a contributor. After 450, I also disabled the SMI fix, and so far I live an Xid-free life.

@services_nvidia1, you were certainly on to something with the RAM speed! :) Before I first posted the other day, I had read your post along with all the others in this thread. When I now look back at your post, I’m sorry I didn’t give enough credit to its second half, i.e., “AND lowering main memory speed.” I had focused on the “nvidia-smi” patch.

After lowering the SOC voltage (and not applying any other workarounds), my system, which would previously consistently encounter an Xid 61 error within a couple minutes of booting, has now been running for 11 hours. In the kernel logs, I should also note my CPU was reporting an error every few hours, e.g.,

Sep 20 15:58:40 kernel: mce: [Hardware Error]: Machine check events logged
Sep 20 15:58:40 kernel: [Hardware Error]: Corrected error, no action required.
Sep 20 15:58:40 kernel: [Hardware Error]: CPU:0 (17:71:0) MC27_STATUS[-|CE|MiscV|-|-|-|SyndV|-|-|-]: 0x982000000002080b
Sep 20 15:58:40 kernel: [Hardware Error]: IPID: 0x0001002e00000500, Syndrome: 0x000000005a020001
Sep 20 15:58:40 kernel: [Hardware Error]: Power, Interrupts, etc. Ext. Error Code: 2, Link Error.
Sep 20 15:58:40 kernel: [Hardware Error]: cache level: L3/GEN, mem/io: IO, mem-tx: GEN, part-proc: SRC (no timeout)

Those errors have stopped, too.

Some concrete numbers on the SOC voltages (“VDDCR SOC Voltage”):

  • default (without selecting DOCP): 1.025 V
  • DOCP value: 1.1 V (encountered Xid 61 errors)
  • manual override: 1.04375 V (stable, so far)

Congrats! By the way, 3466 sorted out the clocks for me and also gave me higher graphics performance in games; good sources tested it as having 2x higher minimum fps (which matters for stutters) vs 3600. Is it true? Well, at least it helps stability. I blabbed about the divider here.

Now that the Nvidia issuefest is shrinking and issues leave the party as time goes on, I found 2 more solutions:
[Status][Priority] Issue: Notes

  • [OK][High] Freezefest with Xid61: fix is to reduce clock or limit min PCI mode with nvidia-smi
  • [OK][High] Black screen after login or resume: can be fixed with the nvidia-settings advanced resolution settings, trying combinations, and syncing them with the desktop manager.
  • [KO][Medium] Some windows empty after resume: cannot be fixed, you need to maximize the window to refresh it each time
  • [KO][Low] Waylandfreeze: have to wait a few more years, but it’s ok, stay with X11 and wait
  • [OK][High] Framerate drop in Firefox after resume: this was top priority as scrolling was poor, fix is to set layout.frame_rate to your target fps

Remaining issues were reproduced on 10 distros, driver 440,450,455, kernel 5.4,5.5,5.6,5.7,5.8. No, there’s no lucky distro #Distrofest :)

The update to 450.66 came via Ubuntu 3 days ago.
3 days went well, then:

System freeze again. This time not Xid 61, but I had to hard reset the whole system.

Sep 25 22:40:35 hostname kernel: NVRM: GPU at PCI:0000:0b:00: GPU-cc6e3660-8db4-9431-a0ae-3355f10ac91b
Sep 25 22:40:35 hostname kernel: NVRM: GPU Board Serial Number:
Sep 25 22:40:35 hostname kernel: NVRM: Xid (PCI:0000:0b:00): 8, pid=2629, Channel 00000020

OS: Ubuntu 20.04.1 LTS x86_64
Kernel: 5.4.0-48-generic
AMD Ryzen 9 3900X
Asus ROG STRIX X570-E GAMING
Asus RTX 2070S ROG STRIX
NVIDIA Driver Version: 450.66

Going back to: sudo nvidia-smi -pm ENABLED; sudo nvidia-smi -lgc 1000,2000;

@thecakemaster From your system specs, we have the same MB and nearly the same processor. I have a Ryzen 9 3900XT. I’m curious whether you’ve looked at the VDDCR SOC voltage in the BIOS?

I did not.