Random Xid 61 and Xorg lock-up

Hi tdb,

It would be great to have remote access to your system, as I am still not able to replicate the issue locally.

All right, I will set it up and let you know the next time it happens. Is a private message here a good way to communicate the details or would you prefer email?

Yes, we can communicate via private message.

FYI, I had another repro of this, this time after only 4 days. Not sure if it helps, but Xorg and Chrome go spinning at 100% CPU when this occurs. If I restart the lightdm service, I get warnings in the Linux journal saying that nvidia-modeset lost display notifications for GPU:0.
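For anyone who wants to timestamp each occurrence after a freeze, here is a rough Python sketch that filters kernel log lines for Xid 61 messages. It assumes the driver logs lines of the form `NVRM: Xid (PCI:0000:01:00): 61, ...` (with the `PCI:` prefix possibly absent on some driver versions); the names `find_xid_events` and `XidEvent` are made up for illustration, so adjust the pattern if your log lines look different.

```python
import re
from typing import Iterable, Iterator, NamedTuple

# Assumed log format (check against your own journal):
#   NVRM: Xid (PCI:0000:01:00): 61, ...
# The "PCI:" prefix is treated as optional.
XID_RE = re.compile(
    r"NVRM: Xid \((?:PCI:)?(?P<bus>[0-9a-fA-F:.]+)\): (?P<xid>\d+)"
)


class XidEvent(NamedTuple):
    """One Xid message found in the kernel log (hypothetical helper type)."""
    bus: str   # PCI bus address of the GPU that reported the Xid
    xid: int   # numeric Xid code, e.g. 61
    line: str  # full original log line, including its timestamp


def find_xid_events(lines: Iterable[str], wanted: int = 61) -> Iterator[XidEvent]:
    """Yield an XidEvent for every log line reporting the wanted Xid code."""
    for line in lines:
        match = XID_RE.search(line)
        if match and int(match.group("xid")) == wanted:
            yield XidEvent(match.group("bus"), int(match.group("xid")), line.rstrip("\n"))
```

To use it, feed it the kernel messages, for example by passing `sys.stdin` to `find_xid_events` and piping in the output of `journalctl -k --no-pager`. Each yielded `line` keeps the journal timestamp, which makes it easier to compare uptime between repros.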

Hi All,

Sysadmin at a small 3D animation studio here. We recently purchased 30 new workstations with these specs.

System details:
Asus Prime X570-P
Ryzen 3900X 3.8GHz
Asus GeForce RTX 2070 Super
64GB DDR4 (4x16GB)
CentOS 7.5.1804
Kernel 5.4.0
Nvidia Driver 430.26
LightDM + Mate Desktop

The issues we’ve been getting are exactly the same as the ones mentioned in this post (station freezes / Xid 61 lockup). The issue happens randomly. There doesn’t seem to be a correlation between heavy GPU/CPU usage and the issue occurring; often our users’ stations would freeze while they were doing basic email or browsing in Chrome. Do note that we have one of those workstations set up as a headless server (runlevel 3) and it hasn’t frozen at all. Since it’s not interacting much with the GPU, I guess it’s just not triggering a freeze. We can only reproduce the issue with stations in runlevel 5 plus user interaction.

The troubleshooting steps we took:

  • We tried playing with the BIOS settings and activated/deactivated pretty much every possible feature of the ASUS board - no luck. We also played heavily with the board’s PCIe settings - no luck there either.
  • We tried updating the BIOS to the latest version - issue persists.
  • We were initially running CentOS 7.5.xxx with the standard kernel 3.10.xxx. We tried kernel 4.xxx, issue persists; tried kernel 5.10.xx, issue persists.
  • We were initially running NVIDIA driver version 410.xxx (as reported by nvidia-smi) and worked our way up incrementally to 430.26. The issue was reproduced regardless of driver version.
  • We tried different GPUs: RTX 2070 Super, RTX 2070, Quadro RTX 4000 - they all froze (all Turing architecture).
  • Once we tried older cards (GTX 1080, GTX 1060, GTX 980, GTX 780; Pascal, Maxwell and Kepler architectures), we stopped getting the issue. Everything ran smoothly and the freezing was gone.
  • We ordered previous-generation boards (MSI B450-Pro), updated the BIOS to make them compatible with the new Ryzens, and put the 2070 Supers back in. All issues resolved. We haven’t gotten a freeze in 2 weeks on those 2 B450 stations (regardless of kernel version; with a B450 it works fine even on kernel 3.10.xxx).

With this kind of isolation done, it’s pretty safe to say the X570 boards used with Turing cards are the cause of the issue. Since we have the budget, we went ahead and replaced all our X570 boards with MSI B450-Pros.

The uncertainties:

  • Not sure if changing to a different brand than ASUS (MSI, Gigabyte) but staying with the X570 generation would have worked.
  • Not sure if ASUS 450 or 470 boards are also affected.

It may not be an option for some of you, as changing boards can be costly, but at least know that it is a working solution. We’re happy to share more or answer questions if you have any.

sysadminfm9rx: I’m experiencing this issue on a ROG Strix B450-F board with a Ryzen 3900X and an RTX 2080 Ti.

@collinvandyck: The Xid 61 freeze? With a B450 board? If so, maybe it’s an ASUS thing on Gen 4 and 5 with Turing cards :( Our MSI B450s have been rock solid.

Yup, the same Xid 61 freeze with a B450.

In my last post I changed the power settings to maximum performance. I hit Xid 61 after only a few hours (I’ve been on vacation for the last two weeks). It looked like Chrome triggered it this time.

I’m out of ideas to try, so I’ve essentially “disabled” hardware acceleration by starting a vncserver on a virtual display and running only vncviewer on the physical desktop. If this also triggers the issue, I will report back here.

Another repro of this just occurred for me. This time I barely reached an uptime of 1 day. So what else can I provide to help diagnose this? I can grab a stack trace of Xorg/lightdm if it would help, and try to see where things are spinning. Or should we try pulling in some other vendor (though that is probably a bit premature)?

Also, not sure if it helps, but I am running a 3-monitor setup (all DP), and one of the monitors is not recognized on system reboot (Linux only; it works fine with Windows). I have to restart lightdm to get it to work. Locking the system (with lightdm) also causes the monitor to be lost. Is anyone experiencing this using only one monitor? It seems like this could be an issue that is more reproducible with 3 monitors (all DP).

@jm4games: On our end with the Asus X570s, the issue was reproducible with single and dual monitor setups (we didn’t try triple). The number of displays or the port type (HDMI, DVI, DP) didn’t seem to matter.

@sysadminfm9rx, that’s good to know, but also unfortunate, since it just makes this harder to track down :(

Maybe we should sticky a list of known hardware configurations where this reproduces.

@jm4games I had a similar issue with one of my DP monitors (I have two) not being recognized after Xid 61. Try powering everything down (pull the plug if some devices don’t have a hard power switch) and then back on. Might be that doing it for the monitor only is sufficient.

Looks like everyone affected has:

a Ryzen 3000 CPU
an NVIDIA RTX/GTX 16XX GPU
an X570 motherboard from either ASUS (most commonly) or Gigabyte (less common)

This should give NVIDIA a hint of what might be amiss.

I’ve also experienced this issue but only after enabling a non-standard nvidia kernel driver option, so my case is different I guess.

birdie: the problem also exists on B450 motherboards.

To summarize affected configurations so far:

CPU Arch: Ryzen 3000 series (Zen 2)
OS: Arch Linux, Debian 10, Ubuntu 18
Linux Kernels: 4.+, 5.+
Nvidia Drivers: 43*.+, 440.+
GPU: RTX Series, GTX 16**
Repro Time: 1hr ~ 28 days
Displays: 1+
Display Technology: DP, HDMI (seems more prominent on DP)
Mobo Chipset: X570, B450M
Mobo Vendors: Asus, Gigabyte, MSI
Xorg Ver: 1.20.+
PCI-E: x16, x8

NOTE: I have seen at least 3 confirmations that the GTX 1060 does not repro this issue (myself being one of them). We suspect this issue is Turing architecture specific.

Please correct/suggest more for this summary.
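To make the GPU line of the summary easy to check against new reports, here is a small Python sketch. It only encodes the "RTX Series, GTX 16**" rule exactly as listed above, as a name-based heuristic; the function name `matches_summary_gpu` is made up for illustration, and this is not an authoritative architecture lookup.

```python
import re

# Name-based heuristic mirroring the summary's GPU line:
# any "RTX" card, or a "GTX 16xx" card. This is an assumption taken
# from the summary, not a real architecture database.
_SUMMARY_GPU = re.compile(r"\bRTX\b|\bGTX\s*16\d\d", re.IGNORECASE)


def matches_summary_gpu(gpu_name: str) -> bool:
    """Return True if the GPU name falls under 'RTX Series, GTX 16**'."""
    return bool(_SUMMARY_GPU.search(gpu_name))


# GPU names mentioned in reports earlier in this thread.
for name in ["Asus Geforce RTX 2070 Super", "RTX 2080 ti", "GTX 1060", "GTX 980"]:
    print(name, "->", matches_summary_gpu(name))
```

If the summary changes (for example, if a non-Turing card is confirmed to repro), only the pattern needs updating.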

You can exclude WM and DM as they are unlikely to have any effect.

There’s no “GTX 2** Super” - you can just say RTX cards ;-) Super are just the same cards with faster VRAM.

I haven’t seen reports from MSI and some people claim MSI motherboards are definitely not affected ;-)

I have the same problem.

My System:
Linux Debian 5.3.0-3-amd64
GTX 1650
i7-3770
Mainboard: MSI B75MA-P45 (Intel® B75)
driver version 440.44
KDE

Mainly appears while gaming.

I had a GTX 1050 without any problems until I got the 1650.

I have some additional information: it mainly happens with specific content. I play Dota, and in some games I have no problems at all, but others trigger this issue quickly and I often have to reboot. I think it depends on certain skins or heroes or something. When I watch the replay of a game where the issue occurred, I get the same problem again (so it is reproducible). It feels like the problem can be triggered by particular reflections, but lowering the video settings and disabling effects has not helped much so far. Additionally, it builds up: the game starts to get laggy and, if I have bad luck, it continues until it crashes. In other cases it does not crash. It is more or less like some sort of memory leak. Killing the game does not make the system usable again, and the context switch leaves the system completely unusable. If I can issue a reboot command, the system hangs for a while but reboots in most cases; after the reboot the system acts normally again.