Random Xid 61 and Xorg lock-up

I’m also getting Xid 61 on:

  • mobo: Asus x570 TUF Gaming Plus (WIFI) bios 1407
  • cpu: 3700x
  • gpu: msi rtx 2070 super
  • driver: 440.82
  • os: pop os 20.04 lts (aka ubuntu)
  • de: gnome
  • wm: xmonad (very lightweight)
  • running apps: alacritty (gpu accelerated terminal), chrome, emacs

I get this issue around twice a week in linux, but I don’t in windows 10 on the latest drivers there.

When this issue isn’t happening, nvidia-smi says that Xorg is using 1 GB of VRAM, which seems very excessive (I’m just using xmonad with a few terminals running). I don’t know if it’s related.

When the issue happens, like others have reported, nvidia-smi basically hangs and can’t query some things like temp and vram usage. Xorg is pegged at 100% cpu usage and chrome is also pegged. ssh works but I basically just have to reboot.

I have tried that out as well. Setting processor.max_cstate=1 does not help; the error still occurs from time to time.

I will be trying with a 2070 Super card today or this weekend on an MSI X570 board and will share my results early next week.
As we know, there is no single known way to trigger the Xid 61 issue. If anyone is able to reproduce it consistently with the same steps, please share them.

@amrits Is there some extra tool we can be given to help with this? I’m sure we have a couple of decent developers on here that could potentially do more to diagnose this if only told how.
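Until something official is provided, one thing anyone can do is collect the exact times of the Xid 61 events from the kernel log so they can be matched against clock changes or workload. A minimal Python sketch follows. It assumes the driver logs lines shaped like `NVRM: Xid (PCI:0000:0a:00): 61, pid=..., ...` (an illustrative line, not copied from any log in this thread), and the function name `find_xid_events` is made up for this example.

```python
import re

# Assumed shape of a driver Xid line in the kernel log, for example:
#   NVRM: Xid (PCI:0000:0a:00): 61, pid=1234, ...
# The trailing fields vary between driver versions, so only the PCI bus id
# and the Xid number are captured.
XID_RE = re.compile(r"NVRM: Xid \((PCI:[0-9A-Fa-f:.]+)\): (\d+)")


def find_xid_events(log_text, xid=61):
    """Return (line_number, bus_id, full_line) for every line reporting `xid`."""
    events = []
    for lineno, line in enumerate(log_text.splitlines(), start=1):
        match = XID_RE.search(line)
        if match and int(match.group(2)) == xid:
            events.append((lineno, match.group(1), line.strip()))
    return events
```

To use it, dump the kernel log to a file first (for example with `journalctl -k` or `dmesg`) and pass the file contents to `find_xid_events`. The timestamps in the returned lines are whatever the log source printed.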

Oh, a thought: I think this issue reproduces more often when running a compositor like compton (Picom on Arch Linux).

Hi all,
I am encountering the same problem.
I am working on a professional product and with the following configuration:

Server Mainboard: ASMB-815-00A1E
CPU: Intel Xeon 4109T Silver
GPU: Leadtek RTX2070
Ram: 16 GB
OS: Yocto Linux
nvidia driver: latest

After some time I get the xid 61 error and x-org consumes 100% CPU load so that the mouse etc gets laggy.
Only reboot of the systems helps. However ssh login is still possible.

Since I have more than 10 identical systems, I could do some testing.
I switched the graphics cards throughout the systems. I want to give you my results:

1.) Not all systems are affected. Some cards seem to work fine on some mainboards. On another mainboard, more than 5 cards showed the issue.
2.) I observed the moment of the crash. The memory clock (mclk) was toggling between high and low frequencies. The GPU clock (gclk) was also toggling.
3.) On an affected system I can reproduce the issue by setting the GPU clock to the lowest possible frequency. In my case that is 300 MHz. I use the command “nvidia-smi -lgc 300,300”. When playing around with the mouse, the issue occurs within minutes. However, when the memory clock is forced very high (7000 MHz), which happens in a part of the software where rendering is done, the fixed 300 MHz clock does not produce the issue.
Summary: The issue occurs with low GPU frequencies and toggling or low GPU memory clocks.
4.) I did the same in Windows 10, using Afterburner to set a 300 MHz GPU clock. The issue occurred very quickly, but Windows did not crash; there was a blink and everything was fine again (look up TDR, Timeout Detection and Recovery).
5.) My “feeling”: the issue might occur when the mainboard switches between PCIe 2.0 and PCIe 3.0. It must be something to do with power management and bad mainboard + graphics card tolerances.
6.) The issue never occurred when setting higher GPU frequencies (here between 1000 and 1620 MHz): “nvidia-smi -lgc 1000,1620”
7.) MaxPerformanceMode might be a solution, not sure.
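For anyone who wants to repeat the low-clock reproduction and the high-clock mitigation from the list above in a script, here is a small Python sketch. It only wraps the `nvidia-smi -lgc` command quoted above plus `-rgc` (reset GPU clocks) and `-i` (GPU index); the helper names are invented for this example, and the clock values are the ones from this post, which may not be valid on other cards.

```python
import subprocess


def lock_gpu_clocks_cmd(min_mhz, max_mhz, gpu_index=0):
    """Build the nvidia-smi argv that pins the GPU core clock range."""
    if min_mhz <= 0 or max_mhz < min_mhz:
        raise ValueError("expected 0 < min_mhz <= max_mhz")
    return ["nvidia-smi", "-i", str(gpu_index), "-lgc", f"{min_mhz},{max_mhz}"]


def reset_gpu_clocks_cmd(gpu_index=0):
    """Build the argv that removes a previous -lgc lock."""
    return ["nvidia-smi", "-i", str(gpu_index), "-rgc"]


def run(cmd):
    """Run one command (locking clocks needs root) and return its stdout."""
    result = subprocess.run(cmd, check=True, capture_output=True, text=True)
    return result.stdout
```

With these helpers, `run(lock_gpu_clocks_cmd(300, 300))` would correspond to the reproduction case, `run(lock_gpu_clocks_cmd(1000, 1620))` to the case where the issue never occurred, and `run(reset_gpu_clocks_cmd())` restores the default behaviour afterwards.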

Greetings,
Uli


Two days ago I installed driver 440.82. The system has been much more unstable since then. I can’t go a couple of hours without a reboot.

Happens 1-4 times in a 15h period (to me).
2/3 Displays (1 never tested)
Doesn’t matter which (Linux)OS or which applications are started.
Totally Random (sometimes after 30 min, sometimes after 14h).

It has to be an X570 chipset, a Ryzen CPU and an RTX 20xx SUPER graphics card, no matter which BIOS version or graphics card driver…

The system freezes and nothing is possible but a hard reset. The Num Lock LED takes about 30 seconds to react; the mouse pointer still moves but clicks do nothing, and the keyboard (e.g. shortcuts) doesn’t work. Background services keep working: if I wait 2 h before the hard reset, there are system logs for that time period, and ssh works. Only the desktop environment is affected.

occasionally (~1 day in 10) issue doesn’t occur. (same usage)

To reproduce, just take my setup, install Ubuntu/GNOME and let it sit for 24 hours; it will occur.
If not, I will provide the list of installed packages.


Ryzen 9 3900x
GeForce RTX 2070 SUPER
ROG STRIX X570-E GAMING
64 GB Kingston RAM

Ubuntu 20.04
Gnome 3.36.2

BIOS Information:
Vendor: American Megatrends Inc.
Version: 1201
BIOS Revision: 5.14
Release Date: 2019/10/07

I’m also suffering from this with a GTX 1660 Super, for the record (also AMD Ryzen on an X570 chipset).
Can’t confirm it’s more unstable with 440.82, no more, no less. As it is pretty random, difficult to say…

Just swapped in my old GTX 1060 to confirm it’s still stable with the GTX series.

I experienced both “nvlddmkm event 14” on Windows and “Xid 61” on Linux running OpenGL applications in a Dual Boot setup on my new machine. This rendered my PC unusable for a couple of months now.

After reading your hint, I went on and disabled SMT in my BIOS. I consistently haven’t experienced any issues since :)

While this is certainly not a permanent solution it still helps a lot and might provide some insight as to what causes the problem. Obviously Nvidia’s Drivers don’t play well with AMD’s Simultaneous Multithreading.

Same issue here with a 3700X / RTX 2060 SUPER / X570 MB setup with numerous freezes.

Today’s freeze was a little different because the desktop was still sort of functional but very slow. I noticed Chrome at 100%. After killing the process, it was X that went to 100%.

Maybe the capture of nvidia-smi is interesting:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.59       Driver Version: 440.59       CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce RTX 206...  Off  | 00000000:0A:00.0  On |                  N/A |
|ERR!   49C    P5   ERR! / 175W |    681MiB /  7979MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      1350      G   /usr/lib/xorg/Xorg                           540MiB |
|    0      2407      G   /usr/bin/nvidia-settings                      23MiB |
|    0      2750      G   cinnamon                                     113MiB |
+-----------------------------------------------------------------------------+
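The `ERR!` entries in the Fan and Pwr:Usage columns of the capture above are the visible sign that the driver can no longer read those sensors. As a rough sketch (not an official tool), the following Python picks the status row out of the default `nvidia-smi` table text and lists which of its fields show `ERR!`. The function names and the field labels (`fan`, `power_usage`, and so on) are made up for this example, and the parsing assumes the table layout shown in the capture.

```python
import re

# One status row looks like: "ERR!   49C    P5   ERR! / 175W"
# (fan, temperature, performance state, power usage / power cap).
STATUS_RE = re.compile(r"(\S+)\s+(\S+)\s+(P\d+)\s+(\S+)\s*/\s*(\S+)")
FIELDS = ("fan", "temp", "perf", "power_usage", "power_cap")


def status_rows(table_text):
    """Return one dict per GPU status row found in nvidia-smi table output."""
    rows = []
    for line in table_text.splitlines():
        cells = line.split("|")
        if len(cells) < 3:
            continue  # border line or text without table cells
        match = STATUS_RE.fullmatch(cells[1].strip())
        if match:
            rows.append(dict(zip(FIELDS, match.groups())))
    return rows


def failed_sensors(table_text):
    """For each GPU row, list the fields that nvidia-smi reported as ERR!."""
    return [
        [name for name, value in row.items() if value == "ERR!"]
        for row in status_rows(table_text)
    ]
```

Fed the capture above, `failed_sensors` should report the fan and power-usage fields for GPU 0. Keep in mind that during a full hang `nvidia-smi` itself may not return at all, so any script calling it should use a timeout.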

I’ve set SMT to disabled now as some people seem to be having luck with that.

Could you try to set MaxPerformanceMode or set your GPU clock to higher frequencies?
For example to minimum 1000MHz, max 2000MHz

nvidia-smi -lgc 1000,2000

Does the freeze still occur?


I have been running into this issue, but I’ve had success so far by adding this to my xfce startup

nvidia-settings -a "[gpu:0]/GpuPowerMizerMode=1"

I’m now testing out the lock gpu clocks option while letting powermizer go back to default

nvidia-smi -lgc 1000,2145

My system is a Ryzen 3900X on an ASRock X570 Taichi (latest BIOS) with an EVGA GeForce RTX 2060 SUPER running Ubuntu 20.04 and driver 440.64.


@Uli1234 this thread is about an incompatibility between AMD Ryzen 3xxx (Zen2) cpus and Turing gen GPUs. In this case you can even leave the system as is and just swap the cpu for a Ryzen 2xxx (Zen+) making the issue disappear.
Since you’re running an intel platform, I suspect you’re running into a different issue with the same symptoms.
Please open a new thread, run nvidia-bug-report.sh as root and attach the resulting nvidia-bug-report.log.gz file to your post. You will have to rename the file ending to something else since the forum software doesn’t accept .gz files (nifty!).

My system is still running without issue after several days using

sudo nvidia-smi -lgc 1000,2145
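To check that the lock is actually holding (for example after a reboot or a resume), the current clocks can be polled with `nvidia-smi --query-gpu`. The Python sketch below assumes the `clocks.gr` and `clocks.mem` query fields and the `csv,noheader,nounits` format, which print one `graphics, memory` line in MHz per GPU. The helper names and the 1000 MHz floor (taken from the command above) are choices made for this example.

```python
import subprocess

QUERY_CMD = [
    "nvidia-smi",
    "--query-gpu=clocks.gr,clocks.mem",
    "--format=csv,noheader,nounits",
]


def parse_clocks(csv_text):
    """Parse 'graphics, memory' MHz lines; unreadable values become None."""
    pairs = []
    for line in csv_text.strip().splitlines():
        values = []
        for field in line.split(",")[:2]:
            try:
                values.append(int(field.strip()))
            except ValueError:
                values.append(None)  # e.g. a non-numeric placeholder value
        pairs.append(tuple(values))
    return pairs


def gpus_below_floor(csv_text, floor_mhz=1000):
    """Return indices of GPUs whose core clock is unreadable or under the floor."""
    return [
        index
        for index, (graphics, _memory) in enumerate(parse_clocks(csv_text))
        if graphics is None or graphics < floor_mhz
    ]


def sample_clocks(timeout_s=10):
    """Query the driver once; the timeout guards against a hung nvidia-smi."""
    result = subprocess.run(
        QUERY_CMD, check=True, capture_output=True, text=True, timeout=timeout_s
    )
    return parse_clocks(result.stdout)
```

Running `sample_clocks()` from a timer every few seconds and logging the result would also show whether the clocks were toggling right before a freeze.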


@generix Maybe the issue is not CPU related but has something to do with a power management chip for PCIe on the mainboards. Since OldToby confirmed that locking the GPU frequencies to a higher idle level works, the issue might be the same.


@OldToby Thank you for the feedback! Still no freeze so far? Can anybody else confirm that this works?

This is a serious bug affecting my workflow. Total system freeze daily with Xorg running at 100% on a single CPU thread together with Xid 61 (ssh still works, and other background processes run fine).
My system details:
System: Host: Kernel: 5.6.12-1-MANJARO x86_64 bits: 64 Desktop: i3 4.18.1 Distro: Manjaro Linux
Machine: Type: Desktop Mobo: ASUSTeK model: ROG STRIX X570-F GAMING v: Rev X.0x serial:
UEFI: American Megatrends v: 1407 date: 04/02/2020
CPU: Topology: 16-Core (2-Die) model: AMD Ryzen 9 3950X bits: 64 type: MT MCP MCM L2 cache: 8192 KiB
Speed: 3014 MHz min/max: 2200/3500 MHz Core speeds (MHz): 1: 3014 2: 2153 3: 2081 4: 3920 5: 2145 6: 2100
7: 3400 8: 2146 9: 2432 10: 2118 11: 2268 12: 1860 13: 2054 14: 2706 15: 2060 16: 2059 17: 2160 18: 2097
19: 2154 20: 2099 21: 3610 22: 2090 23: 2147 24: 1954 25: 2827 26: 2170 27: 2636 28: 1902 29: 1882 30: 1882
31: 2098 32: 2094
Graphics: Device-1: NVIDIA TU104 [GeForce RTX 2080 SUPER] driver: nvidia v: 440.82
Device-2: NVIDIA TU104 [GeForce RTX 2080 SUPER] driver: nvidia v: 440.82
Display: x11 server: X.Org 1.20.8 driver: nvidia resolution: 2560x1440~60Hz

@OldToby Going to try your workaround. Will let you know within 2 days.
Update. Day 1 : No Issues

It seems we’re seeing this combination of X570, Ryzen 3xxx and RTX 2xxx quite a bit. It’s pretty random, sometimes 1-2 weeks can pass without issues and other times it’s every couple days.

One thing to note: after the latest chromium update, I don’t get a full freeze anymore, just a very slow system. Are you using chromium too or another browser?

I tried disabling SMT like some suggested but with no effect. OldToby’s workaround is our hope right now :-)

I’m at an uptime of 5 days now with this workaround. Fingers crossed…
