Random Xid 61 and Xorg lock-up

Hi jm4games,

I gained access to the MSI X570 system and am currently running the following setup.

Ubuntu 20.04
RTX 2070 Super
NV Driver 440.59
Ryzen 3700x

I’ve been running a few OpenGL demos simultaneously for a week now, but no luck in recreating the issue.
I will now try running compton as per your suggestion and post updated test results.

The nvidia-smi settings are lost after a reboot if you just type them into the console.

You can try forcing the GPU to its idle frequency. On my system the issue then appeared within minutes.

sudo nvidia-smi -lgc 300,300

The setting is lost after a reboot.
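To make the workaround survive reboots, one option is a small systemd oneshot unit. This is only a sketch: the unit name, the `/usr/bin/nvidia-smi` path, and the 1000,1815 clock range are my own choices, not from NVIDIA docs, so adjust them for your card.

```shell
# Sketch: install a oneshot unit that reapplies the clock lock at boot.
# Unit name, binary path, and clock values are assumptions -- adjust as needed.
sudo tee /etc/systemd/system/nvidia-lock-clocks.service >/dev/null <<'EOF'
[Unit]
Description=Lock NVIDIA GPU clocks (Xid 61 workaround)

[Service]
Type=oneshot
ExecStart=/usr/bin/nvidia-smi -pm ENABLED
ExecStart=/usr/bin/nvidia-smi -lgc 1000,1815

[Install]
WantedBy=multi-user.target
EOF
sudo systemctl daemon-reload
sudo systemctl enable --now nvidia-lock-clocks.service
```

`Type=oneshot` is used because the unit only needs to run the two commands once per boot; multiple `ExecStart=` lines are allowed for oneshot units.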

@elialbert: In my experience persistence mode turned on or off didn’t matter. I tried both variants. PM mode is useful if you have more than one GPU in your system.

I have observed the problem when running Electron apps and Firefox. I use Riot, MS Teams, VS Code, and Unity Hub. Maybe you can try running those and see if the issue pops up. I am also using a compton-based compositor (picom).

I’m reading this

Have you put the GPU into persistence mode to even get it to honor this min clock speed request?

@amrits I’m going to second @vinuvnair’s suggestion - this seems to trigger when crossing thresholds between clock speeds, so constantly running GPU-heavy processes is the opposite of the way to trigger it.
Try starting up some simple apps, then quitting them. I see it sometimes with zoom screensharing.

@OldToby I can confirm: no issues for 2 days since setting the frequency.

@vinuvnair: Thanks for the feedback. Could you please post here if the issue ever occurs again after setting the frequency?

I’m trying the frequency setting as well, hoping it fixes it.
A question: what exactly is persistence mode? Because when I set the smi setting I get the following message:

Gpu clocks set to "(gpuClkMin 1000, gpuClkMax 2145)" for GPU 00000000:07:00.0

Warning: persistence mode is disabled on device 00000000:07:00.0. See the Known Issues section of the nvidia-smi(1) man page for more information. Run with [--help | -h] switch to get more information on how to enable persistence mode.
All done.

@lencho: You can turn on persistence mode with nvidia-smi -pm ENABLED
If you have only one GPU and only one client using the GPU then it shouldn’t make any difference. But it also doesn’t hurt to turn it on.

Citation: “Persistence Mode is the term for a user-settable driver property that keeps a target GPU initialized even when no clients are connected to it. This solution is near end-of-life and will eventually be deprecated in favor of the Persistence Daemon.”

Did you encounter the freeze after setting the GPU minimum to 1000 MHz? If not, how long has your system been running?
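Since the citation above says legacy persistence mode will be deprecated in favor of the Persistence Daemon, the daemon route may be preferable where available. A sketch, assuming your distro packages the daemon under the unit name `nvidia-persistenced` (not guaranteed on every distro):

```shell
# Sketch: run the Persistence Daemon instead of legacy persistence mode.
# The unit name nvidia-persistenced is an assumption -- check your distro's packaging.
sudo systemctl enable --now nvidia-persistenced
systemctl status nvidia-persistenced --no-pager
```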

Just want to add myself to the long list of people with this issue. Approximately twice a day (every 10-12 hours) my system will grind to a halt and everything will be incredibly slow. The only fix is a hard shutdown of the computer.

Running dmesg immediately after the issue:
[562369.410754] NVRM: GPU at PCI:0000:0a:00: GPU-486c43d1-2076-60e5-d3b0-d9c7876281f5
[562369.410757] NVRM: GPU Board Serial Number:
[562369.410761] NVRM: Xid (PCI:0000:0a:00): 61, pid=852, 0cec(3098) 00000000 00000000

If I unplug the display cable while in this state I am not able to get a signal again.

AMD Ryzen 3950X
Asus Pro WS X570-ACE
Pop!_OS 20.04 LTS (5.4.0-7629-generic)
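For anyone checking whether they are hitting the same error, here is a quick sketch for pulling the Xid code out of NVRM kernel log lines like the ones above (the helper name `xid_code` is made up):

```shell
# Extracts the numeric Xid error code from NVRM kernel log lines on stdin.
# Live usage:  dmesg | xid_code     (or: journalctl -k | xid_code)
xid_code() {
  sed -n 's/.*NVRM: Xid ([^)]*): \([0-9]*\),.*/\1/p'
}
```

Fed the dmesg lines above, it should print `61`, the Xid this thread is about.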

Try out the fix to set the GPU frequencies:

1.) sudo nvidia-smi -pm ENABLED
2.) sudo nvidia-smi -lgc 1000,1815

1815 MHz is the official boost frequency for the RTX 2080 Super. You could also put in an even higher value like 2000. You don’t want the card to go to low frequency values. The settings are lost after a reboot if just typed into the console. Could you give feedback on whether the fix worked for you? Thanks.

I was able to gather some more information. In my case the freeze occurs when the GPU is in PCIe Gen2 mode and then switches to PCIe Gen3 with raised clocks.
Somehow some RTX-generation cards do not handle the switching of PCIe generations well. I guess the switch down to Gen2 is done to save energy. The fix of raising the idle frequency therefore prevents the card from dropping into PCIe Gen2 mode and forces it to stay in Gen3.


I have the same problem randomly when monitor is leaving power-save mode. Monitor starts receiving signal (based on monitor LED), but the screen remains black and I need to press the reset button. My configuration:

CPU: AMD Ryzen 7 3700X
OS: Arch Linux
GPU: GeForce GTX 1650 SUPER (NVidia driver 440.82)

Here is sample output from journalctl -b -1. The logs after Xid 61 differ from case to case.

Jun 06 22:39:22 interlace kernel: NVRM: GPU at PCI:0000:26:00: GPU-156cd20a-62f2-163c-f1c4-ab36b3027b6d
Jun 06 22:39:22 interlace kernel: NVRM: GPU Board Serial Number: 
Jun 06 22:39:22 interlace kernel: NVRM: Xid (PCI:0000:26:00): 61, pid=629, 0cec(3098) 00000000 00000000
Jun 06 22:39:39 interlace audit[2731]: ANOM_ABEND auid=1000 uid=1000 gid=1000 ses=3 pid=2731 comm="GpuWatchdog" exe="/usr/lib/chromium/chromium" sig=11 res=1
Jun 06 22:39:39 interlace kernel: GpuWatchdog[2755]: segfault at 0 ip 000056180e9fad33 sp 00007f82561ea510 error 6 in chromium[56180a7d5000+763d000]
Jun 06 22:39:39 interlace kernel: Code: 45 c0 48 39 c7 74 05 e8 ab 4c b3 fe c7 45 b0 aa aa aa aa 0f ae f0 41 8b 84 24 e8 00 00 00 89 45 b0 48 8d 7d b0 e8 fd 4d f9 fb <c7> 04 25 00 00 00 00 37 13 00 00 64 48 8b 04 25 28 00 00 00 48 3b
Jun 06 22:39:39 interlace kernel: audit: type=1701 audit(1591472379.611:317): auid=1000 uid=1000 gid=1000 ses=3 pid=2731 comm="GpuWatchdog" exe="/usr/lib/chromium/chromium" sig=11 res=1
Jun 06 22:39:39 interlace systemd[1]: Created slice system-systemd\x2dcoredump.slice.
Jun 06 22:39:39 interlace audit: BPF prog-id=19 op=LOAD
Jun 06 22:39:39 interlace kernel: audit: type=1334 audit(1591472379.624:318): prog-id=19 op=LOAD
Jun 06 22:39:39 interlace kernel: audit: type=1334 audit(1591472379.624:319): prog-id=20 op=LOAD
Jun 06 22:39:39 interlace audit: BPF prog-id=20 op=LOAD
Jun 06 22:39:39 interlace systemd[1]: Started Process Core Dump (PID 9376/UID 0).
Jun 06 22:39:39 interlace audit[1]: SERVICE_START pid=1 uid=0 auid=4294967295 ses=4294967295 msg='unit=systemd-coredump@0-9376-0 comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
Jun 06 22:39:39 interlace kernel: audit: type=1130 audit(1591472379.624:320): pid=1 uid=0 auid=4294967295 ses=4294967295 msg='unit=systemd-coredump@0-9376-0 comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
Jun 06 22:39:40 interlace systemd-coredump[9377]: Process 2731 (chromium) of user 1000 dumped core.

@OldToby How many days have you been running now without the freeze (with the clock fix applied)?
I am asking because I will sell my systems to customers and I want to be 100% sure that the freeze doesn’t occur anymore.

FWIW, I would probably wait until you hit 30 days of uptime to validate that the system is stable. I don’t think anyone on this thread has achieved that. I think I did 22 days once.

At the moment, I’m at almost 18 days. Given that I had freezes every 3-5 days or so at some point, I’m cautiously optimistic.

If this is indeed the cause, then I think it should be easier for the people at nvidia to reproduce and solve as part of the driver. Having to artificially keep the clock higher than it needs to be is a kludge.

Just in case it’s not obvious to someone: I followed the clock setup and put the commands into a systemd startup script, following the instructions here: https://linuxconfig.org/how-to-run-script-on-startup-on-ubuntu-20-04-focal-fossa-server-desktop
I do believe persistence mode may be necessary even with only one gpu.
And I am able to follow the clock speed by running
nvidia-smi dmon

The pclk value in the last column of the output never goes below 1005 MHz, which corresponds to the new setting.

I will continue to check that this is true and report back on uptime.
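Rather than eyeballing dmon output, the check can be scripted. A sketch (`check_pclk` is a made-up helper name, and it assumes the default dmon column layout where pclk is the last field):

```shell
# Prints a warning for every dmon sample whose core clock (last column)
# is below the given floor in MHz. Header lines starting with '#' are skipped.
check_pclk() {
  awk -v floor="$1" '$1 ~ /^[0-9]+$/ && $NF + 0 < floor { print "clock dropped: " $NF " MHz" }'
}

# Live usage:  nvidia-smi dmon | check_pclk 1005
```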

We haven’t been able to observe the problem internally, despite multiple attempts.
The priority remains for us to be able to observe the problem so that we can investigate. Minimization of the conditions to obtain a reliable reproduction would be very useful.
That Xid error happens when some part of the power management logic of the GPU encounters an unexpected situation (I cannot say more because I do not know/understand much more than this, being a userspace driver engineer). Therefore it seems unsurprising that locking GPU clocks and similar tricks might make the problem go away - but that doesn’t directly help us investigate.

I don’t know if it helps, but when I encounter this issue the nvidia kernel driver’s irq thread gets “stuck” and eats up an entire core.

I previously posted my nvidia-bug-report log in another thread, but here are my workstation specs.

My workstation:
Motherboard: X570 AORUS PRO WIFI (Gigabyte)
CPU: AMD R9 3700x
GPU: Gigabyte RTX 2060 Super Windforce OC 8GB
RAM: 64GB DDR4 3200 MHz (4x16GB) (Corsair Vengeance LPX 32GB kit * 2)
Storage: Sabrent Rocket 4.0 1TB NVMe PCIe 4.0, Samsung 970 EVO Plus 500GB (boot, root, and home w/ LVM), Samsung 970 EVO 500GB (Windows drive), Samsung 860 EVO 1TB (bulk/overflow/scratch storage), “big” NFS share mounted over 1 Gb Ethernet.

@ahuillet It can take hours or days for the problem to occur for me, so it’s not something you can find by just turning the machine on. I think someone there will need to use the machine regularly for normal tasks. In particular, try using a composited WM, run some other 3D tasks (gaming or rendering, etc.), and maybe run nvidia-smi regularly (every second or two?) to fetch GPU usage and temperature.

The past few days it happened to me twice. All I did was have my KDE desktop up (with compositing off, IIRC) and some apps, including but not limited to: several Chrome windows with tons of tabs (hw acceleration enabled), Conky (system monitoring app), Kmail, Konversation, Discord, Riot, Steam (once, but not both times), Slack, IntelliJ IDEA Community, JetBrains Toolbox, KDE’s Kate and KWrite editors, Docker, libvirt’s virt-manager (no local VMs), Bluetooth stuff, the Mullvad VPN daemon (but not connected), WireGuard VPN (work-only private, auto-connected via systemd’s networking config), and TeamViewer’s daemon (not connected).

Did anyone ever try @Uli1234’s repro steps, i.e. lock the GPU to minimum clocks (nvidia-smi -lgc 300,300), let it idle for a minute, then wiggle the mouse for a minute (at least that’s how I understand it)?

I’ve been having this issue as well. I have tried setting sudo nvidia-smi -pm ENABLED; sudo nvidia-smi -lgc 1000,2115; but it locked up again after about 3 days.

My system:

OS: Arch Linux
Motherboard: ASRock Fatal1ty B450 Gaming-ITX/ac
CPU: AMD Ryzen 3600
GPU: Gigabyte RTX 2070 Super Windforce OC 3X 8GB
RAM: 32GB @ 3200MHz

@han310 Could you try setting Max Performance mode? Maybe that mode does more in the background than just raising the GPU clock.

sudo DISPLAY=:0 nvidia-settings -a "[gpu:0]/GpuPowerMizerMode=1"

The following modes are available:

GpuPowerMizerMode=2 -> Auto
GpuPowerMizerMode=1 -> Prefer Maximum Performance
GpuPowerMizerMode=0 -> Adaptive

The default mode is Auto.
It would be nice if you could give feedback on whether that worked for you.

And again, just to be sure: you didn’t reboot your system at some point after setting the clock frequencies?