Random Xid 61 and Xorg lock-up

@OldToby I confirm: no issues in the 2 days since setting the frequency.

@vinuvnair: Thanks for the feedback. Could you please post here if the issue ever occurs again after setting the frequency?

Hi,
I’m trying the frequency setting as well, hoping it fixes it.
A question: what exactly is persistence mode? Because when I apply the nvidia-smi setting I get the following message:

Gpu clocks set to "(gpuClkMin 1000, gpuClkMax 2145)" for GPU 00000000:07:00.0

Warning: persistence mode is disabled on device 00000000:07:00.0. See the Known Issues section of the nvidia-smi(1) man page for more information. Run with [--help | -h] switch to get more information on how to enable persistence mode.
All done.

@lencho: You can turn on persistence mode with nvidia-smi -pm ENABLED
If you have only one GPU and only one client using the GPU then it shouldn’t make any difference. But it also doesn’t hurt to turn it on.

Citation: Persistence Mode is the term for a user-settable driver property that keeps a target GPU initialized even when no clients are connected to it. This solution is near end-of-life and will eventually be deprecated in favor of the Persistence Daemon.
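
If you want to verify the current state, something like the following should work (the -pm call needs root; the query line is just one way to check, using the persistence_mode field from nvidia-smi --help-query-gpu):

sudo nvidia-smi -pm ENABLED
nvidia-smi --query-gpu=persistence_mode --format=csv

On newer driver setups the recommended route is the nvidia-persistenced daemon mentioned in the citation, which is usually shipped as a systemd service by the distro packages.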

Did you encounter the freeze after setting the GPU min to 1000 MHz? If not, how long has your system been running so far?

Just want to add myself to the long list of people with this issue. Approximately twice a day (every 10-12 hours) my system will grind to a halt and everything will be incredibly slow. The only fix is a hard shutdown of the computer.

Running dmesg immediately after the issue:
[562369.410754] NVRM: GPU at PCI:0000:0a:00: GPU-486c43d1-2076-60e5-d3b0-d9c7876281f5
[562369.410757] NVRM: GPU Board Serial Number:
[562369.410761] NVRM: Xid (PCI:0000:0a:00): 61, pid=852, 0cec(3098) 00000000 00000000

If I unplug the display cable while in this state I am not able to get a signal again.

System:
RTX 2080 SUPER
AMD Ryzen 3950X
Asus Pro WS X570-ACE
Pop!_OS 20.04 LTS (5.4.0-7629-generic)

@Polesch
Try out the fix to set the GPU frequencies:

1.) sudo nvidia-smi -pm ENABLED
2.) sudo nvidia-smi -lgc 1000,1815

1815 MHz is the official boost clock for the RTX 2080 SUPER; you could also put in an even higher value like 2000. The point is to keep the card from dropping to low clock frequencies. Note that the settings are lost after a reboot if they are just typed into the console. Could you give feedback on whether the fix worked for you? Thanks
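
Since the settings don't survive a reboot, one way to reapply them automatically is a small systemd unit. A minimal sketch, with the same 1000,1815 values as above (the unit name is just an example; adjust the max clock for your card and the path if nvidia-smi lives elsewhere on your distro):

# /etc/systemd/system/nvidia-lock-clocks.service (file name is just an example)
[Unit]
Description=Lock NVIDIA GPU clocks (workaround for random Xid 61 freezes)

[Service]
Type=oneshot
RemainAfterExit=yes
# enable persistence mode, then lock the graphics clock range
ExecStart=/usr/bin/nvidia-smi -pm ENABLED
ExecStart=/usr/bin/nvidia-smi -lgc 1000,1815

[Install]
WantedBy=multi-user.target

Then enable it with: sudo systemctl enable --now nvidia-lock-clocks.service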

Update:
I was able to gather some more information. In my case the freeze occurs when the GPU is in PCIe Gen2 mode and then switches to PCIe Gen3 with raised clocks.
Somehow some of the RTX generation cards do not handle the switching of PCIe generations well. I guess the switch down to Gen2 is done to save some energy. The fix of raising the idle frequency therefore prevents the card from switching down into PCIe Gen2 mode and forces it to stay in Gen3.
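
If you want to see which PCIe generation the card is currently running at (and the maximum it supports), a query like this should show it, using field names from nvidia-smi --help-query-gpu:

nvidia-smi --query-gpu=pstate,pcie.link.gen.current,pcie.link.gen.max --format=csv -l 3

According to the explanation above, with the clock fix applied pcie.link.gen.current should then stay at 3 on an affected card.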

I have the same problem, randomly, when the monitor is leaving power-save mode. The monitor starts receiving a signal (based on the monitor LED), but the screen remains black and I need to press the reset button. My configuration:

CPU: AMD Ryzen 7 3700X
RAM 64 GB
OS: Arch Linux
GPU: GeForce GTX 1650 SUPER (NVidia driver 440.82)
M/B: MSI B450-A PRO MAX

Here is sample output from journalctl -b -1. The logs after Xid 61 differ from case to case.

июн 06 22:39:22 interlace kernel: NVRM: GPU at PCI:0000:26:00: GPU-156cd20a-62f2-163c-f1c4-ab36b3027b6d
июн 06 22:39:22 interlace kernel: NVRM: GPU Board Serial Number: 
июн 06 22:39:22 interlace kernel: NVRM: Xid (PCI:0000:26:00): 61, pid=629, 0cec(3098) 00000000 00000000
июн 06 22:39:39 interlace audit[2731]: ANOM_ABEND auid=1000 uid=1000 gid=1000 ses=3 pid=2731 comm="GpuWatchdog" exe="/usr/lib/chromium/chromium" sig=11 res=1
июн 06 22:39:39 interlace kernel: GpuWatchdog[2755]: segfault at 0 ip 000056180e9fad33 sp 00007f82561ea510 error 6 in chromium[56180a7d5000+763d000]
июн 06 22:39:39 interlace kernel: Code: 45 c0 48 39 c7 74 05 e8 ab 4c b3 fe c7 45 b0 aa aa aa aa 0f ae f0 41 8b 84 24 e8 00 00 00 89 45 b0 48 8d 7d b0 e8 fd 4d f9 fb <c7> 04 25 00 00 00 00 37 13 00 00 64 48 8b 04 25 28 00 00 00 48 3b
июн 06 22:39:39 interlace kernel: audit: type=1701 audit(1591472379.611:317): auid=1000 uid=1000 gid=1000 ses=3 pid=2731 comm="GpuWatchdog" exe="/usr/lib/chromium/chromium" sig=11 res=1
июн 06 22:39:39 interlace systemd[1]: Created slice system-systemd\x2dcoredump.slice.
июн 06 22:39:39 interlace audit: BPF prog-id=19 op=LOAD
июн 06 22:39:39 interlace kernel: audit: type=1334 audit(1591472379.624:318): prog-id=19 op=LOAD
июн 06 22:39:39 interlace kernel: audit: type=1334 audit(1591472379.624:319): prog-id=20 op=LOAD
июн 06 22:39:39 interlace audit: BPF prog-id=20 op=LOAD
июн 06 22:39:39 interlace systemd[1]: Started Process Core Dump (PID 9376/UID 0).
июн 06 22:39:39 interlace audit[1]: SERVICE_START pid=1 uid=0 auid=4294967295 ses=4294967295 msg='unit=systemd-coredump@0-9376-0 comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
июн 06 22:39:39 interlace kernel: audit: type=1130 audit(1591472379.624:320): pid=1 uid=0 auid=4294967295 ses=4294967295 msg='unit=systemd-coredump@0-9376-0 comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
июн 06 22:39:40 interlace systemd-coredump[9377]: Process 2731 (chromium) of user 1000 dumped core.

@OldToby How many days have you been running now without the freeze (with the clock fix applied)?
I am asking because I will sell my systems to customers and I want to be 100% sure that the freeze no longer occurs.

fwiw I would probably wait until you hit 30 days of uptime to validate that the system is stable. I don't think anyone on this thread has achieved that. I think I did 22 days once.

At the moment, I’m at almost 18 days. Given that I had freezes every 3-5 days or so at some point, I’m cautiously optimistic.

If this is indeed the cause, then I think it should be easier for the people at nvidia to reproduce and solve as part of the driver. Having to artificially keep the clock higher than it needs to be is a kludge.

Just in case it's not obvious to someone: I followed the clock setup and put the commands into a systemd startup script, following the instructions here: https://linuxconfig.org/how-to-run-script-on-startup-on-ubuntu-20-04-focal-fossa-server-desktop
I do believe persistence mode may be necessary even with only one GPU.
I am able to follow the clock speed by running
nvidia-smi dmon

The pclk value in the last column of the output never goes below 1005, which corresponds to the new setting.

I will continue to check that this is true and report back on uptime.

We haven’t been able to observe the problem internally, despite multiple attempts.
The priority remains for us to be able to observe the problem so that we can investigate. Minimization of the conditions to obtain a reliable reproduction would be very useful.
That Xid error happens when some part of the power management logic of the GPU encounters an unexpected situation (I cannot say more because I do not know/understand much more than this, being a userspace driver engineer). Therefore it seems unsurprising that locking GPU clocks and similar tricks might make the problem go away - but that doesn’t directly help us investigate.

I don’t know if it helps, but when I encounter this issue the nvidia kernel driver’s irq thread gets “stuck” and eats up an entire core.

I previously posted my nvidia-bug-report log in another thread, but here are my workstation specs.

My workstation:
Motherboard: X570 AORUS PRO WIFI (Gigabyte)
CPU: AMD R9 3700x
GPU: Gigabyte RTX 2060 Super Windforce OC 8GB
RAM: 64GB DDR4 3200 MHz (4x16GB) (Corsair Vengeance LPX 32GB kit * 2)
Storage: Sabrent Rocket 4.0 1TB NVME Pcie4, Samsung 970 EVO Plus 500GB (boot, root, and home w/lvm), Samsung 970 EVO 500GB (windows drive), Samsung 860 EVO 1TB (bulk/overflow/scratch storage), “big” nfs share mounted over 1GB ethernet.

@ahuillet it can take hours or days for the problem to occur for me, so it's not something you can find just by turning the machine on. I think someone there will need to use the machine regularly for normal tasks. In particular, try using a composited WM, run some other 3D tasks (gaming or rendering etc.), and maybe run nvidia-smi regularly (every second or two?) to fetch GPU usage and temperature.
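
For that kind of periodic polling, nvidia-smi can loop by itself, e.g. every 2 seconds (field names as in the query command posted later in this thread):

nvidia-smi --query-gpu=timestamp,utilization.gpu,temperature.gpu,clocks.gr,pstate,pcie.link.gen.current --format=csv -l 2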

The past few days it happened to me twice. All I did was have my KDE desktop up (with compositing off IIRC) and some apps, including but not limited to: several Chrome windows with TONS of tabs (hw acceleration enabled), Conky (system monitoring app), Kmail, Konversation, Discord, Riot, Steam (once, but not both times), Slack, IntelliJ IDEA Community, JetBrains Toolbox, KDE's Kate and KWrite editors, Docker, libvirt's virt-manager (no local VMs), bluetooth stuff, the Mullvad VPN daemon (but not connected), WireGuard VPN (work-only private, auto-connected via systemd's networking config), and TeamViewer's daemon (not connected).

Did anyone ever try @Uli1234's repro steps, i.e. lock the GPU to minimum clocks (nvidia-smi -lgc 300,300), let it idle for a minute, then wiggle the mouse for a minute (at least that's how I understand it)?

I’ve been having this issue as well. I have tried setting sudo nvidia-smi -pm ENABLED; sudo nvidia-smi -lgc 1000,2115; but it locked up again after about 3 days.

My system:

OS: Arch Linux
Motherboard: asrock fatal1ty b450 gaming-itx/ac
CPU: AMD Ryzen 3600
GPU: GIGAByte RTX 2070 Super Windforce OC 3X 8GB
RAM: 32GB @ 3200MHz

@han310 Could you try setting the Max Performance mode? Maybe that mode does more in the background than just raising the GPU clock.

sudo DISPLAY=:0 nvidia-settings -a "[gpu:0]/GpuPowerMizerMode=1"

The following modes are available:

GpuPowerMizerMode=2 -> Auto
GpuPowerMizerMode=1 -> Prefer Maximum Performance
GpuPowerMizerMode=0 -> Adaptive

The default mode is Auto.
It would be nice if you could give feedback on whether that worked for you.
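
To double-check which mode is currently active, querying the same attribute should work (assuming GPU 0, same as in the command above):

DISPLAY=:0 nvidia-settings -q "[gpu:0]/GpuPowerMizerMode"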

And again, just to be sure: you didn't reboot your system at some point after setting the clock frequencies?

@Uli1234 I am also trying Prefer Maximum Performance for PowerMizer right now. I set it via nvidia-settings GUI several days ago.

Hello, I just want to confirm the same issue on the following setup:

ASUS ROG Zenith II Extreme
AMD Ryzen Threadripper 3970x
Quadro RTX 5000 | 440.82
Fedora 32

Heavy stutter is visible each time a GPU-intensive application (for example DaVinci Resolve, SideFX Houdini, or mpv) is started or closed. About once every two days Xorg locks up with a single thread at 100%.

@ahuillet

I have a large number of absolutely identical systems. Some show the issue, some don't. So there is a random component here that might come from the tolerance stack of the mainboard in combination with the graphics card.
However, if I have an affected system, I can reproduce the issue by forcing the GPU to the lowest possible frequency of the card (in my case 300 MHz).

1.) I open a terminal and use the command nvidia-smi dmon to show the actual GPU and GPU memory clocks. I keep that window open at all times.
2.) I open a second window and use

sudo nvidia-smi -pm ENABLED
sudo nvidia-smi -lgc 300,300

The effect of the command should be visible in the first terminal.

3.) I open a third terminal window and use the following command to show me the PCIe gen used:

nvidia-smi --query-gpu=timestamp,name,pci.bus_id,driver_version,pstate,pcie.link.gen.current,temperature.gpu,clocks.gr,clocks.mem,power.draw --format=csv -l 3

I then start playing around, opening windows, closing them, etc. The freeze normally occurs within a few minutes.
I can see that just before the freeze (or during it) the GPU is in state P8 while the PCIe gen is 3 or 2, which doesn't make sense: P8 is a low-power state and PCIe Gen3 is for high power. I think the problem lies here.
I reproduced the issue, and the freeze always occurs when the card is in P8 and switches the PCIe gen one level up (2->3 or 1->2).

Terminal output provoked freeze #1:

2020/06/10 09:30:06.982, GeForce RTX 2070, 00000000:B3:00.0, 440.64, P8, 1, 41, 300 MHz, 405 MHz, 18.64 W
2020/06/10 09:30:09.986, GeForce RTX 2070, 00000000:B3:00.0, 440.64, P8, 1, 41, 300 MHz, 405 MHz, 18.71 W
2020/06/10 09:30:12.989, GeForce RTX 2070, 00000000:B3:00.0, 440.64, P8, 1, 41, 300 MHz, 405 MHz, 18.71 W
2020/06/10 09:30:15.991, GeForce RTX 2070, 00000000:B3:00.0, 440.64, P8, 1, 40, 300 MHz, 405 MHz, 18.71 W
2020/06/10 09:30:18.995, GeForce RTX 2070, 00000000:B3:00.0, 440.64, P8, 1, 41, 300 MHz, 405 MHz, 18.80 W
2020/06/10 09:30:21.997, GeForce RTX 2070, 00000000:B3:00.0, 440.64, P8, 1, 40, 300 MHz, 405 MHz, 15.62 W
2020/06/10 09:30:25.000, GeForce RTX 2070, 00000000:B3:00.0, 440.64, P0, 3, 41, 300 MHz, 7000 MHz, 38.35 W
2020/06/10 09:30:28.002, GeForce RTX 2070, 00000000:B3:00.0, 440.64, P5, 2, 41, 300 MHz, 810 MHz, 20.81 W
2020/06/10 09:30:31.004, GeForce RTX 2070, 00000000:B3:00.0, 440.64, P0, 3, 41, 300 MHz, 7000 MHz, 28.29 W
2020/06/10 09:30:34.006, GeForce RTX 2070, 00000000:B3:00.0, 440.64, P5, 2, 41, 300 MHz, 810 MHz, 31.78 W
2020/06/10 09:30:37.010, GeForce RTX 2070, 00000000:B3:00.0, 440.64, P0, 3, 41, 300 MHz, 7000 MHz, 27.22 W
2020/06/10 09:30:40.012, GeForce RTX 2070, 00000000:B3:00.0, 440.64, P0, 3, 41, 300 MHz, 7000 MHz, 27.70 W
2020/06/10 09:30:43.014, GeForce RTX 2070, 00000000:B3:00.0, 440.64, P0, 3, 41, 300 MHz, 7000 MHz, 38.01 W
2020/06/10 09:30:46.016, GeForce RTX 2070, 00000000:B3:00.0, 440.64, P5, 2, 41, 300 MHz, 810 MHz, 21.16 W
2020/06/10 09:30:49.018, GeForce RTX 2070, 00000000:B3:00.0, 440.64, P5, 2, 41, 300 MHz, 810 MHz, 21.16 W
2020/06/10 09:30:52.025, GeForce RTX 2070, 00000000:B3:00.0, 440.64, P0, 3, 41, 300 MHz, 7000 MHz, 38.44 W
2020/06/10 09:30:55.030, GeForce RTX 2070, 00000000:B3:00.0, 440.64, P0, 3, 41, 300 MHz, 7000 MHz, 37.41 W
2020/06/10 09:30:58.032, GeForce RTX 2070, 00000000:B3:00.0, 440.64, P0, 3, 41, 300 MHz, 7000 MHz, 25.49 W
2020/06/10 09:31:01.036, GeForce RTX 2070, 00000000:B3:00.0, 440.64, P0, 3, 41, 300 MHz, 7000 MHz, 24.59 W
2020/06/10 09:31:04.038, GeForce RTX 2070, 00000000:B3:00.0, 440.64, P0, 3, 41, 300 MHz, 7000 MHz, 25.34 W
2020/06/10 09:31:07.039, GeForce RTX 2070, 00000000:B3:00.0, 440.64, P0, 3, 41, 300 MHz, 7000 MHz, 19.72 W
2020/06/10 09:31:10.041, GeForce RTX 2070, 00000000:B3:00.0, 440.64, P0, 1, 41, 300 MHz, 7000 MHz, 19.58 W
2020/06/10 09:31:13.052, GeForce RTX 2070, 00000000:B3:00.0, 440.64, P8, 1, 41, 300 MHz, 405 MHz, 19.42 W
2020/06/10 09:31:16.056, GeForce RTX 2070, 00000000:B3:00.0, 440.64, P8, 1, 41, 300 MHz, 405 MHz, 19.66 W
2020/06/10 09:31:19.058, GeForce RTX 2070, 00000000:B3:00.0, 440.64, P8, 2, 40, [Unknown Error], [Unknown Error], [Unknown Error]
2020/06/10 09:31:47.053, GeForce RTX 2070, 00000000:B3:00.0, 440.64, P8, 2, 40, [Unknown Error], [Unknown Error], [Unknown Error]
2020/06/10 09:32:16.053, GeForce RTX 2070, 00000000:B3:00.0, 440.64, P8, 2, 39, [Unknown Error], [Unknown Error], [Unknown Error]
2020/06/10 09:32:32.554, GeForce RTX 2070, 00000000:B3:00.0, 440.64, P8, 2, 39, [Unknown Error], [Unknown Error], [Unknown Error]

Terminal output provoked freeze #2:

2020/06/10 09:44:07.126, GeForce RTX 2070, 00000000:B3:00.0, 440.64, P5, 2, 47, 300 MHz, 810 MHz, 20.99 W
2020/06/10 09:44:10.129, GeForce RTX 2070, 00000000:B3:00.0, 440.64, P8, 1, 47, 300 MHz, 405 MHz, 18.82 W
2020/06/10 09:44:13.132, GeForce RTX 2070, 00000000:B3:00.0, 440.64, P8, 1, 47, 300 MHz, 405 MHz, 19.12 W
2020/06/10 09:44:16.135, GeForce RTX 2070, 00000000:B3:00.0, 440.64, P8, 1, 47, 300 MHz, 405 MHz, 19.11 W
2020/06/10 09:44:19.137, GeForce RTX 2070, 00000000:B3:00.0, 440.64, P8, 1, 47, 300 MHz, 405 MHz, 19.03 W
2020/06/10 09:44:22.140, GeForce RTX 2070, 00000000:B3:00.0, 440.64, P8, 1, 46, 300 MHz, 405 MHz, 19.12 W
2020/06/10 09:44:25.142, GeForce RTX 2070, 00000000:B3:00.0, 440.64, P8, 1, 46, 300 MHz, 405 MHz, 19.06 W
2020/06/10 09:44:28.144, GeForce RTX 2070, 00000000:B3:00.0, 440.64, P8, 1, 46, 300 MHz, 405 MHz, 19.00 W
2020/06/10 09:44:31.147, GeForce RTX 2070, 00000000:B3:00.0, 440.64, P8, 1, 46, 300 MHz, 405 MHz, 19.24 W
2020/06/10 09:44:34.150, GeForce RTX 2070, 00000000:B3:00.0, 440.64, P8, 1, 46, 300 MHz, 405 MHz, 19.15 W
2020/06/10 09:44:37.153, GeForce RTX 2070, 00000000:B3:00.0, 440.64, P3, 3, 46, 300 MHz, 5000 MHz, 35.63 W
2020/06/10 09:44:40.158, GeForce RTX 2070, 00000000:B3:00.0, 440.64, P5, 2, 46, 300 MHz, 810 MHz, 21.74 W
2020/06/10 09:44:43.162, GeForce RTX 2070, 00000000:B3:00.0, 440.64, P0, 3, 46, 300 MHz, 7000 MHz, 21.20 W
2020/06/10 09:44:46.164, GeForce RTX 2070, 00000000:B3:00.0, 440.64, P8, 1, 46, 300 MHz, 405 MHz, 21.47 W
2020/06/10 09:44:49.166, GeForce RTX 2070, 00000000:B3:00.0, 440.64, P0, 1, 46, 300 MHz, 7000 MHz, 21.74 W
2020/06/10 09:44:52.177, GeForce RTX 2070, 00000000:B3:00.0, 440.64, P5, 2, 46, 300 MHz, 810 MHz, 19.70 W
2020/06/10 09:44:55.179, GeForce RTX 2070, 00000000:B3:00.0, 440.64, P5, 2, 46, 300 MHz, 810 MHz, 19.78 W
2020/06/10 09:44:58.181, GeForce RTX 2070, 00000000:B3:00.0, 440.64, P5, 2, 45, 300 MHz, 810 MHz, 19.82 W
2020/06/10 09:45:01.184, GeForce RTX 2070, 00000000:B3:00.0, 440.64, P8, 3, 45, [Unknown Error], [Unknown Error], [Unknown Error]
2020/06/10 09:45:17.595, GeForce RTX 2070, 00000000:B3:00.0, 440.64, P8, 3, 45, [Unknown Error], [Unknown Error], [Unknown Error]

So in the first case there is the switch from P8, 1 to P8, 2
In the second case there is the switch from P5, 2 to P8, 3
The freeze always occurs in state P8.
Maybe that information helps.

Greetings,

PS: In normal use I have never observed the card in state P8 and PCIe Gen3 at the same time, only when the freeze occurs.
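
In case it helps anyone watch for exactly that combination, here is a rough shell sketch (the log path and 1-second interval are arbitrary; it just flags samples where the card reports P8 together with Gen3):

#!/bin/sh
# Hypothetical watcher: append a line whenever the GPU reports pstate P8 and PCIe Gen3 at the same time.
while true; do
    line=$(nvidia-smi --query-gpu=timestamp,pstate,pcie.link.gen.current --format=csv,noheader)
    case "$line" in
        *"P8, 3"*) echo "P8+Gen3 seen: $line" >> /tmp/xid61-watch.log ;;
    esac
    sleep 1
done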

Sounds good, we will try the steps mentioned in comment #223 and will update ASAP with results.
Thank you so much for your efforts.
