Random Xid 61 and Xorg lock-up

I can confirm that after upgrading to the 450.66 driver I have not experienced an Xid 61 error (uptime about 4-5 days now). I am still locking the GPU frequencies to (700,1680) to keep it out of the lowest power state (P8, I believe); if I need to reboot again soon I will leave the frequencies alone and see what happens.

Previously I would encounter 1-2 Xid 61 errors per day, requiring 1-2 reboots to free up the graphics system. Whatever the change in the driver, that seems to at least be preventing the error from occurring in my system.

System:

  • ASRock X470 Gaming-ITX/ac with 32GB of memory (no overclock or XMP)
  • Ryzen 3700X (hyperthreading enabled, no overclock)
  • EVGA RTX 2060KO
  • Ubuntu 18.04 and Ubuntu 20.04

Thanks @amrits, @Uli1234 for your detective work.

@amrits, if possible for you to share, I’d be very interested to hear (and others probably would be too) how this was fixed in the driver.

… and, at just over 2.3 hours of uptime, I’ve had it hang again. I had walked away from the machine (it was idle), and when I returned, I found it frozen. I was still able to connect to it via SSH, confirm the Xid 61, and see the ERR in the nvidia-smi output.

Sep 17 16:04:57 ryzen kernel: NVRM: GPU at PCI:0000:0a:00: GPU-56625a86-54d2-7b0d-a55c-ac9736570e41
Sep 17 16:04:57 ryzen kernel: NVRM: GPU Board Serial Number: 
Sep 17 16:04:57 ryzen kernel: NVRM: Xid (PCI:0000:0a:00): 61, pid=2154, 0d02(31c4) 00000000 00000000
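
For anyone who wants to pull these events out of a saved kernel log, here is a minimal Python sketch. It assumes the NVRM Xid lines keep the format shown above; the function name `parse_xid_lines` and the dictionary keys are made up for illustration and are not part of any NVIDIA tool. Lines whose pid field is not a plain number would simply be skipped.

```python
import re

# Pattern inferred only from the log excerpts in this thread, e.g.
#   NVRM: Xid (PCI:0000:0a:00): 61, pid=2154, 0d02(31c4) 00000000 00000000
XID_RE = re.compile(
    r"NVRM: Xid \(PCI:(?P<bus>[0-9a-fA-F:.]+)\): "
    r"(?P<xid>\d+), pid=(?P<pid>\d+), (?P<detail>.*)"
)


def parse_xid_lines(log_text):
    """Return one dict per NVRM Xid line found in log_text."""
    events = []
    for line in log_text.splitlines():
        match = XID_RE.search(line)
        if match:
            events.append({
                "bus": match.group("bus"),
                "xid": int(match.group("xid")),
                "pid": int(match.group("pid")),
                "detail": match.group("detail").strip(),
            })
    return events


SAMPLE = """\
Sep 17 16:04:57 ryzen kernel: NVRM: GPU at PCI:0000:0a:00: GPU-56625a86-54d2-7b0d-a55c-ac9736570e41
Sep 17 16:04:57 ryzen kernel: NVRM: GPU Board Serial Number:
Sep 17 16:04:57 ryzen kernel: NVRM: Xid (PCI:0000:0a:00): 61, pid=2154, 0d02(31c4) 00000000 00000000
"""

if __name__ == "__main__":
    for event in parse_xid_lines(SAMPLE):
        print(event)
```

The text to feed in could come from `journalctl -k` or a syslog file saved over SSH while the display is hung.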

I’m currently investigating a theory that my Xid 61 errors are related to my motherboard’s SOC voltage setting. The voltage was automatically increased when I selected the DOCP for my 3600 MHz RAM. I’ve now overridden the SOC voltage to a lower value, and I’m seeing significantly improved stability. I’ll follow up in a few days…

yep, that’s what i said here. the popular Ryzen RAM speed of 3600 is a contributor. after 450, i also disabled the SMI fix and so far i’m living an xid-free life.

@services_nvidia1, you were certainly on to something with the RAM speed! :) Before I first posted the other day, I had read your post along with all the others in this thread. When I now look back at your post, I’m sorry I didn’t give enough credit to its second half, i.e., “AND lowering main memory speed.” I had focused on the “nvidia-smi” patch.

After lowering the SOC voltage (and not applying any other workarounds), my system, which would previously consistently encounter an Xid 61 error within a couple minutes of booting, has now been running for 11 hours. In the kernel logs, I should also note my CPU was reporting an error every few hours, e.g.,

Sep 20 15:58:40 kernel: mce: [Hardware Error]: Machine check events logged
Sep 20 15:58:40 kernel: [Hardware Error]: Corrected error, no action required.
Sep 20 15:58:40 kernel: [Hardware Error]: CPU:0 (17:71:0) MC27_STATUS[-|CE|MiscV|-|-|-|SyndV|-|-|-]: 0x982000000002080b
Sep 20 15:58:40 kernel: [Hardware Error]: IPID: 0x0001002e00000500, Syndrome: 0x000000005a020001
Sep 20 15:58:40 kernel: [Hardware Error]: Power, Interrupts, etc. Ext. Error Code: 2, Link Error.
Sep 20 15:58:40 kernel: [Hardware Error]: cache level: L3/GEN, mem/io: IO, mem-tx: GEN, part-proc: SRC (no timeout)

Those errors have stopped, too.
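
As a cross-check on the MC27_STATUS value in that log, here is a small Python sketch that unpacks the 64-bit status word. Bits 63 to 57 follow the architectural x86 machine-check status layout. The SyndV position (bit 53) and the 6-bit extended error code field at bits 21:16 are assumptions based on how Linux handles AMD's scalable MCA, so treat those two as unverified. The name `decode_mc_status` is made up for illustration.

```python
# Flag name -> bit position in the MCi_STATUS register.
# SyndV at bit 53 is an assumption (AMD scalable MCA), see the note above.
STATUS_BITS = {
    "Val": 63,    # register contains a valid error
    "Over": 62,   # an earlier error was overwritten
    "UC": 61,     # uncorrected error
    "En": 60,     # error reporting enabled
    "MiscV": 59,  # MISC register valid
    "AddrV": 58,  # ADDR register valid
    "PCC": 57,    # processor context corrupt
    "SyndV": 53,  # syndrome valid (assumed position)
}


def decode_mc_status(status):
    """Split a machine-check status word into flags and error codes."""
    flags = {name: bool((status >> bit) & 1) for name, bit in STATUS_BITS.items()}
    return {
        "flags": flags,
        # Assumed field width: bits 21:16, shown in the log as "Ext. Error Code".
        "ext_error_code": (status >> 16) & 0x3F,
        # Low 16 bits: the architectural error code.
        "error_code": status & 0xFFFF,
    }


if __name__ == "__main__":
    decoded = decode_mc_status(0x982000000002080B)
    set_flags = [name for name, on in decoded["flags"].items() if on]
    print(set_flags, decoded["ext_error_code"], hex(decoded["error_code"]))
```

For the value above this gives Val, En, MiscV and SyndV set with UC clear, and an extended error code of 2, which agrees with the "CE|MiscV ... SyndV" flags and "Ext. Error Code: 2" that the kernel printed.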

Some concrete numbers on the SOC voltages (“VDDCR SOC Voltage”):

  • default (without selecting DOCP): 1.025 V
  • DOCP value: 1.1 V (encountered Xid 61 errors)
  • manual override: 1.04375 V (stable, so far)

Congrats! btw 3466 sorted out the clocks for me plus gave me higher performance in graphics (games); it was tested by good sources to have 2x higher MIN_fps perf (which matters for stutters) vs 3600. Is it true? Well, at least it helps stability. Blabbed about the divider here.

Now the nvidia issuefest is shrinking in size, and issues leave the party as time goes on, i found 2 more solutions:
[Status][Priority]Issue:Notes

  • [OK][High] Freezefest with Xid61: fix is to reduce clock or limit min PCI mode with nvidia-smi
  • [OK][High] blackscreen after login or resume - can be fixed with nvidia-settings advanced resolution settings, trying combinations, and sync it with desktop manager.
  • [KO][Medium] some windows empty after resume - cannot be fixed, you need to maximize window to refresh each time
  • [KO][Low] Waylandfreeze - have to wait a few more years, but it’s ok, stay with X11 and wait
  • [OK][High] framerate drop in firefox after resume: this was top priority as scrolling was poor, fix is to set layout.frame_rate to your target fps

Remaining issues were reproduced on 10 distros, driver 440,450,455, kernel 5.4,5.5,5.6,5.7,5.8. No, there’s no lucky distro #Distrofest :)

The update to 450.66 arrived via Ubuntu 3 days ago.
3 days went well, then:

The system froze again. This time it was not Xid 61, but I had to hard reset the whole system.

Sep 25 22:40:35 hostname kernel: NVRM: GPU at PCI:0000:0b:00: GPU-cc6e3660-8db4-9431-a0ae-3355f10ac91b
Sep 25 22:40:35 hostname kernel: NVRM: GPU Board Serial Number:
Sep 25 22:40:35 hostname kernel: NVRM: Xid (PCI:0000:0b:00): 8, pid=2629, Channel 00000020

OS: Ubuntu 20.04.1 LTS x86_64
Kernel: 5.4.0-48-generic
AMD Ryzen 9 3900X
Asus ROG STRIX X570-E GAMING
Asus RTX 2070S ROG STRIX
NVIDIA Driver Version: 450.66

Going back to: sudo nvidia-smi -pm ENABLED; sudo nvidia-smi -lgc 1000,2000;
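
If it helps anyone script this workaround, below is a minimal Python sketch that only builds the two command lines quoted above for a given clock range. It does not run them. The function name `clock_lock_commands` is made up for illustration; the `-pm` and `-lgc` flags are the ones used in this thread, and the sanity check on the range is my own assumption about sensible input.

```python
def clock_lock_commands(min_mhz, max_mhz):
    """Build the nvidia-smi invocations used in this thread to lock GPU clocks.

    Returns a list of argv lists; nothing is executed here.
    """
    if not (0 < min_mhz <= max_mhz):
        raise ValueError("expected 0 < min_mhz <= max_mhz")
    return [
        # Enable persistence mode, as in the command quoted above.
        ["sudo", "nvidia-smi", "-pm", "ENABLED"],
        # Lock the GPU clock range to min,max (in MHz).
        ["sudo", "nvidia-smi", "-lgc", f"{min_mhz},{max_mhz}"],
    ]


if __name__ == "__main__":
    for argv in clock_lock_commands(1000, 2000):
        print(" ".join(argv))
```

To actually apply the settings, each list could be passed to `subprocess.run`, for example from a small script run at login.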

@thecakemaster From your system specs, we have the same MB and nearly the same processor. I have a Ryzen 9 3900XT. I’m curious whether you’ve looked at the VDDCR SOC voltage in the BIOS?

I did not.

I think it is very similar to what others are seeing, but I only get the freeze when I am in idle / low compute mode. When I am running my GPUs heavily for a couple of days there are no issues at all, but when I am not utilizing them I get freezes every couple of hours.

Did you hit the Xid 8 error after limiting the clock values?

No. After the update to driver v. 450.66 I stopped locking the frequencies, to test whether the issue was fixed.

I limited the clocks (nvidia-smi -pm ENABLED; sudo nvidia-smi -lgc 1000,2000;) beginning June 29th and didn’t get an Xid after that (as long as I kept the frequencies locked).

In the past I got:

kernel: NVRM: Xid (PCI:0000:0b:00): 61, pid=1607, 0cec(3098) 00000000 00000000

or, more frequently, Xid 61 followed by Xid 8

e.g:
Jun 05 11:15:58 kernel: NVRM: GPU at PCI:0000:0b:00: GPU-cc6e3660-8db4-9431-a0ae-3355f10ac91b
Jun 05 11:15:58 kernel: NVRM: GPU Board Serial Number:
Jun 05 11:15:58 kernel: NVRM: Xid (PCI:0000:0b:00): 61, pid=1600, 0cec(3098) 00000000 00000000
Jun 05 11:16:11 kernel: NVRM: Xid (PCI:0000:0b:00): 8, pid=1600, Channel 00000020

Jun 06 23:16:29 kernel: NVRM: GPU at PCI:0000:0b:00: GPU-cc6e3660-8db4-9431-a0ae-3355f10ac91b
Jun 06 23:16:29 kernel: NVRM: GPU Board Serial Number:
Jun 06 23:16:29 kernel: NVRM: Xid (PCI:0000:0b:00): 61, pid=1590, 0cec(3098) 00000000 00000000
Jun 06 23:19:25 kernel: NVRM: Xid (PCI:0000:0b:00): 8, pid=1590, Channel 0000002e

Jun 07 15:54:13 kernel: NVRM: GPU at PCI:0000:0b:00: GPU-cc6e3660-8db4-9431-a0ae-3355f10ac91b
Jun 07 15:54:13 kernel: NVRM: GPU Board Serial Number:
Jun 07 15:54:13 kernel: NVRM: Xid (PCI:0000:0b:00): 61, pid=1587, 0cec(3098) 00000000 00000000

Jun 08 18:32:44 kernel: NVRM: GPU at PCI:0000:0b:00: GPU-cc6e3660-8db4-9431-a0ae-3355f10ac91b
Jun 08 18:32:44 kernel: NVRM: GPU Board Serial Number:
Jun 08 18:32:44 kernel: NVRM: Xid (PCI:0000:0b:00): 61, pid=1484, 0cec(3098) 00000000 00000000
Jun 08 18:32:55 kernel: NVRM: Xid (PCI:0000:0b:00): 8, pid=1484, Channel 00000020

This time I only got the Xid 8, but system was unresponsive as before.

Sep 25 22:40:35 kernel: NVRM: GPU at PCI:0000:0b:00: GPU-cc6e3660-8db4-9431-a0ae-3355f10ac91b
Sep 25 22:40:35 kernel: NVRM: GPU Board Serial Number:
Sep 25 22:40:35 kernel: NVRM: Xid (PCI:0000:0b:00): 8, pid=2629, Channel 00000020
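
To see how quickly an Xid 8 follows an Xid 61 in excerpts like the June ones above, here is a rough Python sketch that pairs the two events by pid and reports the gap in seconds. The names are made up for illustration. The syslog timestamps carry no year, so the `year` parameter is an assumption, and month names are parsed with the default English locale.

```python
import re
from datetime import datetime

# Timestamp, Xid number and pid from a syslog-style NVRM Xid line.
LINE_RE = re.compile(
    r"^(?P<stamp>\w{3} +\d{1,2} \d{2}:\d{2}:\d{2}) .*"
    r"NVRM: Xid \([^)]*\): (?P<xid>\d+), pid=(?P<pid>\d+)"
)


def xid61_to_xid8_delays(log_text, year=2020):
    """Return [(time_of_xid_61, seconds_until_xid_8_or_None), ...]."""
    pending = {}  # pid -> index in results of an Xid 61 still waiting for an Xid 8
    results = []
    for line in log_text.splitlines():
        match = LINE_RE.search(line)
        if not match:
            continue
        when = datetime.strptime(
            f"{year} {match.group('stamp')}", "%Y %b %d %H:%M:%S"
        )
        xid, pid = int(match.group("xid")), match.group("pid")
        if xid == 61:
            pending[pid] = len(results)
            results.append((when, None))
        elif xid == 8 and pid in pending:
            index = pending.pop(pid)
            start = results[index][0]
            results[index] = (start, (when - start).total_seconds())
    return results


SAMPLE = """\
Jun 05 11:15:58 kernel: NVRM: Xid (PCI:0000:0b:00): 61, pid=1600, 0cec(3098) 00000000 00000000
Jun 05 11:16:11 kernel: NVRM: Xid (PCI:0000:0b:00): 8, pid=1600, Channel 00000020
Jun 06 23:16:29 kernel: NVRM: Xid (PCI:0000:0b:00): 61, pid=1590, 0cec(3098) 00000000 00000000
Jun 06 23:19:25 kernel: NVRM: Xid (PCI:0000:0b:00): 8, pid=1590, Channel 0000002e
Jun 07 15:54:13 kernel: NVRM: Xid (PCI:0000:0b:00): 61, pid=1587, 0cec(3098) 00000000 00000000
Jun 08 18:32:44 kernel: NVRM: Xid (PCI:0000:0b:00): 61, pid=1484, 0cec(3098) 00000000 00000000
Jun 08 18:32:55 kernel: NVRM: Xid (PCI:0000:0b:00): 8, pid=1484, Channel 00000020
"""

if __name__ == "__main__":
    for when, delay in xid61_to_xid8_delays(SAMPLE):
        print(when, delay)
```

On the June lines this yields gaps of 13, 176 and 11 seconds, with no follow-up Xid 8 for the Jun 07 event. An Xid 8 with no preceding Xid 61, like the Sep 25 one, is ignored by this pairing.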

You can probably shave another 200 MHz off the lower bound; my machine is rock stable at 800,2130 (MSI 2070S).

I’ve been logging power levels for a couple of months now, and it makes a noticeable difference in wattage used when the card can idle.

Can you please put the GPU board into a different PCIe slot, change the PCIe setting in the BIOS menu, and check for a few days whether the issue persists after the changes?
Also, can you please reconfirm the complete repro steps and scenario?

Hi all,

Just chiming in with my experiences.

My hardware: GeForce RTX 2060 in an Asus Prime X570-P with a Ryzen 3900X. I have two monitors.

My software: Ubuntu 20.04 with kernel 5.8.9-050809-generic and Nvidia driver 450.66 and X.Org X Server 1.20.8.

I’m experiencing three types of hangs:

Hang type A) The X display freezes but the mouse pointer is still alive. Keyboard not working.
Hang type B) The X display is frozen with mouse and keyboard unresponsive.
Hang type C) Complete system hang. Display is frozen but machine has crashed.

Hang type A sometimes turns into hang type B. With A and B I can ssh into the machine to see what’s going on, but I’m never able to restart X. The machine never shuts down cleanly after a shutdown command is issued.

Hang type C is much rarer and I can’t even ping the machine.

I’ve noticed hang types A and B have always (as far as I can tell) occurred when a video is playing in Firefox or Chrome, usually when I am using the activity switcher, which shows an overview of the windows, or after I return to the machine having paused a YouTube video. These types of hangs happen once a day or more frequently. When I increase the minimum GPU clock rates (using sudo nvidia-smi -pm ENABLED; sudo nvidia-smi -lgc 1000,2000) the crashes are less frequent, perhaps every 3 or 4 days.

I have also reduced the VDDCR SOC Voltage from 1.1V to 1.04375V as suggested by NickB, while not locking the GPU clock rates with nvidia-smi -lgc. This initially seemed to improve matters, with a few days of good uptime, but I then got two type A hangs in quick succession this evening.

The type C hang only seems to happen when the machine is unattended, usually at night, and this has only happened a few times. Because I can’t log into the machine I’m not able to determine the cause of the hang, so I’m focussing on types A and B only for now.

During a type A or B hang, here is the output from nvidia-smi:

Mon Sep 28 21:57:47 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.66       Driver Version: 450.66       CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce RTX 2060    Off  | 00000000:05:00.0  On |                  N/A |
| 25%   39C    P8    14W / 160W |    638MiB /  5926MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1845      G   /usr/lib/xorg/Xorg                 78MiB |
|    0   N/A  N/A      2453      G   /usr/lib/xorg/Xorg                315MiB |
|    0   N/A  N/A    402270      G   .../zoom-client/99/zoom/zoom       60MiB |
|    0   N/A  N/A    407831      G   ...token=7240939308061852906       47MiB |
+-----------------------------------------------------------------------------+

Top shows the irq/128-nvidia process has gotten stuck:

top - 21:58:19 up 2 days,  2:15,  3 users,  load average: 4.89, 5.16, 4.17
Tasks: 549 total,   5 running, 544 sleeping,   0 stopped,   0 zombie
%Cpu(s):  4.2 us,  4.3 sy,  0.0 ni, 91.5 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
MiB Mem :  48163.2 total,  26576.2 free,   7086.4 used,  14500.6 buff/cache
MiB Swap:      0.0 total,      0.0 free,      0.0 used.  40438.9 avail Mem 

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND                        
   2453 root      20   0  426440 257020 135636 R 100.0   0.5 290:59.97 Xorg                           
   1925 root     -51   0       0      0      0 R  99.4   0.0 151:02.75 irq/128-nvidia                 
 794997 juan      20   0 6331968   2.5g   2.4g S   2.9   5.4  25:50.58 VirtualBoxVM                   
   2365 juan       9 -11 3318324  26764  21624 S   0.6   0.1  28:41.09 pulseaudio                     
   2567 juan      20   0 6230064 512480 128060 S   0.6   1.0 116:01.79 gnome-shell                    
 794942 juan      20   0   34392   9760   7808 S   0.6   0.0   0:55.20 VBoxXPCOMIPCD                  
 794948 juan      20   0  918636  25308  16544 S   0.6   0.1   1:51.22 VBoxSVC                        
      1 root      20   0  317304  13908   8436 S   0.0   0.0   0:04.25 systemd                        
      2 root      20   0       0      0      0 S   0.0   0.0   0:00.13 kthreadd                       
      3 root       0 -20       0      0      0 I   0.0   0.0   0:00.00 rcu_gp                         
      4 root       0 -20       0      0      0 I   0.0   0.0   0:00.00 rcu_par_gp                     
      6 root       0 -20       0      0      0 I   0.0   0.0   0:00.00 kworker/0:0H-kblockd           
      9 root       0 -20       0      0      0 I   0.0   0.0   0:00.00 mm_percpu_wq                   
     10 root      20   0       0      0      0 S   0.0   0.0   0:06.49 ksoftirqd/0                    
     11 root      20   0       0      0      0 I   0.0   0.0   0:38.52 rcu_sched                      
     12 root      rt   0       0      0      0 S   0.0   0.0   0:00.21 migration/0                    
     13 root     -51   0       0      0      0 S   0.0   0.0   0:00.00 idle_inject/0                  
     14 root      20   0       0      0      0 S   0.0   0.0   0:00.00 cpuhp/0                        
     15 root      20   0       0      0      0 S   0.0   0.0   0:00.00 cpuhp/1                        
     16 root     -51   0       0      0      0 S   0.0   0.0   0:00.00 idle_inject/1                  
     17 root      rt   0       0      0      0 S   0.0   0.0   0:00.54 migration/1                    
     18 root      20   0       0      0      0 S   0.0   0.0   0:02.17 ksoftirqd/1                    
     20 root       0 -20       0      0      0 I   0.0   0.0   0:00.00 kworker/1:0H-kblockd           
     21 root      20   0       0      0      0 S   0.0   0.0   0:00.00 cpuhp/2                      

Even if I kill the Xorg process, the irq/128-nvidia process will still be pegged to 100%.
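
Since I can still get in over SSH during a type A or B hang, a stuck interrupt thread like this can be spotted from captured top output. Below is a minimal Python sketch, assuming top's default 12-column process layout as pasted above; the function name and the 90% threshold are made up for illustration.

```python
def busy_nvidia_irq_threads(top_output, threshold=90.0):
    """Return (pid, cpu_percent, command) for irq/...nvidia threads at or above threshold."""
    hits = []
    for line in top_output.splitlines():
        fields = line.split()
        # Process rows in top's default layout have 12 columns:
        # PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
        if len(fields) < 12 or not fields[0].isdigit():
            continue
        try:
            cpu = float(fields[8])
        except ValueError:
            continue
        command = " ".join(fields[11:])
        if command.startswith("irq/") and "nvidia" in command and cpu >= threshold:
            hits.append((int(fields[0]), cpu, command))
    return hits


SAMPLE = """\
top - 21:58:19 up 2 days,  2:15,  3 users,  load average: 4.89, 5.16, 4.17
    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
   2453 root      20   0  426440 257020 135636 R 100.0   0.5 290:59.97 Xorg
   1925 root     -51   0       0      0      0 R  99.4   0.0 151:02.75 irq/128-nvidia
   2567 juan      20   0 6230064 512480 128060 S   0.6   1.0 116:01.79 gnome-shell
"""

if __name__ == "__main__":
    print(busy_nvidia_irq_threads(SAMPLE))
```

The input could be the output of `top -b -n 1` captured over SSH; combined with a check for Xid lines in the kernel log, it would make it easier to record when the hang started.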

Syslog shows this:

Sep 28 21:26:42 yaffle kernel: [179011.023461] NVRM: GPU at PCI:0000:05:00: GPU-392ec1a3-7517-dd87-46d9-90bee037fd48
Sep 28 21:26:42 yaffle kernel: [179011.023470] NVRM: GPU Board Serial Number: 
Sep 28 21:26:42 yaffle kernel: [179011.023474] NVRM: Xid (PCI:0000:05:00): 8, pid=2567, Channel 00000020
Sep 28 21:26:52 yaffle kernel: [179021.255534] GpuWatchdog[19305]: segfault at 0 ip 000055cfefe69de7 sp 00007fa683c206d0 error 6 in signal-desktop[55cfecc8b000+53d3000]
Sep 28 21:26:52 yaffle kernel: [179021.255542] Code: 7d b7 00 79 09 48 8b 7d a0 e8 75 53 d3 fe 8b 83 00 01 00 00 85 c0 0f 84 91 00 00 00 48 8b 03 48 89 df be 01 00 00 00 ff 50 68 <c7> 04 25 00 00 00 00 37 13 00 00 c6 05 b7 b0 6f 02 01 80 7d 87 00

Hopefully this can be of use to someone,
Juan.

The only method known to me for avoiding this issue is to not use X at all. When using SSH without X forwarding, everything works fine. I use my machine for deep learning, so I utilize the GPU relatively heavily. This is the only way I can use the machine reliably right now. Maybe I’ll also try Windows in the future, since there is now a CUDA driver for WSL2.

The frequency fix didn’t work for you?

I’ve now seen months of stability since I applied it; before that, my max uptime was rarely more than 3 days. X has been running non-stop since then. (3960X, ASRock Taichi, 2070S)

Hi t.platzer,

The clock tweak workaround improves matters greatly but I can’t say it solved them completely.

In any case, I’d rather have the GPU card running at the lowest clock when the machine is idle, so a longer term fix is still needed. In the meantime I’ll install recent mainline kernels, apply BIOS updates as they come out and I’ll be keeping an eye on this forum.

Hi thecakemaster,
Can you please put the GPU board into a different PCIe slot, change the PCIe setting in the BIOS menu, and check for a few days whether the issue persists after the changes?
Also, can you please reconfirm the complete repro steps and scenario?

ok i’ve got further experience, i’ve tested nvidia behaviour in ubuntu, kubuntu, kde neon, mint, manjaro, arcolinux, endeavorOs, popOs, opensuse, fedora, mx linux, pclinuxos. all distros suffer from the issues (except windows).

Hang type A) The X display freezes but the mouse pointer is still alive. Keyboard not working.

and this issue i have to add to my list of issues as i noticed it only in *ubuntu distros. i replicate it real fast, by standby > resume > lock. i used to reset or ctrl+alt+backspace (kills apps), but now i do

loginctl unlock-sessions

instead, and completely recover my session. In my case, this is an sddm issue, which will probably be fixed when open source drivers are released.
again, no Xid 61 anymore since switching to a less greedy memory clock (and with the 450 drivers, i no longer apply the nvidia-smi fix).
