Random Xid 61 and Xorg lock-up

I think my case is very similar to the others, but I only get the freeze when I am in idle / low-compute mode. When I run my GPUs heavily for a couple of days there are no issues at all, but when I am not utilizing them I get freezes every couple of hours.

Did you hit the Xid 8 error after limiting the clock values?

No, after the update to driver v450.66 I didn’t lock the frequencies anymore, to test whether the issue was fixed.

I limited the clocks (nvidia-smi -pm ENABLED; sudo nvidia-smi -lgc 1000,2000;) on June 29th and haven’t gotten an Xid since (as long as the frequencies stayed locked).
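
For anyone wanting to try the same workaround, this is a minimal sketch of the sequence I use, plus two queries to confirm the lock actually took effect (the 1000,2000 MHz range is just what works on my card):

sudo nvidia-smi -pm ENABLED      # enable persistence mode first, as NVIDIA recommends before locking clocks
sudo nvidia-smi -lgc 1000,2000   # lock the graphics clock to the 1000-2000 MHz range
nvidia-smi -q -d CLOCK           # the Clocks section should now report an SM clock at or above the lower bound
nvidia-smi --query-gpu=clocks.gr,clocks.mem,pstate --format=csv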

In the past I got:

kernel: NVRM: Xid (PCI:0000:0b:00): 61, pid=1607, 0cec(3098) 00000000 00000000

or, more frequently, Xid 61 followed by Xid 8,

e.g.:
Jun 05 11:15:58 kernel: NVRM: GPU at PCI:0000:0b:00: GPU-cc6e3660-8db4-9431-a0ae-3355f10ac91b
Jun 05 11:15:58 kernel: NVRM: GPU Board Serial Number:
Jun 05 11:15:58 kernel: NVRM: Xid (PCI:0000:0b:00): 61, pid=1600, 0cec(3098) 00000000 00000000
Jun 05 11:16:11 kernel: NVRM: Xid (PCI:0000:0b:00): 8, pid=1600, Channel 00000020

Jun 06 23:16:29 kernel: NVRM: GPU at PCI:0000:0b:00: GPU-cc6e3660-8db4-9431-a0ae-3355f10ac91b
Jun 06 23:16:29 kernel: NVRM: GPU Board Serial Number:
Jun 06 23:16:29 kernel: NVRM: Xid (PCI:0000:0b:00): 61, pid=1590, 0cec(3098) 00000000 00000000
Jun 06 23:19:25 kernel: NVRM: Xid (PCI:0000:0b:00): 8, pid=1590, Channel 0000002e

Jun 07 15:54:13 kernel: NVRM: GPU at PCI:0000:0b:00: GPU-cc6e3660-8db4-9431-a0ae-3355f10ac91b
Jun 07 15:54:13 kernel: NVRM: GPU Board Serial Number:
Jun 07 15:54:13 kernel: NVRM: Xid (PCI:0000:0b:00): 61, pid=1587, 0cec(3098) 00000000 00000000

Jun 08 18:32:44 kernel: NVRM: GPU at PCI:0000:0b:00: GPU-cc6e3660-8db4-9431-a0ae-3355f10ac91b
Jun 08 18:32:44 kernel: NVRM: GPU Board Serial Number:
Jun 08 18:32:44 kernel: NVRM: Xid (PCI:0000:0b:00): 61, pid=1484, 0cec(3098) 00000000 00000000
Jun 08 18:32:55 kernel: NVRM: Xid (PCI:0000:0b:00): 8, pid=1484, Channel 00000020

This time I only got the Xid 8, but the system was unresponsive as before.

Sep 25 22:40:35 kernel: NVRM: GPU at PCI:0000:0b:00: GPU-cc6e3660-8db4-9431-a0ae-3355f10ac91b
Sep 25 22:40:35 kernel: NVRM: GPU Board Serial Number:
Sep 25 22:40:35 kernel: NVRM: Xid (PCI:0000:0b:00): 8, pid=2629, Channel 00000020

You can probably shave another 200 MHz off the lower bound; my machine is rock stable at 800,2130 (MSI 2070S).

I’ve been logging power levels for a couple of months now, and it makes a noticeable difference in wattage when the card can idle.
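
For reference, something along these lines is enough for the logging (the field list, interval and file name are just my choices):

nvidia-smi --query-gpu=timestamp,power.draw,clocks.gr,pstate --format=csv -l 60 >> ~/gpu-power.csv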

Can you please put the GPU board into a different PCIe slot, change the PCIe setting in the BIOS menu, and check for a few days whether the issue persists after the changes?
Also, can you please reconfirm the complete repro steps and scenario?

Hi all,

Just chiming in with my experiences.

My hardware: GeForce RTX 2060 in an Asus Prime X570-P with a Ryzen 3900X. I have two monitors.

My software: Ubuntu 20.04 with kernel 5.8.9-050809-generic and Nvidia driver 450.66 and X.Org X Server 1.20.8.

I’m experiencing three types of hangs:

Hang type A) The X display freezes but the mouse pointer is still alive; the keyboard does not work.
Hang type B) The X display is frozen with mouse and keyboard unresponsive.
Hang type C) Complete system hang. The display is frozen and the machine has crashed.

Hang type A sometimes turns into hang type B. With A and B I can ssh into the machine to see what’s going on, but I’m never able to restart X. The machine never cleanly shuts down after a shutdown command is issued.

Hang type C is much rarer and I can’t even ping the machine.
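
For what it’s worth, this is roughly what I run over ssh while a type A or B hang is in progress (nothing exotic, just the obvious checks):

journalctl -k -b | grep -E 'NVRM|Xid'   # pull any Xid messages from the current boot's kernel log
nvidia-smi                              # usually still responds during a type A/B hang
top -b -n 1 | head -n 20                # look for a pegged Xorg or irq/...-nvidia thread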

I’ve noticed that hang types A and B have always (as far as I can tell) occurred while a video was playing in Firefox or Chrome, usually when I am using the activity switcher (which shows an overview of the windows) or after I return to the machine having paused a YouTube video. These types of hangs happen once a day or more frequently. When I increase the minimum GPU clock rate (using sudo nvidia-smi -pm ENABLED; sudo nvidia-smi -lgc 1000,2000) the crashes are less frequent, perhaps every 3 or 4 days.

I have also reduced the VDDCR SOC voltage from 1.1 V to 1.04375 V as suggested by NickB, while not reducing the GPU clock rates with nvidia-smi -lgc. This initially seemed to improve matters, with a few days of good uptime, but I then got two type A hangs in quick succession this evening.

The type C hang only seems to happen when the machine is unattended, usually at night, and it has only happened a few times. Because I can’t log into the machine I’m not able to determine the cause, so I’m focusing on types A and B for now.

During a type A or B hang, here is the output from nvidia-smi:

Mon Sep 28 21:57:47 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.66       Driver Version: 450.66       CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce RTX 2060    Off  | 00000000:05:00.0  On |                  N/A |
| 25%   39C    P8    14W / 160W |    638MiB /  5926MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1845      G   /usr/lib/xorg/Xorg                 78MiB |
|    0   N/A  N/A      2453      G   /usr/lib/xorg/Xorg                315MiB |
|    0   N/A  N/A    402270      G   .../zoom-client/99/zoom/zoom       60MiB |
|    0   N/A  N/A    407831      G   ...token=7240939308061852906       47MiB |
+-----------------------------------------------------------------------------+

Top shows the irq/128-nvidia process has gotten stuck:

top - 21:58:19 up 2 days,  2:15,  3 users,  load average: 4.89, 5.16, 4.17
Tasks: 549 total,   5 running, 544 sleeping,   0 stopped,   0 zombie
%Cpu(s):  4.2 us,  4.3 sy,  0.0 ni, 91.5 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
MiB Mem :  48163.2 total,  26576.2 free,   7086.4 used,  14500.6 buff/cache
MiB Swap:      0.0 total,      0.0 free,      0.0 used.  40438.9 avail Mem 

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND                        
   2453 root      20   0  426440 257020 135636 R 100.0   0.5 290:59.97 Xorg                           
   1925 root     -51   0       0      0      0 R  99.4   0.0 151:02.75 irq/128-nvidia                 
 794997 juan      20   0 6331968   2.5g   2.4g S   2.9   5.4  25:50.58 VirtualBoxVM                   
   2365 juan       9 -11 3318324  26764  21624 S   0.6   0.1  28:41.09 pulseaudio                     
   2567 juan      20   0 6230064 512480 128060 S   0.6   1.0 116:01.79 gnome-shell                    
 794942 juan      20   0   34392   9760   7808 S   0.6   0.0   0:55.20 VBoxXPCOMIPCD                  
 794948 juan      20   0  918636  25308  16544 S   0.6   0.1   1:51.22 VBoxSVC                        
      1 root      20   0  317304  13908   8436 S   0.0   0.0   0:04.25 systemd                        
      2 root      20   0       0      0      0 S   0.0   0.0   0:00.13 kthreadd                       
      3 root       0 -20       0      0      0 I   0.0   0.0   0:00.00 rcu_gp                         
      4 root       0 -20       0      0      0 I   0.0   0.0   0:00.00 rcu_par_gp                     
      6 root       0 -20       0      0      0 I   0.0   0.0   0:00.00 kworker/0:0H-kblockd           
      9 root       0 -20       0      0      0 I   0.0   0.0   0:00.00 mm_percpu_wq                   
     10 root      20   0       0      0      0 S   0.0   0.0   0:06.49 ksoftirqd/0                    
     11 root      20   0       0      0      0 I   0.0   0.0   0:38.52 rcu_sched                      
     12 root      rt   0       0      0      0 S   0.0   0.0   0:00.21 migration/0                    
     13 root     -51   0       0      0      0 S   0.0   0.0   0:00.00 idle_inject/0                  
     14 root      20   0       0      0      0 S   0.0   0.0   0:00.00 cpuhp/0                        
     15 root      20   0       0      0      0 S   0.0   0.0   0:00.00 cpuhp/1                        
     16 root     -51   0       0      0      0 S   0.0   0.0   0:00.00 idle_inject/1                  
     17 root      rt   0       0      0      0 S   0.0   0.0   0:00.54 migration/1                    
     18 root      20   0       0      0      0 S   0.0   0.0   0:02.17 ksoftirqd/1                    
     20 root       0 -20       0      0      0 I   0.0   0.0   0:00.00 kworker/1:0H-kblockd           
     21 root      20   0       0      0      0 S   0.0   0.0   0:00.00 cpuhp/2                      

Even if I kill the Xorg process, the irq/128-nvidia process stays pegged at 100%.
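
If it helps anyone debugging this, a quick way to confirm which interrupt line that thread serves (128 is just the number from my top output; yours may differ):

grep -i nvidia /proc/interrupts            # the nvidia driver's IRQ line and per-CPU counts
ps -eLo pid,rtprio,comm | grep 'irq/128'   # the threaded IRQ handler that is stuck spinning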

Syslog shows this:

Sep 28 21:26:42 yaffle kernel: [179011.023461] NVRM: GPU at PCI:0000:05:00: GPU-392ec1a3-7517-dd87-46d9-90bee037fd48
Sep 28 21:26:42 yaffle kernel: [179011.023470] NVRM: GPU Board Serial Number: 
Sep 28 21:26:42 yaffle kernel: [179011.023474] NVRM: Xid (PCI:0000:05:00): 8, pid=2567, Channel 00000020
Sep 28 21:26:52 yaffle kernel: [179021.255534] GpuWatchdog[19305]: segfault at 0 ip 000055cfefe69de7 sp 00007fa683c206d0 error 6 in signal-desktop[55cfecc8b000+53d3000]
Sep 28 21:26:52 yaffle kernel: [179021.255542] Code: 7d b7 00 79 09 48 8b 7d a0 e8 75 53 d3 fe 8b 83 00 01 00 00 85 c0 0f 84 91 00 00 00 48 8b 03 48 89 df be 01 00 00 00 ff 50 68 <c7> 04 25 00 00 00 00 37 13 00 00 c6 05 b7 b0 6f 02 01 80 7d 87 00

Hopefully this can be of use to someone,
Juan.

The only method I know of to avoid this issue is to not use X at all. When using SSH without X forwarding, everything works fine. I use my machine for deep learning, so I utilize the GPU relatively heavily, and this is the only way I can use the machine reliably right now. Maybe I’ll also try Windows in the future, since there is now a CUDA driver for WSL2.
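
In case it’s useful, this is roughly how I keep the box headless on a systemd-based distro (the targets below are the Ubuntu defaults; adjust for your setup):

sudo systemctl set-default multi-user.target   # boot to a console/ssh-only target, no display manager
sudo systemctl stop display-manager            # stop the running X session right away if needed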

The frequency fix didn’t work for you?

I’ve now seen months of stability since I applied it; before that, my max uptime was rarely more than 3 days. X has been running non-stop since then. (3960X, ASRock Taichi, 2070S)

Hi t.platzer,

The clock tweak workaround improves matters greatly, but I can’t say it has solved the problem completely.

In any case, I’d rather have the GPU running at its lowest clock when the machine is idle, so a longer-term fix is still needed. In the meantime I’ll install recent mainline kernels, apply BIOS updates as they come out, and keep an eye on this forum.

Hi thecakemaster,
Can you please put the GPU board into a different PCIe slot, change the PCIe setting in the BIOS menu, and check for a few days whether the issue persists after the changes?
Also, can you please reconfirm the complete repro steps and scenario?

OK, I’ve got further experience: I’ve tested the NVIDIA driver’s behaviour on Ubuntu, Kubuntu, KDE neon, Mint, Manjaro, ArcoLinux, EndeavourOS, Pop!_OS, openSUSE, Fedora, MX Linux and PCLinuxOS. All distros suffer from the issues (except Windows).

Hang type A) The X display freezes but the mouse pointer is still alive; the keyboard does not work.

I have to add this issue to my list, as I noticed it only in *ubuntu distros. I can reproduce it very quickly via standby > resume > lock. I used to reset or press Ctrl+Alt+Backspace (which kills apps), but now I run

loginctl unlock-sessions

instead, and completely recover my session. In my case this is an SDDM issue, which will probably be fixed once the open-source drivers are released.
Again, no Xid 61 anymore since switching to a less greedy memory clock (and with the 450 drivers I no longer apply the nvidia-smi fix).


Hi amrits,

what should I set my BIOS settings to?

@amrits

Even though the bug manifested very rarely for me and now it’s practically gone, I occasionally see this message in my kernel log:

nvidia 0000:07:00.0: 64.000 Gb/s available PCIe bandwidth, limited by 5.0 GT/s PCIe x16 link at 0000:00:03.1 (capable of 126.016 Gb/s with 8.0 GT/s PCIe x16 link)

It doesn’t seem to lead to any issues, but I still don’t like it. Windows 10 on the same PC doesn’t report anything like that.
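
For completeness, this is how I double-check the negotiated link speed (07:00.0 is the bus ID from the message above; yours will differ):

nvidia-smi --query-gpu=pcie.link.gen.current,pcie.link.gen.max,pcie.link.width.current --format=csv
sudo lspci -s 07:00.0 -vv | grep -E 'LnkCap|LnkSta'   # link capability vs. the currently negotiated link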

My PC specs:

Ryzen 7 3700X (stock)
RAM 4x16GB DDR4 3600MHz (OC’ed from 3200MHz)
ASUS TUF Gaming X570-Plus WiFi
GeForce 1660 Ti (stock)

Linux 5.8.13 vanilla
NVIDIA drivers 455.28 with no custom settings whatsoever except:

Section "Device"
        Identifier      "Videocard0"
        BusID           "PCI:7:0:0"
        Driver          "nvidia"
        VendorName      "NVIDIA"
        BoardName       "NVIDIA Corporation TU116 [GeForce GTX 1660 Ti] (rev a1)"
        Option          "Coolbits" "28"
        Option          "metamodes" "nvidia-auto-select +0+0 {ForceCompositionPipeline=On, ForceFullCompositionPipeline=On}"
        Option          "UseEDIDFreqs" "Off"
        Option          "UseNvKmsCompositionPipeline" "Off"
EndSection

Another update: I installed 450.66 on Ubuntu on Sept 12 and haven’t had a single lockup since. Fingers crossed.

Thanks for the feedback!

Hi all, I’ve been experiencing the Xid 61 lockups as well, and have tried a few remedies without success.

General system notes

  • Intel Core i5-7260U (via NUC 7 - NUC7i5BNH )
  • EVGA GeForce RTX 2070 SUPER FTW3 via Razer Core X External GPU Enclosure
  • Linux Mint 19.3 ( Ubuntu 18.04.3 )
  • Linux kernel 5.4.0-40
  • NVIDIA driver 450.66

Testing notes

  • locked the GPU clock range (raising the minimum) via nvidia-smi -lgc 1200,1815
  • nvidia-smi output every 3 seconds
  • lockup occurs ~every 4 days

Incident 1

# journalctl
Oct 08 03:58:51 nuc7 kernel: NVRM: GPU at PCI:0000:06:00: GPU-876f6fbc-d206-7c5b-243a-c8654d75bdb3
Oct 08 03:58:51 nuc7 kernel: NVRM: GPU Board Serial Number: 
Oct 08 03:58:51 nuc7 kernel: NVRM: Xid (PCI:0000:06:00): 61, pid=1636, 0d02(31c4) 00000000 00000000
Oct 08 03:59:45 nuc7 kernel: watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [cinnamon:3493]

# nvidia-smi --query-gpu=timestamp,name,pci.bus_id,driver_version,pstate,pcie.link.gen.current,temperature.gpu,clocks.gr,clocks.mem,power.draw --format=csv -l 3
timestamp, name, pci.bus_id, driver_version, pstate, pcie.link.gen.current, temperature.gpu, clocks.current.graphics [MHz], clocks.current.memory [MHz], power.draw [W]
2020/10/08 03:58:39.900, GeForce RTX 2070 SUPER, 00000000:06:00.0, 450.66, P5, 2, 41, 1200 MHz, 810 MHz, 27.17 W
2020/10/08 03:58:42.905, GeForce RTX 2070 SUPER, 00000000:06:00.0, 450.66, P5, 2, 41, 1200 MHz, 810 MHz, 27.22 W
2020/10/08 03:58:45.909, GeForce RTX 2070 SUPER, 00000000:06:00.0, 450.66, P5, 2, 41, 1200 MHz, 810 MHz, 27.22 W
2020/10/08 03:58:48.915, GeForce RTX 2070 SUPER, 00000000:06:00.0, 450.66, P5, 2, 41, [Unknown Error], [Unknown Error], [Unknown Error]
2020/10/08 03:58:58.935, GeForce RTX 2070 SUPER, 00000000:06:00.0, 450.66, P5, 2, 41, [Unknown Error], [Unknown Error], [Unknown Error]
2020/10/08 03:59:01.936, GeForce RTX 2070 SUPER, 00000000:06:00.0, 450.66, P5, 2, 41, [Unknown Error], [Unknown Error], [Unknown Error]

Incident 2

# journalctl
Oct 11 16:24:34 nuc7 kernel: NVRM: GPU at PCI:0000:06:00: GPU-876f6fbc-d206-7c5b-243a-c8654d75bdb3
Oct 11 16:24:35 nuc7 kernel: NVRM: GPU Board Serial Number: 
Oct 11 16:24:35 nuc7 kernel: NVRM: Xid (PCI:0000:06:00): 61, pid=1683, 0d02(31c4) 00000000 00000000
Oct 11 16:25:22 nuc7 kernel: watchdog: BUG: soft lockup - CPU#2 stuck for 23s! [cinnamon:4004]

# nvidia-smi --query-gpu=timestamp,name,pci.bus_id,driver_version,pstate,pcie.link.gen.current,temperature.gpu,clocks.gr,clocks.mem,power.draw --format=csv -l 3
timestamp, name, pci.bus_id, driver_version, pstate, pcie.link.gen.current, temperature.gpu, clocks.current.graphics [MHz], clocks.current.memory [MHz], power.draw [W]
2020/10/11 16:24:23.134, GeForce RTX 2070 SUPER, 00000000:06:00.0, 450.66, P5, 2, 45, 1200 MHz, 810 MHz, 27.32 W
2020/10/11 16:24:26.140, GeForce RTX 2070 SUPER, 00000000:06:00.0, 450.66, P5, 2, 45, 1200 MHz, 810 MHz, 27.48 W
2020/10/11 16:24:29.145, GeForce RTX 2070 SUPER, 00000000:06:00.0, 450.66, P5, 2, 45, 1200 MHz, 810 MHz, 27.51 W
2020/10/11 16:24:32.149, GeForce RTX 2070 SUPER, 00000000:06:00.0, 450.66, P5, 2, 45, [Unknown Error], [Unknown Error], [Unknown Error]
2020/10/11 16:24:37.997, GeForce RTX 2070 SUPER, 00000000:06:00.0, 450.66, P5, 2, 45, [Unknown Error], [Unknown Error], [Unknown Error]
2020/10/11 16:24:40.998, GeForce RTX 2070 SUPER, 00000000:06:00.0, 450.66, P5, 2, 45, [Unknown Error], [Unknown Error], [Unknown Error]

Incident 3

# journalctl
Oct 16 03:02:41 nuc7 kernel: NVRM: GPU at PCI:0000:06:00: GPU-876f6fbc-d206-7c5b-243a-c8654d75bdb3
Oct 16 03:02:41 nuc7 kernel: NVRM: GPU Board Serial Number: 
Oct 16 03:02:41 nuc7 kernel: NVRM: Xid (PCI:0000:06:00): 61, pid=1638, 0d02(31c4) 00000000 00000000
Oct 16 03:03:35 nuc7 kernel: watchdog: BUG: soft lockup - CPU#1 stuck for 21s! [cinnamon:6060]

# nvidia-smi --query-gpu=timestamp,name,pci.bus_id,driver_version,pstate,pcie.link.gen.current,temperature.gpu,clocks.gr,clocks.mem,power.draw --format=csv -l 3
timestamp, name, pci.bus_id, driver_version, pstate, pcie.link.gen.current, temperature.gpu, clocks.current.graphics [MHz], clocks.current.memory [MHz], power.draw [W]
2020/10/16 03:02:34.165, GeForce RTX 2070 SUPER, 00000000:06:00.0, 450.66, P5, 2, 45, 1200 MHz, 810 MHz, 27.49 W
2020/10/16 03:02:37.176, GeForce RTX 2070 SUPER, 00000000:06:00.0, 450.66, P5, 2, 45, 1200 MHz, 810 MHz, 27.66 W
2020/10/16 03:02:40.182, GeForce RTX 2070 SUPER, 00000000:06:00.0, 450.66, P5, 2, 45, 1200 MHz, 810 MHz, 27.59 W
2020/10/16 03:02:43.191, GeForce RTX 2070 SUPER, 00000000:06:00.0, 450.66, P5, 2, 45, [Unknown Error], [Unknown Error], [Unknown Error]
2020/10/16 03:02:46.193, GeForce RTX 2070 SUPER, 00000000:06:00.0, 450.66, P5, 2, 45, [Unknown Error], [Unknown Error], [Unknown Error]
2020/10/16 03:02:49.193, GeForce RTX 2070 SUPER, 00000000:06:00.0, 450.66, P5, 2, 45, [Unknown Error], [Unknown Error], [Unknown Error]

I’ve tried to follow the suggestions floating around in this thread (updating the driver to 450.66 and locking the minimum clock frequency to keep the system in the P5 state or higher), but neither seems to remedy the issue.

Are there potentially other suggestions that I may have overlooked? I saw @services_nvidia1’s comments about reducing the CPU memory clock frequency (and/or SOC voltage), but I wasn’t sure whether that’s applicable/appropriate to Intel CPUs.

I hope this provides at the very least another data point in the Great Xid 61 Saga!

Thanks all

-Lawrence

Hi Masterlaws,

One thing I tried that does seem to have made a difference is locking the GPU clock to a single frequency:

eg. nvidia-smi -lgc 1620,1620

Since fixing the frequency this way I haven’t had a lock-up, which is great news for me. Ideally I’d like the card to move freely up and down the clock rates as needed. However, this PC is the one I use for my work and I don’t have much time to try to make Xorg lock up on purpose, so my experience is just another anecdotal data point to be added to the pot.

@juan-nvidia thanks for the suggestion, I’ll give it a shot! Is there any particular theory behind using such high numbers (e.g. a particular P-state or range that we’re trying to maintain)? I’ll probably start at 1300 and work my way up.
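
To pick a value I’ll probably start by checking which clock pairs the card supports and which P-state it idles in, along these lines:

nvidia-smi -q -d SUPPORTED_CLOCKS | head -n 40                         # memory/graphics clock pairs the card accepts
nvidia-smi --query-gpu=pstate,clocks.gr,clocks.mem --format=csv -l 5   # watch the idle P-state and clocks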

cheers

-Lawrence

Hi masterlaws,
you might try to enable persistence mode before locking the GPU clock; at least that’s what NVIDIA suggests in their documentation. And just a reminder that the settings are lost after a reboot if they’re not applied from a startup script.
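
A minimal sketch of such a startup script, assuming a systemd distro with nvidia-smi in /usr/bin (the unit name and the 800,2000 range are placeholders, pick your own bounds):

# /etc/systemd/system/nvidia-lock-clocks.service
[Unit]
Description=Lock NVIDIA GPU clock range (Xid 61 workaround)

[Service]
Type=oneshot
ExecStart=/usr/bin/nvidia-smi -pm ENABLED
ExecStart=/usr/bin/nvidia-smi -lgc 800,2000

[Install]
WantedBy=multi-user.target

Then enable it with: sudo systemctl daemon-reload && sudo systemctl enable --now nvidia-lock-clocks.service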

I’d suggest doing it the other way around: start at 1300 and work your way DOWN. Or rather, start with something like 800 to begin with.

I’ve had frequent Xid 61’s for months, but my card is stable now with 800,* at the lower bound.

If you log the watts the card is drawing, you’ll quickly see that you want to go as low as possible. With a 1300 lower bound you have the equivalent of a permanently glowing light bulb instead of the 3 W flicker you get when the card is allowed to idle properly.