Random Xid 61 and Xorg lock-up

CPU: AMD Ryzen 7 3700X 8-Core Processor
GPU: GeForce RTX 2070 SUPER
Driver: nvidia 430.40
Kernel: 5.2.7-arch1-1-ARCH
DE: Xfce4 + xfwm4

This is a newly built machine that works stably under Windows 10 (hours of 3D gaming) so I think HW is fine. The card performs normally under Linux as well, until the issue happens.

What: Before it happens, everything behaves normally and I get decent 3D performance. Suddenly it enters a state in which Xorg hangs easily. Initially (in that state) I’m still able to move my mouse cursor, but if I, for example, move a window around, bring a background window to the foreground, or scroll my text editor, i.e. do anything that requires a window update, the entire desktop freezes, I can no longer move the mouse cursor, and Xorg takes 100% of CPU. The hang lasts from a few seconds to minutes (the duration varies with how much updating is going on, i.e. moving bigger windows freezes X longer than moving smaller windows). After Xorg CPU usage drops to 0 it sort of “recovers” until I touch any window again, at which point it freezes again.

When: I can’t tell whether anything in particular triggers this state. It could happen at any time, from when the machine is sitting idle to when I’m actively using it. In order to reproduce the issue and get the attached log, I kept the machine running for days and used it as usual. This morning it finally hung without any warning. I ssh-ed into my machine from my phone and took the log.

I’ve debugged a little bit. I found that whenever it enters the abnormal state, my dmesg always has the following lines:

[28736.200395] NVRM: GPU at PCI:0000:07:00: GPU-06a0a514-1651-491d-717c-2e1e24b93c99
[28736.200398] NVRM: GPU Board Serial Number: 
[28736.200399] NVRM: Xid (PCI:0000:07:00): 61, 0cb5(2d50) 00000000 00000000

I’ve searched the docs and there’s barely any detailed discussion of Xid 61.
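For anyone collecting these reports, here is a minimal Python sketch for pulling the fields out of an Xid line so that reports can be compared. The field layout is inferred only from the lines quoted in this thread, not from any NVIDIA specification, and the names `XID_RE`, `STAMP_RE` and `parse_xid_line` are hypothetical.

```python
import re

# Layout assumed from the log lines quoted in this thread only:
#   NVRM: Xid (PCI:<address>): <number>, [pid=<pid>,] <payload>
# The "pid=" field is optional because some of the quoted lines lack it.
XID_RE = re.compile(
    r"NVRM: Xid \(PCI:(?P<pci>[0-9a-fA-F:.]+)\): (?P<xid>\d+),"
    r"(?: pid=(?P<pid>\d+),)? (?P<payload>.*)$"
)

# dmesg-style "[seconds.micros]" prefix (seconds since boot), if present.
STAMP_RE = re.compile(r"^\[\s*(?P<secs>\d+\.\d+)\]")


def parse_xid_line(line):
    """Return a dict of fields for an NVRM Xid line, or None if no match."""
    match = XID_RE.search(line)
    if match is None:
        return None
    stamp = STAMP_RE.match(line)
    return {
        "pci": match.group("pci"),
        "xid": int(match.group("xid")),
        "pid": int(match.group("pid")) if match.group("pid") else None,
        "payload": match.group("payload").strip(),
        # None for journalctl-style lines that carry a wall-clock date instead.
        "uptime_s": float(stamp.group("secs")) if stamp else None,
    }


if __name__ == "__main__":
    line = "[28736.200399] NVRM: Xid (PCI:0000:07:00): 61, 0cb5(2d50) 00000000 00000000"
    print(parse_xid_line(line))
```

Feeding it the output of dmesg or journalctl line by line and keeping the non-None results gives one record per Xid event; the `uptime_s` field shows how long the machine had been up when the event was logged.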

In addition, I’ve managed to get an strace of Xorg while everything was stuck (Xorg at 100%). While Xorg is busy, the kernel delivers a swarm of SIGALRMs (thousands), roughly one every 5000 us. I don’t see any of these when it’s not hanging. I hope this information is useful and gives a clue.

714   07:07:26.348280 --- SIGALRM {si_signo=SIGALRM, si_code=SI_KERNEL} ---
714   07:07:26.348304 rt_sigreturn({mask=[]}) = 0
714   07:07:26.352926 --- SIGALRM {si_signo=SIGALRM, si_code=SI_KERNEL} ---
714   07:07:26.352949 rt_sigreturn({mask=[]}) = 1
714   07:07:26.358292 --- SIGALRM {si_signo=SIGALRM, si_code=SI_KERNEL} ---
714   07:07:26.358315 rt_sigreturn({mask=[]}) = 0
714   07:07:26.363284 --- SIGALRM {si_signo=SIGALRM, si_code=SI_KERNEL} ---
714   07:07:26.363303 rt_sigreturn({mask=[]}) = 1
714   07:07:26.367927 --- SIGALRM {si_signo=SIGALRM, si_code=SI_KERNEL} ---
714   07:07:26.367951 rt_sigreturn({mask=[]}) = 1
714   07:07:26.373287 --- SIGALRM {si_signo=SIGALRM, si_code=SI_KERNEL} ---
714   07:07:26.373310 rt_sigreturn({mask=[]}) = 0
714   07:07:26.378282 --- SIGALRM {si_signo=SIGALRM, si_code=SI_KERNEL} ---
714   07:07:26.378302 rt_sigreturn({mask=[]}) = 1
714   07:07:26.383348 --- SIGALRM {si_signo=SIGALRM, si_code=SI_KERNEL} ---
714   07:07:26.383368 rt_sigreturn({mask=[]}) = 1
714   07:07:26.388290 --- SIGALRM {si_signo=SIGALRM, si_code=SI_KERNEL} ---
714   07:07:26.388310 rt_sigreturn({mask=[]}) = 1
714   07:07:26.392923 --- SIGALRM {si_signo=SIGALRM, si_code=SI_KERNEL} ---
714   07:07:26.392942 rt_sigreturn({mask=[]}) = 0
714   07:07:26.398271 --- SIGALRM {si_signo=SIGALRM, si_code=SI_KERNEL} ---
714   07:07:26.398291 rt_sigreturn({mask=[]}) = 1
714   07:07:26.402926 --- SIGALRM {si_signo=SIGALRM, si_code=SI_KERNEL} ---
714   07:07:26.402949 rt_sigreturn({mask=[]}) = 54088
714   07:07:26.408282 --- SIGALRM {si_signo=SIGALRM, si_code=SI_KERNEL} ---
714   07:07:26.408302 rt_sigreturn({mask=[]}) = 0
714   07:07:26.412926 --- SIGALRM {si_signo=SIGALRM, si_code=SI_KERNEL} ---
714   07:07:26.412945 rt_sigreturn({mask=[]}) = 13522
714   07:07:26.417927 --- SIGALRM {si_signo=SIGALRM, si_code=SI_KERNEL} ---
714   07:07:26.417947 rt_sigreturn({mask=[]}) = 13522
714   07:07:26.422925 --- SIGALRM {si_signo=SIGALRM, si_code=SI_KERNEL} ---
714   07:07:26.422945 rt_sigreturn({mask=[]}) = 0
714   07:07:26.428274 --- SIGALRM {si_signo=SIGALRM, si_code=SI_KERNEL} ---
714   07:07:26.428294 rt_sigreturn({mask=[]}) = 1
714   07:07:26.432926 --- SIGALRM {si_signo=SIGALRM, si_code=SI_KERNEL} ---
714   07:07:26.432948 rt_sigreturn({mask=[]}) = 0
714   07:07:26.438276 --- SIGALRM {si_signo=SIGALRM, si_code=SI_KERNEL} ---
714   07:07:26.438295 rt_sigreturn({mask=[]}) = 1
714   07:07:26.442926 --- SIGALRM {si_signo=SIGALRM, si_code=SI_KERNEL} ---
714   07:07:26.442945 rt_sigreturn({mask=[]}) = 0
714   07:07:26.448280 --- SIGALRM {si_signo=SIGALRM, si_code=SI_KERNEL} ---
714   07:07:26.448300 rt_sigreturn({mask=[]}) = 0
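To check the "every 5000 us or so" figure against the full trace, the gaps between consecutive SIGALRM deliveries can be computed from the timestamps. This is a minimal sketch that assumes the trace has the same layout as the excerpt above (pid, then a time of day with microseconds, as produced by something like `strace -f -tt`); the function name `sigalrm_intervals_us` is hypothetical.

```python
import re
from datetime import datetime, timedelta

# Matches lines such as:
#   714   07:07:26.348280 --- SIGALRM {si_signo=SIGALRM, si_code=SI_KERNEL} ---
SIGALRM_RE = re.compile(r"^\d+\s+(\d{2}:\d{2}:\d{2}\.\d{6}) --- SIGALRM ")


def sigalrm_intervals_us(lines):
    """Return the gaps, in microseconds, between consecutive SIGALRM lines.

    Only the time of day is parsed, so a trace that crosses midnight
    would need extra handling.
    """
    stamps = []
    for line in lines:
        match = SIGALRM_RE.match(line.strip())
        if match:
            stamps.append(datetime.strptime(match.group(1), "%H:%M:%S.%f"))
    one_us = timedelta(microseconds=1)
    return [(later - earlier) // one_us for earlier, later in zip(stamps, stamps[1:])]


# The first few lines of the excerpt above.
sample = [
    "714   07:07:26.348280 --- SIGALRM {si_signo=SIGALRM, si_code=SI_KERNEL} ---",
    "714   07:07:26.348304 rt_sigreturn({mask=[]}) = 0",
    "714   07:07:26.352926 --- SIGALRM {si_signo=SIGALRM, si_code=SI_KERNEL} ---",
    "714   07:07:26.352949 rt_sigreturn({mask=[]}) = 1",
    "714   07:07:26.358292 --- SIGALRM {si_signo=SIGALRM, si_code=SI_KERNEL} ---",
]

if __name__ == "__main__":
    gaps = sigalrm_intervals_us(sample)
    print(gaps, "mean:", sum(gaps) / len(gaps))
```

Running it over the whole decompressed trace instead of `sample` would show whether the interval stays near 5 ms for the entire hang or drifts.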

Any help is appreciated.

nvidia-bug-report.log.gz (875 KB)
Xorg-strace.gz (22.9 KB)

Another thread in which the OP encounters a similar issue (Xid 61, 0cb5(2d50)) on a 2080 Ti:
https://devtalk.nvidia.com/default/topic/1057744/linux/gpus-give-err-with-nvrm-xid-pci-0000-b5-00-61/

Tem 22 00:05:34 boxx-cemil kernel: NVRM: GPU at PCI:0000:b6:00: GPU-ec01f57c-3598-5ad1-f52e-510b3cd43f28
Tem 22 00:05:34 boxx-cemil kernel: NVRM: GPU Board Serial Number: 
Tem 22 00:05:34 boxx-cemil kernel: NVRM: Xid (PCI:0000:b6:00): 61, 0cb5(2d50) 00000000 00000000

Apparently this issue affects a range of cards, so it is worth attention.

I am experiencing exactly the same issue on a very similar hw/sw config:

CPU: Ryzen 3700X
GPU: RTX 2070 Super
MOBO: X570
driver: 430.40
system: Debian 10, kernel 4.19.0-5-amd64
DE: kde5 + sddm (standard in Debian 10)

My error is exactly the same:
NVRM: Xid (PCI:0000:0b:00): 61, 0cb5(2d50) 00000000 00000000

Before the 2070 Super I was using a GTX 1060 on the same setup with the same driver and did not experience these issues.

I tried changing PCI-E ports and speed from x16 to x8 but the issue remained.

I have also experienced intermittent Xid 61 on a similar hw/sw config:

CPU: Ryzen 3900X
GPU: RTX 2070
MOBO: ASUS PRIME x570-P
driver: 435.21
system: Arch Linux kernel 5.2.13-arch1-1-ARCH

Error is the same: NVRM: Xid (PCI:0000:09:00): 61, pid=1361563, 0cde(308c) 00000000 00000000

It happened at 1 day and 8 hours of uptime.
nvidia-bug-report.log.gz (560 KB)

Hi All,

I have filed bug 200552106 internally for tracking purposes.

I tried to reproduce the issue locally on the configuration below, but with no luck.

CPU: AMD Ryzen 7 2700X Eight-Core Processor
GPU: GeForce RTX 2070
Driver: nvidia 430.40
Kernel: 5.3.0-arch1-1-ARCH
DE: Xfce4
X.Org X Server 1.20.5

I played Quake Champions along with YouTube clips and the glmark2 benchmark for a few hours but did not observe any Xid 61 error. I will keep doing the same and observe for a day.

Meanwhile, if anyone has been able to determine concrete repro steps, or a particular application or game that triggers the issue on demand, please let us know.

Thanks in advance.

Hi,

As per my last comment, please provide concrete repro steps or name a particular application or game that can reproduce the issue.

I cannot provide any concrete repro steps since it seems to happen randomly. In my case the application is a proprietary VTK application which I cannot share.

I noticed that all the people in this thread have an AMD Ryzen 3000 series CPU, so perhaps that is related.

I’m also seeing this particular freeze, and it’s happening seemingly at random. Output of “sudo journalctl -b-1”:

Sep 26 15:15:59 simplex kernel: NVRM: GPU at PCI:0000:09:00: GPU-9994e0ac-319e-fab8-0df6-6f3d8c681bd4
Sep 26 15:15:59 simplex kernel: NVRM: GPU Board Serial Number:
Sep 26 15:15:59 simplex kernel: NVRM: Xid (PCI:0000:09:00): 61, pid=1049, 0cde(308c) 00000000 00000000

CPU: AMD Ryzen 9 3900X
GPU: GeForce RTX 2080 Ti
Driver: 435.21
Kernel: 5.3.1-arch1-1-ARCH
X.Org X Server 1.20.5

I’ve had this happen to me while typing in neovim, while just scrolling in Firefox, etc., and it usually happens 1-2 times per day. I haven’t spent as long in my Windows partition as in my Arch partition, but so far it hasn’t happened in Windows.

FWIW I’m using 3 monitors attached via DisplayPort.

It’s been 28 days and I have not experienced any more such issues so far. My config hasn’t changed, apart perhaps from some system updates.

For those 28 days I’ve been playing games, running some cryptocoin software, etc., all to stress the GPU and the computer overall. It’s been rock solid.

I’ll be restarting the system over the weekend to load the linux-image-4.19.0-6 kernel update and to install the 430.50 driver after that. We’ll see if anything changes.

So I’m starting to see a similar error in Windows as well. From the System logs in the Windows Event Viewer:

"""
The description for Event ID 14 from source nvlddmkm cannot be found. Either the component that raises this event is not installed on your local computer or the installation is corrupted. You can install or repair the component on the local computer.

If the event originated on another computer, the display information had to be saved with the event.

The following information was included with the event:

\Device\Video3
0cde(308c) 00000000 00000000

The message resource is present but the message was not found in the message table
"""

I mainly think it’s similar because of this bit: 0cde(308c) 00000000 00000000. I’m on driver version 436.30, Windows 10 Education. This happened mainly while I was playing League of Legends, though it also happened in Firefox while watching YouTube. In Windows, this error can sometimes recover by itself without a system restart.

I think at this point I might just contact EVGA support and try to get an RMA, and if that doesn’t resolve the issue, maybe look into the mobo/PSU.

Another update with some test results after talking for a bit with EVGA support.

Test 1: GTX 670, single monitor (DisplayPort).
- Linux and Windows both seemed fine (I split my time between the two partitions) for a full day.

Test 2: RTX 2080 Ti, single monitor (DisplayPort).
- Linux only: seemed fine for a full day.

Test 3: RTX 2080 Ti, triple monitor (all DisplayPort).
- Linux froze within a few hours of using the machine (Firefox, terminal apps, Slack, Discord, and Spotify were the primary apps being used).

Hello, I have exactly the same error: Xorg freezes and hangs, the system is generally unresponsive, and it has to be shut down to recover. It happens at completely random times, usually when I’m doing light web browsing or when I’m away from the computer.

The logs report only these three lines before/during the hang:

NVRM: GPU at PCI:0000:08:00: GPU-0c2194bd-374b-8167-920a-a0f65ab21955
NVRM: GPU Board Serial Number: 0323518084917
NVRM: Xid (PCI:0000:08:00): 61, pid=697, 0cde(308c) 00000000 00000000

My system specs are also remarkably similar:

CPU: AMD Ryzen 7 3700X 8- (16) @ 3.600Ghz
GPU: GeForce RTX 2080 Rev. A
MOBO: X570 I AORUS PRO WIFI -CF
Driver: nvidia 435.21
Kernel: 5.3.4-arch1-1-ARCH
DE: i3

I have had this problem ever since I originally installed Arch on my machine on August 25th.

Thanks so much for any support you guys have.

CPU: AMD Ryzen 3000X Series
Board: X570 AORUS
GPU: GeForce RTX 2070
Driver: nvidia 430.40
Kernel: 5.3.0-arch1-1-ARCH
DE: Xfce4
X.Org X Server 1.20.5

I kept running the tests below simultaneously over the weekend, but have had no luck reproducing the issue so far.

Played Quake Champions
Played YouTube clips
Executed the glmark2 benchmark

Please confirm the steps you used to install the NVIDIA driver, and anything specific (repro steps) that could trigger the issue at my end.

On my Arch system all I did to install the NVIDIA driver was

sudo pacman -S nvidia

and then I configured X.org with nvidia-settings (sudo pacman -S nvidia-settings).

I’ve noticed that the freeze is more “reproducible” when there are multiple monitors attached. I have 1x DELL U2412M and 2x ASUS VS248H attached via DisplayPort. When I only have the DELL U2412M attached, the freezes don’t seem to happen (at least not within the day that I tested it for).

On a side note, while playing Teamfight Tactics on my Windows partition over the weekend I was able to get a freeze a bit more consistently (though still not a very reliable repro). Generally, I found the system to freeze when the game starts, the game ends, or when I alt-tab out of the game. Note that I play on fullscreen borderless and am using 3 monitors as noted above. Just an anecdote FWIW.

I still haven’t been able to find anything similar on my Linux partition, unfortunately.

I always install the NVIDIA drivers from the binary installer, so my installation is as simple as (for example):

bash NVIDIA-Linux-x86_64-430.34.run

I follow the on-screen instructions, always overwriting everything if needed, installing the 32-bit libraries, and never accepting changes to the Xorg config files. Then I reboot.

Yesterday I had the first occurrence of the issue in 35 days of uptime. It was basically the same as everything described above and happened while I was not using the computer (the monitors were turned off and the system wasn’t doing anything special, but it was online).

What I additionally checked after it happened:

  • the system was available through network/ssh, I could connect and do whatever I wanted from console;
  • I was running watch "nvidia-smi" in background and this process was taking 100% CPU time after the issue happened;
  • restarting the X server did not help;
  • shutdown -r now was taking very long so I eventually powered computer off and on.

Next I upgraded the kernel to 4.19.0-6-amd64 (Debian 10) and the driver to NVIDIA-Linux-x86_64-430.50.run. I’ve been using it since this morning with no issues so far, but it’s been only several hours.

I was able to reproduce it twice today. I simply left it idle for about 30 min and it hung. No window is open except for the xfce4 itself.

On Windows it has been pretty stable so far.

[ 4647.125149] NVRM: GPU at PCI:0000:07:00: GPU-06a0a514-1651-491d-717c-2e1e24b93c99
[ 4647.125151] NVRM: GPU Board Serial Number:
[ 4647.125156] NVRM: Xid (PCI:0000:07:00): 61, pid=697, 0cde(308c) 00000000 00000000

I forgot to mention the driver version, which is 435.21.

OK, a bit anecdotal, but this freeze hasn’t happened to me in the few days since I updated my BIOS.

My mobo is the Gigabyte X570 AORUS Ultra, and I was initially on BIOS version F2, I believe (or whatever version shipped with the motherboard). I updated to version F6b (X570 AORUS ULTRA (rev. 1.0) Support | Motherboard - GIGABYTE U.S.A.) and I haven’t seen this freeze on Windows or Linux over the weekend.

EDIT: Spoke too soon, Xid 61 happened again.

I have a similar error.
Xorg freezes and hangs at random.
The top command shows Xorg, irq/77-nvidia, or gnome-shell saturating a single CPU thread.

$ sudo journalctl -b-1
...
10月 28 07:49:28 ryzen kernel: NVRM: GPU at PCI:0000:09:00: GPU-be2df1f1-299f-707a-132f-03ba00b48935
10月 28 07:49:28 ryzen kernel: NVRM: GPU Board Serial Number:
10月 28 07:49:28 ryzen kernel: NVRM: Xid (PCI:0000:09:00): 61, pid=2047, 0cde(308c) 00000000 00000000
...

My PC specs are:

CPU: AMD Ryzen 7 3700X 8-Core Processor
GPU: GeForce GTX 1660 Ti
MOBO: B450M-A
Driver: nvidia 435.21
Kernel: 4.15.0-66-generic
(Ubuntu 18.04 LTS)

I can add myself to the lucky few:

Oct 29 19:54:55 . kernel: NVRM: GPU at PCI:0000:08:00: GPU-5fee9a2d-fb8d-d71d-4d0f-a8e03fa1fd01
Oct 29 19:54:55 . kernel: NVRM: GPU Board Serial Number:
Oct 29 19:54:55 . kernel: NVRM: Xid (PCI:0000:08:00): 61, pid=812, 0cde(308c) 00000000 00000000

It also happened a few days ago, for the first time since I built the PC about a month ago.
As in other posts, the system was still running and responding via SSH, but the X server had locked up completely.

The system is a
AMD Ryzen 7 3700X
ASUS TUF Gaming X570-Plus
GeForce GTX 1660
arch linux and nvidia 435.21 drivers: 5.3.7-arch1-2-ARCH #1 SMP PREEMPT @1572002934 x86_64 GNU/Linux
with three monitors attached.