Random Xid 61 and Xorg lock-up

@Uli1234 I am also trying Prefer Maximum Performance for PowerMizer right now. I set it via nvidia-settings GUI several days ago.

Hello, I just want to confirm the same issue on the following setup:

ASUS ROG Zenith II Extreme
AMD Ryzen Threadripper 3970x
Quadro RTX 5000 | 440.82
Fedora 32

Heavy stutter is visible each time a GPU-intensive application (for example DaVinci Resolve / SideFX Houdini / mpv) is started or closed. About once every two days Xorg locks up with a single thread at 100%.

@ahuillet

I have a large number of identical systems. Some show the issue, some do not. So there is a random component here that might come from the tolerance stack-up of the mainboard in combination with the graphics card.
However, if I have an affected system, I can reproduce the issue by forcing the GPU to the lowest possible frequency of the card (in my case 300 MHz).

1.) I open a terminal and use the command nvidia-smi dmon to show the actual GPU and GPU memory clocks. I keep that window open at all times.
2.) I open a second window and use

sudo nvidia-smi -pm ENABLED
sudo nvidia-smi -lgc 300,300

The effect of the command should be visible in the first terminal.

3.) I open a third terminal window and use the following command to show me the PCIe gen used:

nvidia-smi --query-gpu=timestamp,name,pci.bus_id,driver_version,pstate,pcie.link.gen.current,temperature.gpu,clocks.gr,clocks.mem,power.draw --format=csv -l 3

I then start playing around, opening windows, closing them etc. The freeze normally occurs within a few minutes.
I can see that just before the freeze (or during the freeze) the GPU is in state P8 and the PCIe gen is 3 or 2.
That doesn’t make sense: P8 is a low-power state, while PCIe Gen 3 belongs to the high-performance states. I think this is where the problem lies.
I reproduced the issue, and the freeze always occurs when the card is in P8 and switches the PCIe gen one level up (2->3 or 1->2).

Terminal output provoked freeze #1:

2020/06/10 09:30:06.982, GeForce RTX 2070, 00000000:B3:00.0, 440.64, P8, 1, 41, 300 MHz, 405 MHz, 18.64 W
2020/06/10 09:30:09.986, GeForce RTX 2070, 00000000:B3:00.0, 440.64, P8, 1, 41, 300 MHz, 405 MHz, 18.71 W
2020/06/10 09:30:12.989, GeForce RTX 2070, 00000000:B3:00.0, 440.64, P8, 1, 41, 300 MHz, 405 MHz, 18.71 W
2020/06/10 09:30:15.991, GeForce RTX 2070, 00000000:B3:00.0, 440.64, P8, 1, 40, 300 MHz, 405 MHz, 18.71 W
2020/06/10 09:30:18.995, GeForce RTX 2070, 00000000:B3:00.0, 440.64, P8, 1, 41, 300 MHz, 405 MHz, 18.80 W
2020/06/10 09:30:21.997, GeForce RTX 2070, 00000000:B3:00.0, 440.64, P8, 1, 40, 300 MHz, 405 MHz, 15.62 W
2020/06/10 09:30:25.000, GeForce RTX 2070, 00000000:B3:00.0, 440.64, P0, 3, 41, 300 MHz, 7000 MHz, 38.35 W
2020/06/10 09:30:28.002, GeForce RTX 2070, 00000000:B3:00.0, 440.64, P5, 2, 41, 300 MHz, 810 MHz, 20.81 W
2020/06/10 09:30:31.004, GeForce RTX 2070, 00000000:B3:00.0, 440.64, P0, 3, 41, 300 MHz, 7000 MHz, 28.29 W
2020/06/10 09:30:34.006, GeForce RTX 2070, 00000000:B3:00.0, 440.64, P5, 2, 41, 300 MHz, 810 MHz, 31.78 W
2020/06/10 09:30:37.010, GeForce RTX 2070, 00000000:B3:00.0, 440.64, P0, 3, 41, 300 MHz, 7000 MHz, 27.22 W
2020/06/10 09:30:40.012, GeForce RTX 2070, 00000000:B3:00.0, 440.64, P0, 3, 41, 300 MHz, 7000 MHz, 27.70 W
2020/06/10 09:30:43.014, GeForce RTX 2070, 00000000:B3:00.0, 440.64, P0, 3, 41, 300 MHz, 7000 MHz, 38.01 W
2020/06/10 09:30:46.016, GeForce RTX 2070, 00000000:B3:00.0, 440.64, P5, 2, 41, 300 MHz, 810 MHz, 21.16 W
2020/06/10 09:30:49.018, GeForce RTX 2070, 00000000:B3:00.0, 440.64, P5, 2, 41, 300 MHz, 810 MHz, 21.16 W
2020/06/10 09:30:52.025, GeForce RTX 2070, 00000000:B3:00.0, 440.64, P0, 3, 41, 300 MHz, 7000 MHz, 38.44 W
2020/06/10 09:30:55.030, GeForce RTX 2070, 00000000:B3:00.0, 440.64, P0, 3, 41, 300 MHz, 7000 MHz, 37.41 W
2020/06/10 09:30:58.032, GeForce RTX 2070, 00000000:B3:00.0, 440.64, P0, 3, 41, 300 MHz, 7000 MHz, 25.49 W
2020/06/10 09:31:01.036, GeForce RTX 2070, 00000000:B3:00.0, 440.64, P0, 3, 41, 300 MHz, 7000 MHz, 24.59 W
2020/06/10 09:31:04.038, GeForce RTX 2070, 00000000:B3:00.0, 440.64, P0, 3, 41, 300 MHz, 7000 MHz, 25.34 W
2020/06/10 09:31:07.039, GeForce RTX 2070, 00000000:B3:00.0, 440.64, P0, 3, 41, 300 MHz, 7000 MHz, 19.72 W
2020/06/10 09:31:10.041, GeForce RTX 2070, 00000000:B3:00.0, 440.64, P0, 1, 41, 300 MHz, 7000 MHz, 19.58 W
2020/06/10 09:31:13.052, GeForce RTX 2070, 00000000:B3:00.0, 440.64, P8, 1, 41, 300 MHz, 405 MHz, 19.42 W
2020/06/10 09:31:16.056, GeForce RTX 2070, 00000000:B3:00.0, 440.64, P8, 1, 41, 300 MHz, 405 MHz, 19.66 W
2020/06/10 09:31:19.058, GeForce RTX 2070, 00000000:B3:00.0, 440.64, P8, 2, 40, [Unknown Error], [Unknown Error], [Unknown Error]
2020/06/10 09:31:47.053, GeForce RTX 2070, 00000000:B3:00.0, 440.64, P8, 2, 40, [Unknown Error], [Unknown Error], [Unknown Error]
2020/06/10 09:32:16.053, GeForce RTX 2070, 00000000:B3:00.0, 440.64, P8, 2, 39, [Unknown Error], [Unknown Error], [Unknown Error]
2020/06/10 09:32:32.554, GeForce RTX 2070, 00000000:B3:00.0, 440.64, P8, 2, 39, [Unknown Error], [Unknown Error], [Unknown Error]

Terminal output provoked freeze #2:

2020/06/10 09:44:07.126, GeForce RTX 2070, 00000000:B3:00.0, 440.64, P5, 2, 47, 300 MHz, 810 MHz, 20.99 W
2020/06/10 09:44:10.129, GeForce RTX 2070, 00000000:B3:00.0, 440.64, P8, 1, 47, 300 MHz, 405 MHz, 18.82 W
2020/06/10 09:44:13.132, GeForce RTX 2070, 00000000:B3:00.0, 440.64, P8, 1, 47, 300 MHz, 405 MHz, 19.12 W
2020/06/10 09:44:16.135, GeForce RTX 2070, 00000000:B3:00.0, 440.64, P8, 1, 47, 300 MHz, 405 MHz, 19.11 W
2020/06/10 09:44:19.137, GeForce RTX 2070, 00000000:B3:00.0, 440.64, P8, 1, 47, 300 MHz, 405 MHz, 19.03 W
2020/06/10 09:44:22.140, GeForce RTX 2070, 00000000:B3:00.0, 440.64, P8, 1, 46, 300 MHz, 405 MHz, 19.12 W
2020/06/10 09:44:25.142, GeForce RTX 2070, 00000000:B3:00.0, 440.64, P8, 1, 46, 300 MHz, 405 MHz, 19.06 W
2020/06/10 09:44:28.144, GeForce RTX 2070, 00000000:B3:00.0, 440.64, P8, 1, 46, 300 MHz, 405 MHz, 19.00 W
2020/06/10 09:44:31.147, GeForce RTX 2070, 00000000:B3:00.0, 440.64, P8, 1, 46, 300 MHz, 405 MHz, 19.24 W
2020/06/10 09:44:34.150, GeForce RTX 2070, 00000000:B3:00.0, 440.64, P8, 1, 46, 300 MHz, 405 MHz, 19.15 W
2020/06/10 09:44:37.153, GeForce RTX 2070, 00000000:B3:00.0, 440.64, P3, 3, 46, 300 MHz, 5000 MHz, 35.63 W
2020/06/10 09:44:40.158, GeForce RTX 2070, 00000000:B3:00.0, 440.64, P5, 2, 46, 300 MHz, 810 MHz, 21.74 W
2020/06/10 09:44:43.162, GeForce RTX 2070, 00000000:B3:00.0, 440.64, P0, 3, 46, 300 MHz, 7000 MHz, 21.20 W
2020/06/10 09:44:46.164, GeForce RTX 2070, 00000000:B3:00.0, 440.64, P8, 1, 46, 300 MHz, 405 MHz, 21.47 W
2020/06/10 09:44:49.166, GeForce RTX 2070, 00000000:B3:00.0, 440.64, P0, 1, 46, 300 MHz, 7000 MHz, 21.74 W
2020/06/10 09:44:52.177, GeForce RTX 2070, 00000000:B3:00.0, 440.64, P5, 2, 46, 300 MHz, 810 MHz, 19.70 W
2020/06/10 09:44:55.179, GeForce RTX 2070, 00000000:B3:00.0, 440.64, P5, 2, 46, 300 MHz, 810 MHz, 19.78 W
2020/06/10 09:44:58.181, GeForce RTX 2070, 00000000:B3:00.0, 440.64, P5, 2, 45, 300 MHz, 810 MHz, 19.82 W
2020/06/10 09:45:01.184, GeForce RTX 2070, 00000000:B3:00.0, 440.64, P8, 3, 45, [Unknown Error], [Unknown Error], [Unknown Error]
2020/06/10 09:45:17.595, GeForce RTX 2070, 00000000:B3:00.0, 440.64, P8, 3, 45, [Unknown Error], [Unknown Error], [Unknown Error]

So in the first case there is a switch from P8, 1 to P8, 2.
In the second case there is a switch from P5, 2 to P8, 3.
The freeze always occurs in state P8.
Maybe that information helps.

Greetings,

PS: In normal use I have never observed that the card is in state P8 and PCIe Gen3 at the same time. Only when the freeze occurs
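To make this kind of comparison less manual, here is a minimal Python sketch that reads the logged rows and reports the last good sample and the first failed one. It assumes headerless rows in exactly the column order of the query in step 3; the names Sample, parse_samples and first_failure are made up for this illustration and are not part of any NVIDIA tool.

```python
import csv
from typing import Iterable, List, NamedTuple, Optional, Tuple


class Sample(NamedTuple):
    timestamp: str
    pstate: str
    pcie_gen: int
    failed: bool  # True when any field reads "[Unknown Error]"


def parse_samples(lines: Iterable[str]) -> List[Sample]:
    """Parse rows in the column order of the --query-gpu call from step 3."""
    samples = []
    for row in csv.reader(lines, skipinitialspace=True):
        # Skip blank lines, a CSV header row, or truncated rows.
        if len(row) < 10 or not row[5].strip().isdigit():
            continue
        samples.append(Sample(
            timestamp=row[0],
            pstate=row[4].strip(),
            pcie_gen=int(row[5]),
            failed=any("Unknown Error" in field for field in row),
        ))
    return samples


def first_failure(samples: List[Sample]) -> Optional[Tuple[Sample, Sample]]:
    """Return (last good sample, first failed sample), or None."""
    for prev, cur in zip(samples, samples[1:]):
        if cur.failed and not prev.failed:
            return prev, cur
    return None
```

Fed the freeze #1 output, first_failure returns the 09:31:16 sample (P8, gen 1) and the 09:31:19 sample (P8, gen 2); for freeze #2 it returns the P5, gen 2 and P8, gen 3 samples.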

Sounds good, we will try the steps mentioned in comment #223 and will update with results ASAP.
Thank you so much for your efforts.

We haven’t been able to observe the problem internally, despite multiple attempts.

Can you try harder please? It’s coming up for a year, it’s widespread across a range of hardware and Linux distributions. There have been multiple offers of access and everyone is happy to give precise hardware configs.

I’m getting tired of “works on our machine” responses because it indicates lack of will, not difficulty. Have there been no attempts to even increase driver instrumentation to identify the issue?

@amrits

The 300 MHz provocation only works on systems that showed the issue. On a “fine working” system, forcing 300 MHz didn’t do anything (in my case). So if nothing happens on your systems, you might have good ones.
But the 300 MHz provocation can help in testing for or finding a system that shows the issue, since you don’t have to wait for days until the freeze occurs.

@amrits If you don’t find a system at all that shows the issue, I could send you one of my “affected” mainboards and the corresponding graphics card. I have a very strong interest in finding the root cause/solution of this problem. Like I mentioned, I sell systems to customers.

@Uli1234 I do remember rebooting into my other partition every once in a while, so there is a chance that I forgot to reapply the settings after one of the reboots.
Just to be sure, I’ll keep sudo nvidia-smi -pm ENABLED; sudo nvidia-smi -lgc 1000,2115; applied until it locks up again. And if/when it does, I’ll try the GpuPowerMizerMode settings.
Will report back later.

elialbert posted a link in post 214 on how to put the commands in a startup script. Then rebooting the system is no problem.

@han310: Yes sounds like a good plan!
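For anyone who keeps forgetting to reapply the settings after a reboot, here is a minimal Python sketch of such a helper. It is not the script linked in post 214 (which is not reproduced here); the function names and the dry_run switch are made up for this illustration. Only the two nvidia-smi calls come from this thread, and the clock values are whatever you pass in.

```python
import subprocess
from typing import List


def clock_lock_commands(min_mhz: int, max_mhz: int) -> List[List[str]]:
    """Build the two nvidia-smi calls quoted in this thread."""
    if not 0 < min_mhz <= max_mhz:
        raise ValueError("expected 0 < min_mhz <= max_mhz")
    return [
        ["nvidia-smi", "-pm", "ENABLED"],                 # persistence mode
        ["nvidia-smi", "-lgc", f"{min_mhz},{max_mhz}"],   # lock GPU clocks
    ]


def apply(commands: List[List[str]], dry_run: bool = True) -> None:
    """Print the commands, or run them when dry_run is False (needs root)."""
    for argv in commands:
        if dry_run:
            print(" ".join(argv))
        else:
            subprocess.run(argv, check=True)


if __name__ == "__main__":
    # Dry run by default; call apply(..., dry_run=False) as root at boot.
    apply(clock_lock_commands(1000, 2115))
```

How you run it at boot (systemd unit, rc.local, cron @reboot) depends on your distribution and is left out on purpose.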

Arch Linux here. Same problem for half a year.
It crashed up to once every two days.
I had no crash for a week on an up-to-date system, until today.

I captured my full system info after the Xid 61 event:

The pid pointed to nvidia-persistenced once; after I disabled that service, the pid now points to Xorg.

I will now test with SMT (multithreading) disabled and apply all the other tips from this thread.

nvidia-smi-crashed.log (1.4 KB)
nvidia-smi-q-crashed.log (5.8 KB)

@Uli1234,
We ran the experiments you suggested, locking the clocks for almost 6 hours, but did not hit the issue, so it looks like our system is not affected.

You can ship the system to Santa Clara (US) or Pune (India), whichever is more convenient for you.
We would also like to know where you are currently based, so that we can see whether there is any other option for you to send the system to us.
Please let me know, and I will provide the shipping details accordingly.
Thanks a lot for offering the system, which will really expedite our debug process.

@amrits: I sent you a private message

Thanks…

The fix "sudo nvidia-smi -lgc 1000,2145" worked completely for me, thank you so much @OldToby, but I see a lot of frustration still so I’m going to contribute what I can.

My workflow seemed particularly susceptible to this problem! I often have two browsers with videos playing and was crashing several times per day. I tried numerous Linux distributions (Manjaro, Ubuntu, etc.) without any change in the issue.

Only after I disabled hardware acceleration for my browsers (Chromium and Firefox) did the crashes slow to once or twice per day. But I was still experiencing some crashes just watching videos, even after turning off all desktop effects in the OS.

AMD Ryzen 5 3600X 3.8 GHz 6-Core Processor
Asus TUF GAMING X570-PLUS
EVGA GeForce RTX 2060 6 GB SC ULTRA

So if you are trying to recreate the bug, maybe playing multiple videos over an extended period of time could do it. Crashes seemed to occur on YouTube and Twitch. Hope this helps!

The problem just happened to me again (Random Xid 61 and Xorg lock-up - #150 by carlosmerces) after 44 days.

Logs:

jun 12 16:13:02 carlos-tobefilledbyoem rtkit-daemon[1324]: Supervising 6 threads of 4 processes of 1 users.
jun 12 16:13:02 carlos-tobefilledbyoem rtkit-daemon[1324]: Supervising 6 threads of 4 processes of 1 users.
jun 12 16:13:19 carlos-tobefilledbyoem kernel: NVRM: GPU at PCI:0000:07:00: GPU-44c5cdee-5572-eb62-6d76-34ba1fa54eb2
jun 12 16:13:19 carlos-tobefilledbyoem kernel: NVRM: GPU Board Serial Number: 
jun 12 16:13:19 carlos-tobefilledbyoem kernel: NVRM: Xid (PCI:0000:07:00): 61, pid=794, 0cec(3098) 00000000 00000000

It seems it's a very rare problem but dude it froze my system during my work aaaaaahhhhhhh
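If you want to collect these events from journalctl or syslog output, a small Python sketch like the following can pull out the PCI address, the Xid number and the pid. The regular expression is derived only from the NVRM lines quoted in this thread, so treat it as an assumption: other driver versions may format the line differently, in which case the function simply returns None. The names XidEvent and parse_xid are made up for this illustration.

```python
import re
from typing import NamedTuple, Optional

# Matches e.g. "NVRM: Xid (PCI:0000:07:00): 61, pid=794, ..."
XID_RE = re.compile(
    r"NVRM: Xid \(PCI:(?P<pci>[0-9A-Fa-f:.]+)\): (?P<xid>\d+), pid=(?P<pid>\d+)"
)


class XidEvent(NamedTuple):
    pci: str
    xid: int
    pid: int


def parse_xid(line: str) -> Optional[XidEvent]:
    """Return the Xid event found in a kernel log line, or None."""
    match = XID_RE.search(line)
    if match is None:
        return None
    return XidEvent(match["pci"], int(match["xid"]), int(match["pid"]))
```

The pid can then be looked up (for example with ps) to see whether it belongs to Xorg or to nvidia-persistenced.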

I too have this issue. I am able to reproduce it with some regularity using a heavy computation (i.e. a few cores at 100%) and switching between applications (Zoom/Brave).

Jun 14 11:00:47 axoneme kernel: [ 4115.637580] NVRM: Xid (PCI:0000:09:00): 61, pid=1179, 0cec(3098) 00000000 00000000

CPU: Ryzen 3900x
GPU: 2060Super
Mobo: X570 Aorus Pro WiFi
Mem: ripjaw ddr4 3600

@jacronand13 Could you try out the fix in post 209 if it works for you and give feedback? Thanks a lot!

Absolutely. Implemented it yesterday. I'll report back on June 21st to discuss any results.

Does anyone also see segmentation faults in nvidia_drv.so?

In addition to the occasional Xid 61, I have now had this a few times with nvidia-driver-440. Could it be related?

/usr/lib/gdm3/gdm-x-session[2809]: (EE) Caught signal 11 (Segmentation fault). Server aborting
/usr/lib/gdm3/gdm-x-session[2809]: Fatal server error:
/usr/lib/gdm3/gdm-x-session[2809]: (EE)
/usr/lib/gdm3/gdm-x-session[2809]: (EE) Segmentation fault at address 0x8
/usr/lib/gdm3/gdm-x-session[2809]: (EE)
/usr/lib/gdm3/gdm-x-session[2809]: (EE) 2: /usr/lib/x86_64-linux-gnu/nvidia/xorg/nvidia_drv.so (nvidiaAddDrawableHandler+0x4569ec) [0x7f97cf757cd8]
/usr/lib/gdm3/gdm-x-session[2809]: (EE) 1: /lib/x86_64-linux-gnu/libpthread.so.0 (funlockfile+0x50) [0x7f97d22988df]
/usr/lib/gdm3/gdm-x-session[2809]: (EE) 0: /usr/lib/xorg/Xorg (OsLookupColor+0x139) [0x562714d387d9]

Seeing a similar problem with the following config:

AMD Epyc 7302p
AsRock EPYCD8-2T
3 x Quadro RTX 5000
Ubuntu 18.04 server

Each of the 3 RTX cards is doing a different task, and so far the issue only appears on #0 which is the least heavily used card. For the first two days after this computer was installed the issue appeared once per day. After that I tried this:

sudo nvidia-smi -i 0 -pm ENABLED
sudo DISPLAY=:0 nvidia-settings -a "[gpu:0]/GpuPowerMizerMode=1"

After this the card goes to P0 for a while, but later goes back down to P5 or perhaps lower.
After applying this change the system went 4 days with no issues and then it reappeared. Subsequently the problem reappeared again 5 minutes after the reboot, before I could reapply the settings.

Any other suggestions for keeping the card out of the lower power states?