Random Xid 61 and Xorg lock-up

@Uli1234,
We ran the experiment you suggested, locking the clocks for almost 6 hours, but did not hit the issue, so it looks like the system is not affected.

You can ship the system to Santa Clara (US) or Pune (India), whichever is more convenient for you.
We would also like to know where you are currently based, so that we can see whether there is any alternative option for you to send the system to us.
Please let me know, and I will provide the shipping details accordingly.
Thanks a lot for offering the system, which will really expedite our debug process.

@amrits: I sent you a private message

Thanks…

The fix “sudo nvidia-smi -lgc 1000,2145” worked completely for me, thank you so much @OldToby, but I see a lot of frustration still, so I’m going to contribute what I can.

My workflow seemed particularly susceptible to this problem! I often have two browsers with videos playing and was crashing several times per day. I tried numerous Linux distributions (Manjaro, Ubuntu, etc.) without any change in the issue.

Only after I disabled hardware acceleration for my browsers (Chromium and Firefox) did the crashes slow to once or twice per day. But I was still experiencing some crashes just watching videos, even after turning off all desktop effects in the OS.

AMD Ryzen 5 3600X 3.8 GHz 6-Core Processor
Asus TUF GAMING X570-PLUS
EVGA GeForce RTX 2060 6 GB SC ULTRA

So if you are trying to recreate the bug, maybe playing multiple videos over an extended period of time could do it. Crashes seemed to occur on YouTube and Twitch. Hope this helps!

The problem just happened to me again (Random Xid 61 and Xorg lock-up) after 44 days.

Logs:

jun 12 16:13:02 carlos-tobefilledbyoem rtkit-daemon[1324]: Supervising 6 threads of 4 processes of 1 users.
jun 12 16:13:02 carlos-tobefilledbyoem rtkit-daemon[1324]: Supervising 6 threads of 4 processes of 1 users.
jun 12 16:13:19 carlos-tobefilledbyoem kernel: NVRM: GPU at PCI:0000:07:00: GPU-44c5cdee-5572-eb62-6d76-34ba1fa54eb2
jun 12 16:13:19 carlos-tobefilledbyoem kernel: NVRM: GPU Board Serial Number: 
jun 12 16:13:19 carlos-tobefilledbyoem kernel: NVRM: Xid (PCI:0000:07:00): 61, pid=794, 0cec(3098) 00000000 00000000
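
For anyone who wants to check their own logs for the same event, here is a minimal Python sketch that picks Xid lines out of kernel log text. It assumes the line format shown in the logs above (`NVRM: Xid (PCI:...): 61, pid=794, ...`); other driver versions may print the pid differently, and the function name `find_xid_events` is just an illustrative choice.

```python
import re

# Matches NVRM Xid lines in the format seen in this thread, e.g.
#   NVRM: Xid (PCI:0000:07:00): 61, pid=794, 0cec(3098) 00000000 00000000
# Assumption: pid is a plain number, as in the logs posted here.
XID_RE = re.compile(r"NVRM: Xid \((PCI:[0-9a-fA-F:.]+)\): (\d+), pid=(\d+)")


def find_xid_events(lines, xid=61):
    """Return (pci_address, xid, pid) tuples for log lines reporting the given Xid."""
    events = []
    for line in lines:
        match = XID_RE.search(line)
        if match and int(match.group(2)) == xid:
            events.append((match.group(1), int(match.group(2)), int(match.group(3))))
    return events
```

One way to use it would be to save the output of `journalctl -k` to a file and pass the file's lines to `find_xid_events`.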

It seems it’s a very rare problem, but dude, it froze my system during my work aaaaaahhhhhhh

I too have this issue. I am able to reproduce it with some regularity using a heavy computation (i.e. a few cores at 100%) and switching between applications (Zoom/Brave).

Jun 14 11:00:47 axoneme kernel: [ 4115.637580] NVRM: Xid (PCI:0000:09:00): 61, pid=1179, 0cec(3098) 00000000 00000000

CPU: Ryzen 3900x
GPU: 2060Super
Mobo: X570 Aorus Pro WiFi
Mem: ripjaw ddr4 3600

@jacronand13 Could you try out the fix in post 209 if it works for you and give feedback? Thanks a lot!

Absolutely. Implemented it yesterday. I’ll report back on June 21st to discuss any results.

Does anyone also see segmentation faults in nvidia_drv.so?

In addition to occasional xid 61 I have now had this a few times with nvidia-driver-440. Could it be related?

/usr/lib/gdm3/gdm-x-session[2809]: (EE) Caught signal 11 (Segmentation fault). Server aborting
/usr/lib/gdm3/gdm-x-session[2809]: Fatal server error:
/usr/lib/gdm3/gdm-x-session[2809]: (EE)
/usr/lib/gdm3/gdm-x-session[2809]: (EE) Segmentation fault at address 0x8
/usr/lib/gdm3/gdm-x-session[2809]: (EE)
/usr/lib/gdm3/gdm-x-session[2809]: (EE) 2: /usr/lib/x86_64-linux-gnu/nvidia/xorg/nvidia_drv.so (nvidiaAddDrawableHandler+0x4569ec) [0x7f97cf757cd8]
/usr/lib/gdm3/gdm-x-session[2809]: (EE) 1: /lib/x86_64-linux-gnu/libpthread.so.0 (funlockfile+0x50) [0x7f97d22988df]
/usr/lib/gdm3/gdm-x-session[2809]: (EE) 0: /usr/lib/xorg/Xorg (OsLookupColor+0x139) [0x562714d387d9]

Seeing a similar problem with the following config:

AMD Epyc 7302p
AsRock EPYCD8-2T
3 x Quadro RTX 5000
Ubuntu 18.04 server

Each of the 3 RTX cards is doing a different task, and so far the issue only appears on #0 which is the least heavily used card. For the first two days after this computer was installed the issue appeared once per day. After that I tried this:

sudo nvidia-smi -i 0 -pm ENABLED
sudo DISPLAY=:0 nvidia-settings -a "[gpu:0]/GpuPowerMizerMode=1"

After this the card goes to P0 for a while, but later drops back down to P5 or perhaps lower.
After applying this change the system went 4 days with no issues, and then the issue reappeared. It then reappeared again 5 minutes after the reboot, before I could reapply the settings.

Any other suggestions for keeping the card out of the lower power states?

@jameskzd28

1.) Are you sure the system didn’t reboot during those 4 days? It would help to put the PowerMizer setting in a startup script.

2.) When the issue appeared again after 4 days, are you sure it was your card #0 that triggered the issue? It might be worth a try to apply the PowerMizer setting to all three cards.

3.) Instead of PowerMizer you could try to lock the frequencies as explained in this thread, for example to 1000MHz-2000MHz. In my case, with a min. freq of 1000MHz the card always stays in P0.

@Uli1234 thanks for all of the work you’re doing to track down this issue

1.) I’m sure the system didn’t reboot during the 4 day period. I ran a simple script to monitor the situation and that script would not have been running if the system had been rebooted.

2.) I’m 90% sure it was still card #0 but I’ll check again next time to be sure. Just yesterday after I posted I saw that card #0 went to P8 briefly a couple hours after the PowerMizer setting was applied.

3.) That’s a good suggestion, thanks.
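
For reference, a monitoring script along the lines mentioned in 1.) could look roughly like the Python sketch below. It is not the script that was actually used; it assumes `nvidia-smi --query-gpu=index,pstate --format=csv,noheader` prints one line per card such as `0, P8`, and the function names and the `threshold` parameter are hypothetical.

```python
import subprocess

# Query each GPU's index and performance state; expected output lines look like "0, P8".
QUERY_CMD = ["nvidia-smi", "--query-gpu=index,pstate", "--format=csv,noheader"]


def parse_pstates(csv_text):
    """Parse lines like '0, P8' into a dict {gpu_index: pstate_number}."""
    states = {}
    for line in csv_text.strip().splitlines():
        index, pstate = (field.strip() for field in line.split(","))
        states[int(index)] = int(pstate.lstrip("P"))
    return states


def low_power_gpus(states, threshold=5):
    """Return indices of GPUs at P<threshold> or lower power (a higher P number means lower power)."""
    return sorted(i for i, p in states.items() if p >= threshold)


def read_pstates():
    """Run nvidia-smi once and return the parsed performance states."""
    result = subprocess.run(QUERY_CMD, capture_output=True, text=True, check=True)
    return parse_pstates(result.stdout)
```

Calling `read_pstates()` in a loop with a sleep and logging the result of `low_power_gpus` with a timestamp would show when a card such as #0 drops to P5 or P8.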

Just to follow up, the issue reappeared just now and it was still with card #0, with no intervening reboots. Going to try the frequency locking approach now.

@Uli1234 The Xid 61 error happened to me again just now. And I’m sure that the settings were applied.

Also, I experienced a different segmentation fault error recently, like @dawdaw did.

@jameskzd28 Not good news. I would try the following now:

1.) Set persistence mode
2.) Set PowerMizerMode (Max Performance)
3.) And in addition lock the frequency to min of 1000MHz

Double layer approach. Might be worth a try…
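
The three steps above could be scripted roughly as in the following Python sketch, for example to rerun them after every boot. It only reuses commands already posted in this thread (`nvidia-smi -pm`, `nvidia-settings -a ...GpuPowerMizerMode=1`, `nvidia-smi -lgc`); the 1000,2000 clock range is the example range mentioned earlier, and the function names and the `DISPLAY=:0` default are assumptions to adjust for your setup.

```python
import os
import subprocess


def build_commands(gpu_indices, min_mhz=1000, max_mhz=2000):
    """Build the persistence-mode, PowerMizer and clock-lock commands for each GPU."""
    commands = []
    for i in gpu_indices:
        # 1.) Persistence mode
        commands.append(["nvidia-smi", "-i", str(i), "-pm", "ENABLED"])
        # 2.) PowerMizer mode 1 (maximum performance)
        commands.append(["nvidia-settings", "-a", f"[gpu:{i}]/GpuPowerMizerMode=1"])
        # 3.) Lock the GPU clocks to the given min,max range in MHz
        commands.append(["nvidia-smi", "-i", str(i), "-lgc", f"{min_mhz},{max_mhz}"])
    return commands


def apply_commands(commands, display=":0"):
    """Run the commands in order; needs root and a running X server on `display`."""
    env = {**os.environ, "DISPLAY": display}
    for cmd in commands:
        subprocess.run(cmd, check=True, env=env)
```

For a three-card machine, `apply_commands(build_commands([0, 1, 2]))` run as root would apply all three settings to every card.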

@han310 Thank you for the feedback. Could you try to set MaxPerformanceMode as well as locking the GPU frequencies?

Just confirming this: after switching back to my GTX 1060 from my RTX 2070 Super, I’ve had no slowdowns for 36 days.

Also fwiw, I think something is still glitchy with the GTX card, based on the fact that Chrome now freezes up every now and then and I have to restart it. I suspect that where the RTX card would cause the entire system to grind to a halt, the GTX is more stable and only the application in use is affected. Maybe something to do with the Turing architecture code?

@OldToby What’s your status? Did the issue occur again on your system?

Have this issue on both Windows and Linux. As others have pointed out, it only seems to happen when the GPU goes into a low power state. Doesn’t seem to happen if forced to stay in low power, also doesn’t happen if kept in a high power state. Most reliable way I can reproduce it is by doing stuff that causes the GPU to flip flop in and out of low power, but even then, it still seems very random.

For me, generally happens every 1-2 weeks or so.

Ryzen 7 3700X
MSI RTX 2070 Super Ventus GP