Random Xid 61 and Xorg lock-up

Oh wait, I skimmed through that; it's not an ASUS board.

That was only a theory so far.
So what model/brand is it? Or is this a well kept secret?

Gigabyte motherboard, though the RTX 2060 graphics card is an ASUS version.

I've been having the same issue. Ryzen 3700X, RTX 2070 Super. I'm also using dual monitors; it seems like it could be related to that. It usually happens after 5-20 days of uptime, in the middle of the night.

I've been hitting this problem several times in the past week on a system using an AMD 3900X, Asus Prime X570-Pro, Nvidia RTX 2060 Super, and two monitors. X just kind of hangs, sometimes allows for a little bit of movement, then hangs again. It happens with this in the log (Ubuntu MATE 18.04):

Dec  7 04:52:12 hostname kernel: [157522.619413] NVRM: GPU at PCI:0000:08:00: GPU-9c1e2d3f-5bf1-9e58-dbcb-9350c03802bb
Dec  7 04:52:12 hostname kernel: [157522.619417] NVRM: GPU Board Serial Number: 
Dec  7 04:52:12 hostname kernel: [157522.619422] NVRM: Xid (PCI:0000:08:00): 61, pid=1808, 0cde(308c) 00000000 00000000
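
For anyone comparing logs across machines, here is a minimal Python sketch that pulls the fields out of an Xid line like the one above. The function name `parse_xid_line` and the returned keys are made up for illustration, and the pattern only assumes the line format seen in this thread (numeric `pid=`, optional `[seconds]` kernel timestamp), so treat it as a starting point rather than a complete Xid parser.

```python
import re

# Matches kernel log lines such as:
#   ... kernel: [157522.619422] NVRM: Xid (PCI:0000:08:00): 61, pid=1808, 0cde(308c) 00000000 00000000
# The "[seconds]" timestamp is optional because some logs in this thread omit it.
XID_RE = re.compile(
    r"(?:\[\s*(?P<uptime>\d+\.\d+)\]\s+)?"
    r"NVRM: Xid \(PCI:(?P<pci>[0-9a-fA-F:.]+)\): (?P<xid>\d+), pid=(?P<pid>\d+)"
)


def parse_xid_line(line):
    """Return a dict with the Xid fields, or None if the line is not an Xid report."""
    m = XID_RE.search(line)
    if m is None:
        return None
    uptime = m.group("uptime")
    return {
        "pci": m.group("pci"),        # PCI address of the GPU
        "xid": int(m.group("xid")),   # Xid error number, e.g. 61
        "pid": int(m.group("pid")),   # process the driver attributes the error to
        # seconds since boot, if the kernel timestamp is present
        "uptime_s": float(uptime) if uptime is not None else None,
    }
```

Dividing `uptime_s` by 86400 gives the uptime in days at the time of the error; for the line above that is roughly 1.8 days.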

The system is still running, i.e. I can log in over ssh and any non-GUI tasks are still working fine. It can happen while I'm using it or overnight (with screen locked and blank screensaver). When it happens, my VMware Player Windows VM guest also hangs (its display is on the second monitor, full screen).

Amrits, we might be able to provide more information on the issue if you could describe what Xid 61 means. All I could find from the documentation is the following:

Internal micro-controller breakpoint/warning

Is this actually a hardware problem? Should everything halt when this error is thrown?

Thanks!

My issue is probably related. https://devtalk.nvidia.com/default/topic/1065106/linux/-bug-nvidia_modeset-causes-kernel-5-xorg-crash-on-rtx-2070-super-card/post/5393516/?offset=4#5410267

I've had it happen again; this time I also got an Xid 8 right after the Xid 61. See attached logs:
nvidia-bug-report.log.gz (70.6 KB)
dmesg.log.gz (23.4 KB)

I encountered this issue today.

Dec 15 11:21:44 muskrat kernel: NVRM: GPU at PCI:0000:09:00: GPU-5a2b009d-c14b-46e3-9865-3de04b4b0435
Dec 15 11:21:44 muskrat kernel: NVRM: GPU Board Serial Number:
Dec 15 11:21:44 muskrat kernel: NVRM: Xid (PCI:0000:09:00): 61, pid=2313, 0cb5(2d50) 00000000 00000000

System details:
Asus Pro WS X570-ACE
Ryzen 3900X
EVGA RTX 2070
Debian unstable
Kernel 5.2.14
Driver 430.64
Sawfish window manager, no compositor or desktop environment

When the Xid occurred I was watching the video on the Steam store page for CONTRA: ROGUE CORPS and had another browser window running Cookie Clicker (on a different workspace, so not visible).

In my case the system did not freeze entirely, but any OpenGL activity caused everything to become very sluggish. Non-OpenGL applications ran normally when nothing was using OpenGL. Glxgears ran at under 10 fps. Despite the sluggishness I was able to close my programs normally. After I shut down Xorg I got a few instances of:

nvidia-modeset: WARNING: GPU:0: Lost display notification (0:0x00000000); continuing.

I then rebooted because evidently the driver/GPU was not going to recover without it.

I've had this Ryzen system for several weeks and this is the first time I've seen any Xid errors. Besides browsing the web I've also played Final Fantasy XIV and used Unity.

The same GPU was previously installed on a system with Asus Z170 Pro Gaming and Core i7-6700k. There were some stability issues, but those were more likely caused by other components, since they also occurred with a GTX 980 earlier. I never saw any Xid messages with that configuration, with either GPU, across a wide range of driver versions.

I have been running video clips on YouTube and Spotify, along with a few benchmarks like glmark2, on the setup below for the last 5 weeks, but have not been able to replicate the issue so far.

MPG X570 GAMING EDGE WIFI (MS-7C37)
AMD Ryzen 7 3700X 8-Core Processor
Driver 430.34
GeForce RTX 2070

I have also connected 2 displays to this setup.

Can someone please try the latest driver release and see if it fixes the issue?

I updated my BIOS from F40 to F50 two weeks ago and have not had a crash since then.

Last week I updated to driver 440.44 after my last Xid 61 error. It happened again yesterday, so that's not a fix. I updated my BIOS to the latest version after the last error (Asus Prime X570-Pro, updated BIOS to 1405).

Will let you know how that goes.

If amrits, or someone else at NVIDIA, could tell us more about Xid 61, perhaps one of us could figure out how to reliably reproduce the issue. I'm flying blind and it seems to happen at random. It also seems to only happen with a lightly loaded system.

Amrits, are you running your tests continuously and in parallel? If so, you might want to change the tests to not be in parallel with idle time in-between.

I hit it again today, so updating the BIOS to the latest version didn't help. It happened while I was using my Windows VM. General information:

Dec 19 14:42:15 titan2 kernel: [168710.936696] NVRM: GPU at PCI:0000:08:00: GPU-9c1e2d3f-5bf1-9e58-dbcb-9350c03802bb
Dec 19 14:42:15 titan2 kernel: [168710.936701] NVRM: GPU Board Serial Number:
Dec 19 14:42:15 titan2 kernel: [168710.936708] NVRM: Xid (PCI:0000:08:00): 61, pid=1822, 0cec(3098) 00000000 00000000

pid 1822 is X; here's what I see in top after this happens:

1822 root 20 0 360720 99508 70092 R 99.3 0.1 52:36.83 Xorg
1851 root -51 0 0 0 0 R 97.4 0.0 6:37.58 irq/117-nvidia

Note, both of those are about 100%.
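
As a side note, the pid from the Xid line can be mapped to a process name without opening top. This is a minimal Python sketch assuming a Linux `/proc` filesystem; `process_name` and its `proc_root` parameter are hypothetical names used only for this example.

```python
from pathlib import Path


def process_name(pid, proc_root="/proc"):
    """Return the short command name for pid, or None if it cannot be read.

    On Linux, /proc/<pid>/comm holds the process name (e.g. "Xorg").
    proc_root is a parameter only so the function can be tried against a test directory.
    """
    try:
        return (Path(proc_root) / str(pid) / "comm").read_text().strip()
    except OSError:
        # Process has exited, or /proc is not available on this system.
        return None
```

On the system above, `process_name(1822)` would be expected to return the name of the X server process while it is still running.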

Once this happens, I can't read information from the video card:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.44       Driver Version: 440.44       CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce RTX 206...  Off  | 00000000:08:00.0  On |                  N/A |
|ERR!   38C    P5   ERR! / 215W |    326MiB /  7979MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      1822      G   /usr/lib/xorg/Xorg                           259MiB |
+-----------------------------------------------------------------------------+

Note the ERR! above. I'll attach two bug reports: the first time it happened and today's.

The only other thing I can think of trying is changing the power setting from auto to prefer maximum performance, which I've done today after the last crash. Does anyone else have any other ideas?

Amrits, any other information I can gather after this happens?

Thanks,
John
nvidia-bug-report-Dec19.log.gz (477 KB)
nvidia-bug-report-Dec4.log.gz (635 KB)

Hi jadams0n2u3,

An Xid 61 error usually occurs due to an internal micro-controller breakpoint/warning.

I have been running applications and sometimes keeping the system idle, but still have not hit the issue.

MPG X570 GAMING EDGE WIFI (MS-7C37)
AMD Ryzen 7 3700X 8-Core Processor
Driver 430.34
GeForce RTX 2070

amrits, please check with an ASUS board. MSI doesn't seem to be affected, and Gigabyte has fixed this and other things with a BIOS update.

I have the same Xid 61 issue on an Asus X570 TUF Gaming and Ryzen 3700X. My RTX 2070 card worked fine for 6 months on my old FX 8350 system, so I know it's good. After trying out different kernels and BIOS revisions I came to the same conclusion as @jadams0n2u3: this looks like a power-saving edge case on X570 platforms, for whatever reason.

nvidia-smi shows power usage of ~14-17W when the system is idle or running low-performance tasks. The error usually happens when watching YouTube in a web browser. I suspect the card might not be boosting its power just in time, so the error occurs (I have a 4K display, if that matters in this case) and the system slows down to a halt. After setting Prefer Maximum Performance in the Nvidia control panel, the card still sometimes dropped into power-saving states, so I put

nvidia-settings -a [gpu:0]/GPUPowerMizerMode=1

in .xsessionrc, and now the card always stays at the max performance level and nvidia-smi shows power usage of ~45W when idle or running low-performance tasks. The error hasn't occurred since, but I haven't tested long enough to be confident this workaround always solves it. Still, it's worth a try for people hitting this problem.
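
If someone prefers to apply the same setting from a script instead of .xsessionrc, a minimal Python sketch could look like the following. The function names are hypothetical; the only thing taken from the post above is the `nvidia-settings -a [gpu:0]/GPUPowerMizerMode=1` command itself. Passing the arguments as a list avoids the shell interpreting the square brackets.

```python
import subprocess


def powermizer_assign_argv(gpu_index=0, mode=1):
    """Build the argument list for the nvidia-settings call from the post above.

    mode=1 is the value used in this thread to keep the card at its
    maximum performance level; other values are not covered here.
    """
    return ["nvidia-settings", "-a", f"[gpu:{gpu_index}]/GPUPowerMizerMode={mode}"]


def apply_powermizer(gpu_index=0, mode=1):
    """Run the command and return its exit code.

    Assumes nvidia-settings is on PATH and an X session is running;
    this part is untested on the affected systems.
    """
    return subprocess.run(powermizer_assign_argv(gpu_index, mode), check=False).returncode
```

Only `powermizer_assign_argv` can be checked without the hardware; `apply_powermizer` needs to be run inside the X session it should affect.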

I just had this occur again. Configuration same as in my previous post, except I had updated the Nvidia driver to version 440.36.

Dec 31 14:32:59 muskrat kernel: NVRM: GPU at PCI:0000:09:00: GPU-5a2b009d-c14b-46e3-9865-3de04b4b0435
Dec 31 14:32:59 muskrat kernel: NVRM: GPU Board Serial Number:
Dec 31 14:32:59 muskrat kernel: NVRM: Xid (PCI:0000:09:00): 61, pid=2322, 0cec(3098) 00000000 00000000
Dec 31 14:36:40 muskrat kernel: nvidia-modeset: WARNING: GPU:0: Lost display notification (0:0x00000000); continuing.
Dec 31 14:36:48 muskrat kernel: nvidia-modeset: WARNING: GPU:0: Lost display notification (0:0x00000000); continuing.

The PID refers to the Xorg server process.

I had finished playing Final Fantasy XIV (with Wine) a few minutes earlier and was about to browse its forums with Opera when it became sluggish. However, this was not the first time I had done either of those things since the previous reboot.

I have updated to Nvidia driver 440.44 and ASUS BIOS 1201 now.

This issue seems random enough that a reliable repro is unlikely. Is there any additional data we could collect to help in resolving this?

Doing more testing, I'm getting Xid 61 and sometimes Xid 62 errors while playing Quake II RTX. The errors happen quite randomly, sometimes after 5 minutes of play time and sometimes after an hour. The system hangs with a garbled play screen.

ASUS TUF GAMING X570-PLUS (WI-FI) BIOS 1405
AMD Ryzen 7 3700X 8-Core Processor
Driver 440.44
GeForce RTX 2070
Debian Bullseye

Edit: this may be related to thermals. I manually cranked up the fan to higher levels and so far it hasn't occurred again. Testing further.

Had this repro for me again. This time it took 22 days of uptime. Currently running Arch Linux on kernel 5.4.2 with the nvidia 440.44 driver. For machine specs see this thread: https://devtalk.nvidia.com/default/topic/1065106/linux/-bug-nvidia_modeset-causes-kernel-5-xorg-crash-on-rtx-2070-super-card/post/5393516/?offset=4#5410267.

And there it is again. This time it seems to have been triggered by opening a Trello link with Opera.

amrits, would it be useful for you to get remote access to a system exhibiting the symptoms, or do you need physical access to the GPU or specialized hardware?

Another observation: I get problems with DisplayPort monitors if I do a warm reboot after such an Xid error. A cold reboot fixes things. See here for details: https://devtalk.nvidia.com/default/topic/1069157/linux/lost-output-to-one-dp-monitor-after-mainboard-bios-update/