Random Xid 61 and Xorg lock-up

It happened again; this time I also got an Xid 8 right after the Xid 61. See the attached logs.
nvidia-bug-report.log.gz (70.6 KB)
dmesg.log.gz (23.4 KB)

I encountered this issue today.

Dec 15 11:21:44 muskrat kernel: NVRM: GPU at PCI:0000:09:00: GPU-5a2b009d-c14b-46e3-9865-3de04b4b0435
Dec 15 11:21:44 muskrat kernel: NVRM: GPU Board Serial Number:
Dec 15 11:21:44 muskrat kernel: NVRM: Xid (PCI:0000:09:00): 61, pid=2313, 0cb5(2d50) 00000000 00000000

System details:
Asus Pro WS X570-ACE
Ryzen 3900X
EVGA RTX 2070
Debian unstable
Kernel 5.2.14
Driver 430.64
Sawfish window manager, no compositor or desktop environment

When the Xid occurred I was watching the video on the Steam store page for CONTRA: ROGUE CORPS and had another browser window running Cookie Clicker (on a different workspace, so not visible).

In my case the system did not freeze entirely, but any OpenGL activity caused everything to become very sluggish. Non-OpenGL applications ran normally when nothing was using OpenGL. Glxgears ran at under 10 fps. Despite the sluggishness I was able to close my programs normally. After I shut down Xorg I got a few instances of:

nvidia-modeset: WARNING: GPU:0: Lost display notification (0:0x00000000); continuing.

I then rebooted because evidently the driver/GPU was not going to recover without one.

I’ve had this Ryzen system for several weeks and this is the first time I’ve seen any Xid errors. Besides browsing the web I’ve also played Final Fantasy XIV and used Unity.

The same GPU was previously installed on a system with Asus Z170 Pro Gaming and Core i7-6700k. There were some stability issues, but those were more likely caused by other components, since they also occurred with a GTX 980 earlier. I never saw any Xid messages with that configuration, with either GPU, across a wide range of driver versions.

I have been running video clips on YouTube and Spotify, along with a few benchmarks like glmark2, on the setup below for the last 5 weeks, but I have not been able to replicate the issue so far.

MPG X570 GAMING EDGE WIFI (MS-7C37)
AMD Ryzen 7 3700X 8-Core Processor
Driver 430.34
GeForce RTX 2070

I have also connected 2 displays to this setup.

Can someone please try the latest driver release and see if it fixes the issue?

I updated my BIOS from F40 to F50 two weeks ago and have not had a crash since then.

Last week I updated to driver 440.44 after my last Xid 61 error. It happened again yesterday, so that’s not a fix. I updated my BIOS to the latest version after the last error (Asus Prime X570-Pro, BIOS updated to 1405).

Will let you know how that goes.

If amrits, or someone else at NVIDIA, could tell us more about Xid 61, perhaps one of us could figure out how to reliably reproduce the issue. I’m flying blind and it seems to happen at random. It also seems to happen only on a lightly loaded system.

Amrits, are you running your tests continuously and in parallel? If so, you might want to change the tests to run one at a time, with idle time in between.

I hit it again today, so updating the BIOS to the latest version didn’t help. It happened while I was using my Windows VM. General information:

Dec 19 14:42:15 titan2 kernel: [168710.936696] NVRM: GPU at PCI:0000:08:00: GPU-9c1e2d3f-5bf1-9e58-dbcb-9350c03802bb
Dec 19 14:42:15 titan2 kernel: [168710.936701] NVRM: GPU Board Serial Number:
Dec 19 14:42:15 titan2 kernel: [168710.936708] NVRM: Xid (PCI:0000:08:00): 61, pid=1822, 0cec(3098) 00000000 00000000

pid 1822 is X, here’s what I see in top after this happens:

1822 root 20 0 360720 99508 70092 R 99.3 0.1 52:36.83 Xorg
1851 root -51 0 0 0 0 R 97.4 0.0 6:37.58 irq/117-nvidia

Note that both of those are at about 100% CPU.

Once this happens, I can’t read information from the video card:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.44       Driver Version: 440.44       CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce RTX 206...  Off  | 00000000:08:00.0  On |                  N/A |
|ERR!   38C    P5   ERR! / 215W |    326MiB /  7979MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      1822      G   /usr/lib/xorg/Xorg                           259MiB |
+-----------------------------------------------------------------------------+

Note the ERR! values above. I’ll attach two bug reports: one from the first time it happened and one from today.

The only other thing I can think of trying is changing the power setting from auto to prefer maximum performance, which I did today after the last crash. Does anyone else have any other ideas?

Amrits, any other information I can gather after this happens?

Thanks,
John
nvidia-bug-report-Dec19.log.gz (477 KB)
nvidia-bug-report-Dec4.log.gz (635 KB)

Hi jadams0n2u3,

An Xid 61 error usually occurs due to an internal micro-controller breakpoint/warning.

I have been running applications and at times keeping the system idle, but I still have not hit the issue.

MPG X570 GAMING EDGE WIFI (MS-7C37)
AMD Ryzen 7 3700X 8-Core Processor
Driver 430.34
GeForce RTX 2070

amrits, please check with an ASUS board. MSI doesn’t seem to be affected, and Gigabyte has fixed this and other things with a BIOS update.

I have the same Xid 61 issue on an Asus X570 TUF Gaming with a Ryzen 3700X. My RTX 2070 worked fine for 6 months in my old FX 8350 system, so I know it’s good. After trying out different kernels and BIOS revisions, I came to the same conclusion as @jadams0n2u3: this looks like a power-saving edge case on X570 platforms, for whatever reason.

nvidia-smi shows power usage of ~14-17W when the system is idle or running light tasks. The error usually happens when watching YouTube in a web browser. I suspect the card might not be boosting its power state in time when the error occurs (I have a 4K display, if that matters in this case), and the system slows to a halt. After setting Prefer Maximum Performance in the NVIDIA control panel, the card still sometimes dropped into power-saving states, so I put

nvidia-settings -a "[gpu:0]/GPUPowerMizerMode=1"

in .xsessionrc, and now the card always stays at the maximum performance level and nvidia-smi shows power usage of ~45W when idle or running light tasks. The error hasn’t occurred since, but I haven’t tested long enough to be confident that this workaround always solves it. Still, it is worth a try for people hitting this problem.
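To check whether the card really stays out of its power-saving states after applying this setting, the performance state and power draw can be sampled from nvidia-smi. The following is only a minimal sketch: the function names are made up for illustration, it assumes that P0 is the state the card should stay in at maximum performance, and it assumes that a failed read shows up as a non-numeric power value (the exact text nvidia-smi prints in that case is not confirmed here).

```python
import subprocess

# pstate and power.draw are standard nvidia-smi query fields
# (see `nvidia-smi --help-query-gpu`).
QUERY = [
    "nvidia-smi",
    "--query-gpu=pstate,power.draw",
    "--format=csv,noheader,nounits",
]


def parse_sample(line):
    """Parse one CSV line such as 'P8, 14.52' into (pstate, watts or None)."""
    pstate, _, power = line.partition(",")
    try:
        watts = float(power.strip())
    except ValueError:
        # Assumption: an unreadable sensor yields a non-numeric value.
        watts = None
    return pstate.strip(), watts


def is_power_saving(pstate):
    """Assumption: anything other than P0 counts as a reduced state."""
    return pstate != "P0"


def read_samples():
    """Run nvidia-smi once and return one (pstate, watts) tuple per GPU."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return [parse_sample(l) for l in result.stdout.splitlines() if l.strip()]


if __name__ == "__main__":
    for index, (pstate, watts) in enumerate(read_samples()):
        note = "power saving" if is_power_saving(pstate) else "max performance"
        print(f"GPU {index}: {pstate} ({note}), power draw: {watts}")
```

Running this periodically (for example from cron or a shell loop) while the system is idle would show whether the card still drops to a lower state despite the setting.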

I just had this occur again. Configuration same as in my previous post, except I had updated the Nvidia driver to version 440.36.

Dec 31 14:32:59 muskrat kernel: NVRM: GPU at PCI:0000:09:00: GPU-5a2b009d-c14b-46e3-9865-3de04b4b0435
Dec 31 14:32:59 muskrat kernel: NVRM: GPU Board Serial Number:
Dec 31 14:32:59 muskrat kernel: NVRM: Xid (PCI:0000:09:00): 61, pid=2322, 0cec(3098) 00000000 00000000
Dec 31 14:36:40 muskrat kernel: nvidia-modeset: WARNING: GPU:0: Lost display notification (0:0x00000000); continuing.
Dec 31 14:36:48 muskrat kernel: nvidia-modeset: WARNING: GPU:0: Lost display notification (0:0x00000000); continuing.

The PID refers to the Xorg server process.
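For anyone comparing events across machines, the Xid number and the PID can be pulled out of the kernel log mechanically. This is a rough sketch based only on the line format quoted in this thread; the function name and the regular expression are illustrative assumptions, not anything provided by the driver.

```python
import re
import sys

# Matches kernel log lines such as:
#   NVRM: Xid (PCI:0000:09:00): 61, pid=2313, 0cb5(2d50) 00000000 00000000
# The pid field is treated as optional in case a line does not carry one.
XID_RE = re.compile(
    r"NVRM: Xid \((?P<bus>PCI:[0-9a-fA-F:.]+)\): (?P<xid>\d+)"
    r"(?:, pid=(?P<pid>\d+))?"
)


def parse_xid_events(log_text):
    """Return a list of (bus, xid, pid) tuples found in kernel log text."""
    events = []
    for line in log_text.splitlines():
        match = XID_RE.search(line)
        if match:
            pid = int(match.group("pid")) if match.group("pid") else None
            events.append((match.group("bus"), int(match.group("xid")), pid))
    return events


if __name__ == "__main__":
    # Reads log text from standard input and prints one line per Xid event.
    for bus, xid, pid in parse_xid_events(sys.stdin.read()):
        print(f"{bus} Xid {xid} pid={pid}")
```

Saved under a name of your choice (xid_events.py here is just a placeholder), it could be fed from the kernel log, for example with `journalctl -k | python3 xid_events.py`, and the reported PID can then be checked against the process list.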

I had finished playing Final Fantasy XIV (with Wine) a few minutes earlier and was about to browse its forums with Opera when it became sluggish. However this was not the first time I did either of those things since the previous reboots.

I have updated to Nvidia driver 440.44 and ASUS BIOS 1201 now.

This issue seems random enough that a reliable repro is unlikely. Is there any additional data we could collect to help in resolving this?

Doing more testing, I’m getting Xid 61 and sometimes Xid 62 errors while playing Quake II RTX. The errors happen quite randomly, sometimes after 5 minutes of play time and sometimes after an hour. The system hangs with a garbled play screen.

ASUS TUF GAMING X570-PLUS (WI-FI) BIOS 1405
AMD Ryzen 7 3700X 8-Core Processor
Driver 440.44
GeForce RTX 2070
Debian Bullseye

Edit: this may be related to thermals. I manually cranked the fan up to a higher level and so far it hasn’t occurred again. Testing further.

This reproduced for me again. This time it took 22 days of uptime. I am currently running Arch Linux on kernel 5.4.2 with the NVIDIA 440.44 driver. For machine specs see this thread: https://devtalk.nvidia.com/default/topic/1065106/linux/-bug-nvidia_modeset-causes-kernel-5-xorg-crash-on-rtx-2070-super-card/post/5393516/?offset=4#5410267.

And there it is again. This time it seems to have been triggered by opening a Trello link with Opera.

amrits, would it be useful for you to get remote access to a system exhibiting the symptoms, or do you need physical access to the GPU or specialized hardware?

Another observation: I get problems with DisplayPort monitors if I do a warm reboot after such an Xid error. A cold reboot fixes things. See here for details: https://devtalk.nvidia.com/default/topic/1069157/linux/lost-output-to-one-dp-monitor-after-mainboard-bios-update/

Hi tdb,

It would be great to have remote access to your system, as I am still not able to replicate the issue locally.

All right, I will set it up and let you know the next time it happens. Is a private message here a good way to communicate the details or would you prefer email?

Yes, we can communicate via private message.

FYI, I had another repro of this, this time after only 4 days. Not sure if it helps, but Xorg and Chrome go spinning at 100% CPU when this occurs. If I restart the lightdm service, I get warnings in the Linux journal saying that nvidia-modeset lost display notifications for GPU:0.

Hi All,

Sysadmin at a small 3D animation studio here. We recently purchased 30 new workstations with these specs.

System details:
Asus Prime X570-P
Ryzen 3900X 3.8 GHz
Asus GeForce RTX 2070 Super
64GB DDR4 (4x16)
CentOS 7.5.1804
Kernel 5.4.0
Nvidia Driver 430.26
LightDM + Mate Desktop

The issues we’ve been getting are exactly the same as the ones mentioned in this post (station freezes / Xid 61 lockup). The issue happens randomly. There doesn’t seem to be a correlation between heavy GPU/CPU usage and the issue occurring. Often our users’ stations would freeze while they were doing basic email or browsing in Chrome. Do note that we have one of those workstations set up as a headless server (runlevel 3) and it hasn’t frozen at all. Since it’s not interacting much with the GPU, I guess it’s just not triggering a freeze. We can only reproduce the issue on stations in runlevel 5 with user interaction.

The troubleshooting steps we took:

  • We tried playing with the BIOS settings and activated/deactivated pretty much every possible feature of the ASUS board - no luck. We also played heavily with the board’s PCIe settings - no luck there.
  • We tried updating the BIOS to the latest version - issue persists.
  • We were initially running CentOS 7.5.xxx with the standard kernel 3.10.xxx. We tried kernel 4.xxx, issue persists; tried kernel 5.10.xx, issue persists.
  • We were initially running NVIDIA driver version 410.xxx (as reported by nvidia-smi) and worked our way up incrementally to 430.26. The issue was reproduced regardless of driver version.
  • We tried different GPUs: RTX 2070 Super, RTX 2070, Quadro RTX 4000 - they all froze (all Turing architecture).
  • Once we tried older cards (GTX 1080, GTX 1060, GTX 980, GTX 780; Pascal, Maxwell and Kepler architectures), we stopped getting the issue. Everything ran smoothly and the freezing was gone.
  • We ordered previous-generation boards (MSI B450-Pro), updated the BIOS to make them compatible with the new Ryzens, and put the 2070 Supers back in. All issues resolved. We haven’t had a freeze in 2 weeks on those 2 B450 stations (regardless of kernel version; with a B450 it works fine even on kernel 3.10.xxx).

With this kind of isolation done, it’s pretty safe to say that the X570 boards used with Turing cards are the cause of the issue. Since we have the budget, we went ahead and replaced all our X570 boards with MSI B450-Pros.

The uncertainties:

  • Not sure whether changing to a brand other than ASUS (MSI, Gigabyte) while staying with the X570 generation would have worked.
  • Not sure whether ASUS 450 or 470 boards are also affected.

It may not be an option for some of you, as changing boards can be costly, but at least know that it is a working solution. We’re happy to share more or answer questions if you have any.

sysadminfm9rx: I’m experiencing this issue on a ROG Strix B450-F board with a Ryzen 3900X and an RTX 2080 Ti.