Ubuntu 20.04 randomly freezes screen with Xid 62 last message in syslog

Hey!

Been dealing with an issue on Ubuntu 20.04 since I upgraded a few weeks ago. There are a few ways to trigger the issue, but the easiest seems to be opening Steam and starting up a game (picked up Shadow of War recently, was enjoying it a ton). When the game starts up, the screen freezes and the system halts. In the frozen state, switching to a tty does not work and nothing on the screen responds. In the end, the only way to recover is to manually power the machine off and turn it back on.

While Steam seems to be the most consistent way to trigger the issue, it has also come up while just browsing the internet (Chrome), using communications apps for work (Slack) or listening to music. The issue is much less frequent in these cases than it is when opening Steam and playing a game.

For context, I’ve listed my hardware below. Let me know if I’m missing anything else that would help.

  • Ryzen 2950X CPU
  • Asrock X399 Taichi motherboard
  • GeForce RTX 2080 Ti GAMING OC 11G GPU
  • Two monitors connected via the GPU’s DisplayPort outputs

I have tested multiple versions of Ubuntu (19.10, 20.04) running Linux kernel 5.4.0-31-generic (from Ubuntu) and multiple versions of the NVIDIA driver (435.21, 440.64). All show the same issue.

When I dug into it a bit more, I noticed the following line was roughly the last one in the syslog before my computer would crash.

May 25 11:52:29 lennox-desktop kernel: [  472.904088] NVRM: GPU at PCI:0000:42:00: GPU-1a8a747a-5b8c-8457-f4f8-df755aeba5a9
May 25 11:52:29 lennox-desktop kernel: [  472.904093] NVRM: GPU Board Serial Number: 
May 25 11:52:29 lennox-desktop kernel: [  472.904097] NVRM: Xid (PCI:0000:42:00): 62, pid=1430, 203c(3090) 00000000 00000000
May 25 11:52:33 lennox-desktop kernel: [  477.402355] NVRM: Xid (PCI:0000:42:00): 31, pid=1430, Ch 00000020, intr 00000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_RAST faulted @ 0x0_00000000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
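
For anyone who wants to pull these Xid events out of a longer log, here is a minimal Python sketch. It assumes the line format shown above; the names find_xid_events and scan_file are made up for illustration, and /var/log/syslog is only the usual Ubuntu default, so adjust the path if your logs live elsewhere.

```python
import re

# Matches NVRM Xid lines in the format shown above, e.g.
#   NVRM: Xid (PCI:0000:42:00): 62, pid=1430, 203c(3090) 00000000 00000000
XID_RE = re.compile(
    r"NVRM: Xid \((?P<pci>PCI:[0-9a-fA-F:.]+)\): (?P<xid>\d+), (?P<detail>.*)"
)


def find_xid_events(lines):
    """Return a list of (pci_address, xid_number, detail) for each Xid line."""
    events = []
    for line in lines:
        match = XID_RE.search(line)
        if match:
            events.append(
                (match.group("pci"), int(match.group("xid")), match.group("detail").strip())
            )
    return events


def scan_file(path="/var/log/syslog"):
    """Print every Xid event found in the given log file (path is an assumption)."""
    with open(path, errors="replace") as fh:
        for pci, xid, detail in find_xid_events(fh):
            print(f"{pci} Xid {xid}: {detail}")
```

Calling scan_file() after a reboot (reading syslog may need sudo) lists which Xid codes were logged and in what order, which makes it easier to see whether 62 always comes first.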

While the ticket “Random Xid 61 and Xorg lock-up” seemed similar, the fact that the Xid error was 62 and not 61 led me to create a new ticket here.

Based on a quick read of the docs at the link below, the Xid 62 error seems to point to either a hardware fault, a driver fault, or a thermal issue. I don’t think it’s thermal: nvidia-smi reads around 45 degrees Celsius right up until the screen freezes and the system halts. In case it’s a hardware issue, I’m also going to reach out to Gigabyte to see what help I can get from them. To make sure I’m covering all my bases though, is there anything I can do to confirm whether this is a driver issue or a hardware issue?
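
On the thermal point, watching nvidia-smi by eye could in principle miss a short spike just before the freeze. A rough sketch of logging the temperature to disk instead is below. It relies on nvidia-smi's --query-gpu and --format=csv,noheader,nounits options; the helper names, the gpu_temp.log file name and the one-second interval are placeholders of mine, not anything from the driver docs.

```python
import os
import subprocess
import time

# Ask nvidia-smi for a timestamp and the GPU core temperature as bare CSV.
QUERY = [
    "nvidia-smi",
    "--query-gpu=timestamp,temperature.gpu",
    "--format=csv,noheader,nounits",
]


def parse_sample(line):
    """Split one CSV line such as '2020/05/25 11:52:29.123, 45'."""
    timestamp, temp = [field.strip() for field in line.split(",")]
    return timestamp, int(temp)


def log_temperature(path="gpu_temp.log", interval_s=1.0):
    """Append one reading per GPU per interval until interrupted (Ctrl+C)."""
    with open(path, "a") as fh:
        while True:
            out = subprocess.run(
                QUERY, capture_output=True, text=True, check=True
            ).stdout
            for line in out.splitlines():
                if line.strip():
                    timestamp, temp = parse_sample(line)
                    fh.write(f"{timestamp} {temp}C\n")
            # Force the data to disk so the last reading survives a hard freeze.
            fh.flush()
            os.fsync(fh.fileno())
            time.sleep(interval_s)
```

Start log_temperature() in a terminal before launching the game. After the forced reboot, the last lines of the log show the temperature in the seconds leading up to the hang.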

Appreciate any and all help :)

nvidia-bug-report.log (1.6 MB)

Might be defective video memory, please use cuda-memtest to check
https://sourceforge.net/projects/cudagpumemtest/

Addendum: you can also use gpu-burn to check for general hw defects.