Frequent display lockups and Xid messages on Precision 5550 with Arch Linux, Quadro T2000, 465.27, CUDA 11.3

I have a brand new Dell Precision 5550 from the office that I installed Arch Linux on this week. It has a Quadro T2000 Mobile and runs kernel 5.12.1, driver 465.27 and CUDA 11.3, which are the latest versions available in the official Arch repos. The system has an Intel GPU as well, and I’m using xrandr to make the NVIDIA GPU an output provider, i.e.:

xrandr --setprovideroutputsource modesetting NVIDIA-0
xrandr --auto

While working with the system I keep getting intermittent display lockups and/or screen corruption (the system can still be accessed over SSH), usually while an application is stressing the NVIDIA GPU, but sometimes also while working in a terminal without any GPU load. I see several different Xid errors in the logs and even got an odd CUDA error in Blender during rendering (something about an illegal instruction). Out of the 30 boots so far, 7 showed these issues:

paulm@l0420007 10:34:~$ sudo journalctl |grep -E '(Xid)|(-- Boot)'
-- Boot e385e1e1750c48d89828b099c29388d5 --
-- Boot e2305727e2434ce89313ed9363d36062 --
-- Boot 9a0cce15079843deb0ebe0a2e6cb7c7e --
-- Boot 2ac3065fd80b479f8d9fb07627171554 --
-- Boot 7d4251a180b34944ac5f32b87a06e174 --
-- Boot 6beabe619ba84cd9a5ad8fe441818b1a --
-- Boot 54d78ae04b9b45ba94f1ce66b6e3402f --
-- Boot 31895135b50c4a169924676dc09dcaa2 --
-- Boot 00d44a58215c42658276417fb12f35e2 --
-- Boot e66f9cded1f249a6a78fd4a1701be21c --
-- Boot e2cbcd5978f74d40a7353e70f0ba4ec0 --
-- Boot 3e8142ab4751489587d6c17147e82d31 --
-- Boot bd9e23ebec6a45eeb57847614082c241 --
-- Boot aac370ae48354f74b0a05d4585968035 --
May 06 13:54:17 l0420007 kernel: NVRM: Xid (PCI:0000:01:00): 62, pid=871, 21a5(31c4) 00000000 00000000
May 06 13:54:18 l0420007 kernel: NVRM: Xid (PCI:0000:01:00): 31, pid=102287, Ch 00000018, intr 00000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_0 faulted @ 0x1_16c38000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
May 06 13:54:18 l0420007 kernel: NVRM: Xid (PCI:0000:01:00): 31, pid=818, Ch 0000000a, intr 00000000. MMU Fault: ENGINE HOST9 HUBCLIENT_HOST_CPU faulted @ 0x1_0009f000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
May 06 13:54:18 l0420007 kernel: NVRM: Xid (PCI:0000:01:00): 44, pid=818, Ch 00000008, intr 00000000
May 06 13:54:18 l0420007 kernel: NVRM: Xid (PCI:0000:01:00): 31, pid=818, Ch 00000008, intr 00000000. MMU Fault: ENGINE GRAPHICS HUBCLIENT_FECS faulted @ 0x1_001c4000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
May 06 13:54:18 l0420007 kernel: NVRM: Xid (PCI:0000:01:00): 31, pid=818, Ch 00000000, intr 00000000. MMU Fault: ENGINE HOST9 HUBCLIENT_HOST faulted @ 0x1_00063000. Fault is of type FAULT_PTE ACCESS_TYPE_VIRT_READ
May 06 13:54:18 l0420007 kernel: NVRM: Xid (PCI:0000:01:00): 31, pid=818, Ch 0000000b, intr 00000000. MMU Fault: ENGINE HOST0 HUBCLIENT_HOST faulted @ 0x1_00011000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
May 06 13:54:23 l0420007 kernel: NVRM: Xid (PCI:0000:01:00): 31, pid=818, Ch 00000001, intr 00000000. MMU Fault: ENGINE HOST0 HUBCLIENT_HOST faulted @ 0x1_00000000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
May 06 13:54:23 l0420007 kernel: NVRM: Xid (PCI:0000:01:00): 44, pid=102373, Ch 00000008, intr 00000000
-- Boot 6f933bee005049728b37bc3c94d1fa74 --
May 06 14:00:35 l0420007 kernel: NVRM: Xid (PCI:0000:01:00): 62, pid=614, 21a5(31c4) 00000000 00000000
May 06 14:04:09 l0420007 kernel: NVRM: Xid (PCI:0000:01:00): 44, pid=606, Ch 00000008, intr 00000000
May 06 14:04:09 l0420007 kernel: NVRM: Xid (PCI:0000:01:00): 31, pid=606, Ch 00000008, intr 00000000. MMU Fault: ENGINE GRAPHICS HUBCLIENT_FECS faulted @ 0x1_0015e000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_WRITE
May 06 14:04:09 l0420007 kernel: NVRM: Xid (PCI:0000:01:00): 31, pid=606, Ch 0000000b, intr 00000000. MMU Fault: ENGINE HOST0 HUBCLIENT_HOST faulted @ 0x1_00011000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
May 06 14:04:11 l0420007 kernel: NVRM: Xid (PCI:0000:01:00): 31, pid=606, Ch 00000001, intr 00000000. MMU Fault: ENGINE HOST0 HUBCLIENT_HOST faulted @ 0x1_00000000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
-- Boot a5e27fff50de46f7b833f19f3e082afe --
-- Boot 93bbab8c03be43479e5adb4a750a56c0 --
-- Boot 49032a15aae349bb9a28fff18d7733b9 --
-- Boot 4cac6cc215664d909867f5f156b7262c --
-- Boot 8025e762229540e3b8eeec710f7bacc0 --
-- Boot 554d83e882cc461da2ab7e419a92d110 --
May 06 14:30:55 l0420007 kernel: NVRM: Xid (PCI:0000:01:00): 13, pid=634, Graphics SM Warp Exception on (GPC 1, TPC 1, SM 0): Illegal Instruction Encoding
May 06 14:30:55 l0420007 kernel: NVRM: Xid (PCI:0000:01:00): 13, pid=634, Graphics Exception: ESR 0x50cf30=0x50009 0x50cf34=0x20 0x50cf28=0x4c1eb72 0x50cf2c=0x174
May 06 14:30:55 l0420007 kernel: NVRM: Xid (PCI:0000:01:00): 43, pid=1215, Ch 00000013
May 06 14:32:45 l0420007 kernel: NVRM: Xid (PCI:0000:01:00): 13, pid=1625, Graphics SM Warp Exception on (GPC 0, TPC 0, SM 0): Illegal Instruction Encoding
May 06 14:32:45 l0420007 kernel: NVRM: Xid (PCI:0000:01:00): 13, pid=1625, Graphics Exception: ESR 0x504730=0x40009 0x504734=0x20 0x504728=0x4c1eb72 0x50472c=0x174
May 06 14:32:45 l0420007 kernel: NVRM: Xid (PCI:0000:01:00): 13, pid=1625, Graphics SM Warp Exception on (GPC 1, TPC 0, SM 0): Illegal Instruction Encoding
May 06 14:32:45 l0420007 kernel: NVRM: Xid (PCI:0000:01:00): 13, pid=1625, Graphics Exception: ESR 0x50c730=0xc0009 0x50c734=0x20 0x50c728=0x4c1eb72 0x50c72c=0x174
May 06 14:32:45 l0420007 kernel: NVRM: Xid (PCI:0000:01:00): 13, pid=1625, Graphics SM Warp Exception on (GPC 1, TPC 1, SM 1): Illegal Instruction Encoding
May 06 14:32:45 l0420007 kernel: NVRM: Xid (PCI:0000:01:00): 13, pid=1625, Graphics Exception: ESR 0x50cfb0=0x190009 0x50cfb4=0x20 0x50cfa8=0x4c1eb72 0x50cfac=0x174
May 06 14:32:45 l0420007 kernel: NVRM: Xid (PCI:0000:01:00): 43, pid=1625, Ch 00000013
May 06 14:32:45 l0420007 kernel: NVRM: Xid (PCI:0000:01:00): 61, pid=1625, 0d20(31cc) 00000000 00000000
May 06 14:32:45 l0420007 kernel: NVRM: Xid (PCI:0000:01:00): 45, pid=1625, Ch 00000013
May 06 14:32:45 l0420007 kernel: NVRM: Xid (PCI:0000:01:00): 45, pid=1625, Ch 00000014
May 06 14:32:45 l0420007 kernel: NVRM: Xid (PCI:0000:01:00): 45, pid=1625, Ch 00000015
May 06 14:32:45 l0420007 kernel: NVRM: Xid (PCI:0000:01:00): 45, pid=1625, Ch 00000016
May 06 14:32:45 l0420007 kernel: NVRM: Xid (PCI:0000:01:00): 45, pid=1625, Ch 00000017
May 06 14:32:45 l0420007 kernel: NVRM: Xid (PCI:0000:01:00): 45, pid=1625, Ch 00000018
May 06 14:32:45 l0420007 kernel: NVRM: Xid (PCI:0000:01:00): 45, pid=1625, Ch 00000019
May 06 14:32:45 l0420007 kernel: NVRM: Xid (PCI:0000:01:00): 45, pid=1625, Ch 0000001a
May 06 14:33:42 l0420007 kernel: NVRM: Xid (PCI:0000:01:00): 44, pid=622, Ch 00000008, intr 00000000
May 06 14:33:42 l0420007 kernel: NVRM: Xid (PCI:0000:01:00): 31, pid=622, Ch 00000008, intr 00000000. MMU Fault: ENGINE GRAPHICS HUBCLIENT_FECS faulted @ 0x1_0015e000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_WRITE
May 06 14:33:42 l0420007 kernel: NVRM: Xid (PCI:0000:01:00): 31, pid=622, Ch 00000008, intr 00000000. MMU Fault: ENGINE HOST0 HUBCLIENT_HOST faulted @ 0x1_00011000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
May 06 14:33:45 l0420007 kernel: NVRM: Xid (PCI:0000:01:00): 31, pid=622, Ch 00000001, intr 00000000. MMU Fault: ENGINE HOST0 HUBCLIENT_HOST faulted @ 0x1_00000000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
May 06 14:33:46 l0420007 kernel: NVRM: Xid (PCI:0000:01:00): 31, pid=1747, Ch 00000009, intr 00000000. MMU Fault: ENGINE HOST0 HUBCLIENT_HOST faulted @ 0x1_003ef000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
May 06 14:33:46 l0420007 kernel: NVRM: Xid (PCI:0000:01:00): 44, pid=1747, Ch 00000008, intr 00000000
May 06 14:33:48 l0420007 kernel: NVRM: Xid (PCI:0000:01:00): 31, pid=622, Ch 00000000, intr 00000000. MMU Fault: ENGINE HOST9 HUBCLIENT_HOST faulted @ 0x1_00061000. Fault is of type FAULT_PTE ACCESS_TYPE_VIRT_READ
-- Boot 8814d2874d914a81bcab8d863af73233 --
-- Boot 79232354a059409e8a760c5298011cbf --
May 06 14:56:53 l0420007 kernel: NVRM: Xid (PCI:0000:01:00): 62, pid=632, 21a5(31c4) 00000000 00000000
-- Boot 92f9188087df477f90a8b330cf480d52 --
May 06 15:08:13 l0420007 kernel: NVRM: Xid (PCI:0000:01:00): 62, pid=616, 21a5(31c4) 00000000 00000000
May 06 15:08:27 l0420007 kernel: NVRM: Xid (PCI:0000:01:00): 8, pid=605, Channel 00000008
May 06 15:10:08 l0420007 kernel: NVRM: Xid (PCI:0000:01:00): 31, pid=605, Ch 0000000b, intr 00000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_2 faulted @ 0x0_00000000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
-- Boot 17545bfd84c94cef87be15cb2313d70f --
-- Boot 4e148e9f5d474d9dbb5f75bf0074596f --
May 06 16:50:47 l0420007 kernel: NVRM: Xid (PCI:0000:01:00): 62, pid=609, 21a5(31c4) 00000000 00000000
-- Boot 00c9c19271a44d8ba87206ed597b6168 --
-- Boot af4b2ed1b931489ebf6648c019fe0922 --
-- Boot 68b33289e65146a98a639541c74eef2c --
May 07 10:33:24 l0420007 kernel: NVRM: Xid (PCI:0000:01:00): 62, pid=697, 21a5(31c4) 00000000 00000000
May 07 10:33:36 l0420007 kernel: NVRM: Xid (PCI:0000:01:00): 8, pid=1586, Channel 00000010
May 07 10:33:37 l0420007 kernel: NVRM: Xid (PCI:0000:01:00): 31, pid=691, Ch 0000000a, intr 00000000. MMU Fault: ENGINE HOST9 HUBCLIENT_HOST_CPU faulted @ 0x1_0009f000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
May 07 10:33:37 l0420007 kernel: NVRM: Xid (PCI:0000:01:00): 31, pid=691, Ch 00000001, intr 00000000. MMU Fault: ENGINE HOST0 HUBCLIENT_HOST faulted @ 0x1_00000000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
May 07 10:33:37 l0420007 kernel: NVRM: Xid (PCI:0000:01:00): 31, pid=691, Ch 00000008, intr 00000000. MMU Fault: ENGINE HOST0 HUBCLIENT_HOST faulted @ 0x1_00003000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
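
To make the per-boot count easier to re-check as more boots accumulate, the filtered journal output above can be tallied with a few lines of Python. This is only a rough sketch and not part of any NVIDIA tooling: the function name `xids_per_boot` is made up for illustration, it assumes input in exactly the format produced by the grep command above, and the `"(first boot)"` label for Xid lines that appear before any `-- Boot` marker is my own convention. The sample lines are copied from the log above.

```python
import re
from collections import Counter

# Patterns for the two kinds of lines kept by the grep filter above.
BOOT_RE = re.compile(r"^-- Boot ([0-9a-f]+) --")
XID_RE = re.compile(r"NVRM: Xid \(PCI:[0-9a-fA-F:.]+\): (\d+),")

def xids_per_boot(lines):
    """Return a list of (boot_id, Counter of Xid codes), in log order."""
    boots = []
    current = None
    for line in lines:
        m = BOOT_RE.match(line)
        if m:
            current = (m.group(1), Counter())
            boots.append(current)
            continue
        m = XID_RE.search(line)
        if m:
            if current is None:
                # Xid lines seen before any boot marker (assumed convention).
                current = ("(first boot)", Counter())
                boots.append(current)
            current[1][int(m.group(1))] += 1
    return boots

sample = """\
-- Boot bd9e23ebec6a45eeb57847614082c241 --
-- Boot aac370ae48354f74b0a05d4585968035 --
May 06 13:54:17 l0420007 kernel: NVRM: Xid (PCI:0000:01:00): 62, pid=871, 21a5(31c4) 00000000 00000000
May 06 13:54:18 l0420007 kernel: NVRM: Xid (PCI:0000:01:00): 44, pid=818, Ch 00000008, intr 00000000
May 06 13:54:23 l0420007 kernel: NVRM: Xid (PCI:0000:01:00): 44, pid=102373, Ch 00000008, intr 00000000
-- Boot 6f933bee005049728b37bc3c94d1fa74 --
May 06 14:00:35 l0420007 kernel: NVRM: Xid (PCI:0000:01:00): 62, pid=614, 21a5(31c4) 00000000 00000000
""".splitlines()

boots = xids_per_boot(sample)
affected = [boot_id for boot_id, codes in boots if codes]
print(f"{len(affected)} of {len(boots)} boots logged an Xid")
for boot_id, codes in boots:
    if codes:
        print(boot_id[:8], dict(sorted(codes.items())))
```

For the real journal, the same function could be fed `sys.stdin` instead of `sample`, with the grep output piped into the script.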

Attached is the bug report output from when the latest display lockup happened, during a Blender CUDA render. Given the frequency and nature of these issues, should I assume a hardware (i.e. GPU) error, or can this be caused by the current set of driver/kernel/CUDA/etc. versions?

nvidia-bug-report.log.gz (319.6 KB)

By the way, I ran the Dell diagnostics available from the boot menu, and they did not detect any issue. However, I get the impression those diagnostics only cover the Intel GPU and not the NVIDIA one.

Looks like a broken GPU or video memory. You could check it using cuda_memtest
https://github.com/ComputationalRadiationPhysics/cuda_memtest
and gpu-burn
https://github.com/wilicc/gpu-burn

I ran a first set of cuda_memtest and gpu_burn runs for a few hours, all without errors. However, starting a regular GPU application directly after those runs caused a few more lockups. In addition, the latest gpu_burn run, after a reboot, fairly quickly ran into an issue and reported an error, along with the usual errors in the system journal:

May 07 16:15:38 l0420007 kernel: NVRM: GPU at PCI:0000:01:00: GPU-abec0481-1b64-1815-d98e-6b2a688a093d
May 07 16:15:38 l0420007 kernel: NVRM: GPU Board Serial Number: 
May 07 16:15:38 l0420007 kernel: NVRM: Xid (PCI:0000:01:00): 61, pid=703, 0d20(31cc) 00000000 00000000
May 07 16:15:38 l0420007 kernel: NVRM: Xid (PCI:0000:01:00): 45, pid=1215, Ch 00000010
May 07 16:15:38 l0420007 kernel: NVRM: Xid (PCI:0000:01:00): 45, pid=1215, Ch 00000011
May 07 16:15:38 l0420007 kernel: NVRM: Xid (PCI:0000:01:00): 45, pid=1215, Ch 00000012
May 07 16:15:38 l0420007 kernel: NVRM: Xid (PCI:0000:01:00): 45, pid=1215, Ch 00000013
May 07 16:15:38 l0420007 kernel: NVRM: Xid (PCI:0000:01:00): 45, pid=1215, Ch 00000014
May 07 16:15:38 l0420007 kernel: NVRM: Xid (PCI:0000:01:00): 45, pid=1215, Ch 00000015
May 07 16:15:38 l0420007 kernel: NVRM: Xid (PCI:0000:01:00): 45, pid=1215, Ch 00000016
May 07 16:15:38 l0420007 kernel: NVRM: Xid (PCI:0000:01:00): 45, pid=1215, Ch 00000017

And a second gpu_burn run, after a reboot, also quickly gives a (different) error:

May 07 16:23:45 l0420007 kernel: NVRM: GPU at PCI:0000:01:00: GPU-abec0481-1b64-1815-d98e-6b2a688a093d
May 07 16:23:45 l0420007 kernel: NVRM: GPU Board Serial Number: 
May 07 16:23:45 l0420007 kernel: NVRM: Xid (PCI:0000:01:00): 13, pid=607, Graphics SM Warp Exception on (GPC 0, TPC 0, SM 1): Misaligned Address
May 07 16:23:45 l0420007 kernel: NVRM: Xid (PCI:0000:01:00): 13, pid=607, Graphics SM Global Exception on (GPC 0, TPC 0, SM 1): Multiple Warp Errors
May 07 16:23:45 l0420007 kernel: NVRM: Xid (PCI:0000:01:00): 13, pid=607, Graphics Exception: ESR 0x5047b0=0x500000f 0x5047b4=0x24 0x5047a8=0x4c1eb72 0x5047ac=0x174
May 07 16:23:45 l0420007 kernel: NVRM: Xid (PCI:0000:01:00): 43, pid=1115, Ch 00000010
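
Since the Xid 13 lines name the unit that raised the exception, it can help to pull out the (GPC, TPC, SM) triple and the reason to see whether the faults cluster on one unit. Below is a small Python sketch that does this for lines copied from the logs in this thread. The regular expression is only fitted to the message format seen here and may not match other driver versions; `parse_sm_exception` is a made-up helper name. Whether the spread across units points to hardware or software is not something the script can decide.

```python
import re
from collections import Counter

# Fitted to the "Graphics SM ... Exception" lines shown in this thread (assumption).
SM_EXC_RE = re.compile(
    r"Xid \(PCI:[0-9a-fA-F:.]+\): 13, pid=(\d+), "
    r"Graphics SM (Warp|Global) Exception on "
    r"\(GPC (\d+), TPC (\d+), SM (\d+)\): (.+)$"
)

def parse_sm_exception(line):
    """Return a dict for an Xid 13 SM exception line, or None if it does not match."""
    m = SM_EXC_RE.search(line)
    if not m:
        return None
    pid, kind, gpc, tpc, sm, reason = m.groups()
    return {
        "pid": int(pid),
        "kind": kind,
        "unit": (int(gpc), int(tpc), int(sm)),
        "reason": reason.strip(),
    }

sample = [
    "May 06 14:30:55 l0420007 kernel: NVRM: Xid (PCI:0000:01:00): 13, pid=634, Graphics SM Warp Exception on (GPC 1, TPC 1, SM 0): Illegal Instruction Encoding",
    "May 06 14:32:45 l0420007 kernel: NVRM: Xid (PCI:0000:01:00): 13, pid=1625, Graphics SM Warp Exception on (GPC 0, TPC 0, SM 0): Illegal Instruction Encoding",
    "May 06 14:32:45 l0420007 kernel: NVRM: Xid (PCI:0000:01:00): 13, pid=1625, Graphics SM Warp Exception on (GPC 1, TPC 1, SM 1): Illegal Instruction Encoding",
    "May 07 16:23:45 l0420007 kernel: NVRM: Xid (PCI:0000:01:00): 13, pid=607, Graphics SM Warp Exception on (GPC 0, TPC 0, SM 1): Misaligned Address",
    # ESR register dumps are Xid 13 too, but carry no unit triple, so they are skipped.
    "May 07 16:23:45 l0420007 kernel: NVRM: Xid (PCI:0000:01:00): 13, pid=607, Graphics Exception: ESR 0x5047b0=0x500000f 0x5047b4=0x24 0x5047a8=0x4c1eb72 0x5047ac=0x174",
]

events = [e for e in map(parse_sm_exception, sample) if e]
units = sorted({e["unit"] for e in events})
reasons = Counter(e["reason"] for e in events)

print("distinct (GPC, TPC, SM) units:", units)
for reason, count in reasons.most_common():
    print(f"{count} x {reason}")
```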

Given the different Xid codes reported when these errors occur, I’m really curious what to make of this.
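
For reference, the codes seen in this thread can be looked up in NVIDIA's Xid error documentation. The short descriptions in the sketch below are paraphrased from memory of that document and should be checked against the current version before relying on them; the names `XID_DESCRIPTIONS` and `describe` are made up for this example.

```python
# Short descriptions for the Xid codes that appear in this thread.
# Paraphrased from NVIDIA's Xid documentation (assumption: verify against the current docs).
XID_DESCRIPTIONS = {
    8: "GPU stopped processing",
    13: "Graphics Engine Exception",
    31: "GPU memory page fault",
    43: "GPU stopped processing",
    44: "Graphics Engine fault during context switch",
    45: "Preemptive cleanup, due to previous errors",
    61: "Internal micro-controller breakpoint/warning",
    62: "Internal micro-controller halt",
}

def describe(code):
    """Return the short description for an Xid code, or a fallback hint."""
    return XID_DESCRIPTIONS.get(
        code, "not in this table; see NVIDIA's Xid documentation"
    )

for code in sorted(XID_DESCRIPTIONS):
    print(f"Xid {code:>2}: {describe(code)}")
```

As far as I understand that document, several of these codes list more than one possible cause (hardware, driver, or application), so the table alone does not settle the question.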

It’s broken, just have it replaced by Dell.