I get rare freeze-ups on my current machine. They occur every few days, with no log entry and no way to recover.
I frequently (every boot) get Xid errors 32 and 69. I believe these are related. If I am reading the Xid documentation correctly, these two Xids can only be caused by driver issues. I sometimes get other Xid errors, but those logs have been truncated, so I will keep an eye out for them.
I believe this is caused by some race condition: the issue gets better (crashes, flickering, slowness, and Xids are all rarer) when I turn on maximum performance instead of auto performance. The issue is worse when I play games or do other GPU-intensive work. WebGL sometimes crashes, though video games seem to recover fairly well.
The output of nvidia-bug-report.sh is attached.
The machine is used for only a few hours a day, but it is mostly online, so I can run experiments or other code to attempt to reproduce the issue if someone can send me the code. I am familiar with C/C++ development. I can also run things in any kind of super-debug mode if someone can tell me how. I have not set up a machine to receive the logs (a remote logger), because I am unsure whether that will help: I doubt the messages get out of the buffer before the kernel hangs. The kernel module should be in persistence mode. I am running Arch Linux with the latest kernel and NVIDIA driver.
What I get from dmesg | grep -i 'nvrm' is something like:
Sep 25 02:02:18 RockCruncher kernel: NVRM: loading NVIDIA UNIX x86_64 Kernel Module 384.69 Wed Aug 16 19:34:54 PDT 2017 (using threaded interrupts)
Sep 25 02:02:19 RockCruncher kernel: NVRM: Your system is not currently configured to drive a VGA console
Sep 25 17:34:47 RockCruncher kernel: NVRM: loading NVIDIA UNIX x86_64 Kernel Module 384.69 Wed Aug 16 19:34:54 PDT 2017 (using threaded interrupts)
Sep 25 17:34:49 RockCruncher kernel: NVRM: Your system is not currently configured to drive a VGA console
Sep 25 18:40:57 RockCruncher kernel: NVRM: GPU at PCI:0000:0c:00: GPU-bc43403c-41f2-3d53-37da-dd090bfda690
Sep 25 18:40:57 RockCruncher kernel: NVRM: GPU Board Serial Number:
Sep 25 18:40:57 RockCruncher kernel: NVRM: Xid (PCI:0000:0c:00): 32, Channel ID 0000001b intr 00040000
Sep 25 18:41:13 RockCruncher kernel: NVRM: Xid (PCI:0000:0c:00): 32, Channel ID 0000001b intr 00040000
Sep 25 19:20:58 RockCruncher kernel: NVRM: Xid (PCI:0000:0c:00): 32, Channel ID 0000001b intr 00040000
Sep 25 19:21:03 RockCruncher kernel: NVRM: Xid (PCI:0000:0c:00): 69, Class Error: ChId 001b, Class 0000c197, Offset 00001688, Data 00008000, ErrorCode 0000000c
Sep 25 19:21:09 RockCruncher kernel: NVRM: Xid (PCI:0000:0c:00): 32, Channel ID 0000001b intr 00040000
Sep 25 19:27:23 RockCruncher kernel: NVRM: Xid (PCI:0000:0c:00): 32, Channel ID 0000001b intr 00040000
Sep 25 19:42:46 RockCruncher kernel: NVRM: Xid (PCI:0000:0c:00): 32, Channel ID 0000001b intr 00040000
Sep 25 19:43:15 RockCruncher kernel: NVRM: Xid (PCI:0000:0c:00): 32, Channel ID 0000001b intr 00040000
Sep 25 19:43:22 RockCruncher kernel: NVRM: Xid (PCI:0000:0c:00): 32, Channel ID 0000001b intr 00040000
Sep 25 19:43:24 RockCruncher kernel: NVRM: Xid (PCI:0000:0c:00): 32, Channel ID 0000001b intr 00040000
Sep 25 19:59:13 RockCruncher kernel: NVRM: Xid (PCI:0000:0c:00): 32, Channel ID 0000001b intr 00040000
Sep 25 20:54:24 RockCruncher kernel: NVRM: Xid (PCI:0000:0c:00): 32, Channel ID 0000001b intr 00040000
Sep 25 23:37:29 RockCruncher kernel: NVRM: Xid (PCI:0000:0c:00): 32, Channel ID 0000001b intr1 00000008 HCE_DBG0 0000238c HCE_DBG1 00000020
Sep 25 23:37:29 RockCruncher kernel: NVRM: Xid (PCI:0000:0c:00): 32, Channel ID 0000001b intr1 00000008 HCE_DBG0 00002390 HCE_DBG1 00000345
Sep 25 23:49:56 RockCruncher kernel: NVRM: Xid (PCI:0000:0c:00): 32, Channel ID 00000010 intr 00040000
Sep 25 23:50:44 RockCruncher kernel: NVRM: Xid (PCI:0000:0c:00): 32, Channel ID 00000010 intr 00040000
Sep 26 00:35:58 RockCruncher kernel: NVRM: Xid (PCI:0000:0c:00): 32, Channel ID 0000001b intr1 00000008 HCE_DBG0 0000238c HCE_DBG1 00000020
Sep 26 00:35:58 RockCruncher kernel: NVRM: Xid (PCI:0000:0c:00): 32, Channel ID 0000001b intr1 00000008 HCE_DBG0 00002390 HCE_DBG1 00000345
Sep 26 01:08:19 RockCruncher kernel: NVRM: Xid (PCI:0000:0c:00): 32, Channel ID 0000001b intr1 00000008 HCE_DBG0 00002390 HCE_DBG1 00000000
Sep 26 01:08:19 RockCruncher kernel: NVRM: Xid (PCI:0000:0c:00): 32, Channel ID 0000001b intr1 00000008 HCE_DBG0 00002394 HCE_DBG1 00000000
Sep 26 01:40:01 RockCruncher kernel: NVRM: Xid (PCI:0000:0c:00): 32, Channel ID 0000001b intr1 00000008 HCE_DBG0 00002390 HCE_DBG1 00000000
Sep 26 01:40:01 RockCruncher kernel: NVRM: Xid (PCI:0000:0c:00): 32, Channel ID 0000001b intr1 00000008 HCE_DBG0 00002394 HCE_DBG1 00000000
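In case it helps anyone compare logs, the Xid lines above could be tallied by error code and channel with a small script. This is only a rough sketch: the regular expressions are my guess at the line format based on the output above (not an official format description), and `summarize_xids` and `format_summary` are hypothetical helper names.

```python
import re
from collections import Counter

# Assumed line shape, taken from the log above:
#   "... NVRM: Xid (PCI:0000:0c:00): 32, Channel ID 0000001b intr 00040000"
XID_RE = re.compile(r"NVRM: Xid \(PCI:([0-9a-fA-F:.]+)\): (\d+), (.*)")
# The channel shows up as "Channel ID 0000001b" (Xid 32) or "ChId 001b" (Xid 69).
CHANNEL_RE = re.compile(r"(?:Channel ID|ChId) ([0-9a-fA-F]+)")


def summarize_xids(lines):
    """Count Xid events per (xid_code, channel_id) pair.

    channel_id is an int parsed from hex, or None if no channel was found.
    Lines that are not Xid reports are skipped.
    """
    counts = Counter()
    for line in lines:
        match = XID_RE.search(line)
        if match is None:
            continue
        code = int(match.group(2))
        channel_match = CHANNEL_RE.search(match.group(3))
        channel = int(channel_match.group(1), 16) if channel_match else None
        counts[(code, channel)] += 1
    return counts


def format_summary(counts):
    """Render the counts as text lines, sorted by Xid code and then channel."""
    def sort_key(item):
        (code, channel), _ = item
        return (code, -1 if channel is None else channel)

    rows = []
    for (code, channel), n in sorted(counts.items(), key=sort_key):
        channel_text = "unknown" if channel is None else f"0x{channel:02x}"
        rows.append(f"Xid {code} on channel {channel_text}: {n} event(s)")
    return rows
```

Feeding it the filtered log, for example by passing `sys.stdin` to `summarize_xids` and piping the grep output in, should make it easier to see whether the errors stay on one channel (here mostly 0x1b) or move around.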
nvidia-bug-report.log.gz (161 KB)