Hung/frozen machine with X370 board, GTX 1060 card, Ryzen 5 CPU - Xid 32 & 69 - all driver versions

I get rare freeze-ups on my current machine. They occur every few days, with no log entry and no way to recover.

I frequently (every boot) get Xid errors 32 and 69. I believe these are related. If I am reading the Xid documentation correctly, these two Xids can only be caused by driver issues. I sometimes get other Xid errors, but those logs have been truncated - I will keep on the lookout for them.

I believe this is caused by some race condition - the issue gets better (crashes/flickering/slowness/Xid’s are rarer) when I turn on maximum performance instead of auto performance. The issue is worse when I play games or other GPU-intensive activities. WebGL sometimes crashes, though video games seem to recover pretty well?

The output of nvidia-bug-report.sh is attached.

The machine is used for only a few hours a day, but it is mostly online - I can run experiments or other code to attempt to reproduce the issue if someone can send me the code. I am familiar with C/C++ development. I can also run things in any kind of super-debug mode if someone can tell me how. I have not set up a machine to receive the logs (a remote logger), because I am unsure if that would help - I doubt they get out of the buffer before the kernel hangs. The kernel module should be in persistence mode. I am running Arch Linux and the latest kernel/nvidia driver.

What I get from dmesg | grep -i 'nvrm' is something like:

Sep 25 02:02:18 RockCruncher kernel: NVRM: loading NVIDIA UNIX x86_64 Kernel Module 384.69 Wed Aug 16 19:34:54 PDT 2017 (using threaded interrupts)
Sep 25 02:02:19 RockCruncher kernel: NVRM: Your system is not currently configured to drive a VGA console
Sep 25 17:34:47 RockCruncher kernel: NVRM: loading NVIDIA UNIX x86_64 Kernel Module 384.69 Wed Aug 16 19:34:54 PDT 2017 (using threaded interrupts)
Sep 25 17:34:49 RockCruncher kernel: NVRM: Your system is not currently configured to drive a VGA console
Sep 25 18:40:57 RockCruncher kernel: NVRM: GPU at PCI:0000:0c:00: GPU-bc43403c-41f2-3d53-37da-dd090bfda690
Sep 25 18:40:57 RockCruncher kernel: NVRM: GPU Board Serial Number:
Sep 25 18:40:57 RockCruncher kernel: NVRM: Xid (PCI:0000:0c:00): 32, Channel ID 0000001b intr 00040000
Sep 25 18:41:13 RockCruncher kernel: NVRM: Xid (PCI:0000:0c:00): 32, Channel ID 0000001b intr 00040000
Sep 25 19:20:58 RockCruncher kernel: NVRM: Xid (PCI:0000:0c:00): 32, Channel ID 0000001b intr 00040000
Sep 25 19:21:03 RockCruncher kernel: NVRM: Xid (PCI:0000:0c:00): 69, Class Error: ChId 001b, Class 0000c197, Offset 00001688, Data 00008000, ErrorCode 0000000c
Sep 25 19:21:09 RockCruncher kernel: NVRM: Xid (PCI:0000:0c:00): 32, Channel ID 0000001b intr 00040000
Sep 25 19:27:23 RockCruncher kernel: NVRM: Xid (PCI:0000:0c:00): 32, Channel ID 0000001b intr 00040000
Sep 25 19:42:46 RockCruncher kernel: NVRM: Xid (PCI:0000:0c:00): 32, Channel ID 0000001b intr 00040000
Sep 25 19:43:15 RockCruncher kernel: NVRM: Xid (PCI:0000:0c:00): 32, Channel ID 0000001b intr 00040000
Sep 25 19:43:22 RockCruncher kernel: NVRM: Xid (PCI:0000:0c:00): 32, Channel ID 0000001b intr 00040000
Sep 25 19:43:24 RockCruncher kernel: NVRM: Xid (PCI:0000:0c:00): 32, Channel ID 0000001b intr 00040000
Sep 25 19:59:13 RockCruncher kernel: NVRM: Xid (PCI:0000:0c:00): 32, Channel ID 0000001b intr 00040000
Sep 25 20:54:24 RockCruncher kernel: NVRM: Xid (PCI:0000:0c:00): 32, Channel ID 0000001b intr 00040000
Sep 25 23:37:29 RockCruncher kernel: NVRM: Xid (PCI:0000:0c:00): 32, Channel ID 0000001b intr1 00000008 HCE_DBG0 0000238c HCE_DBG1 00000020
Sep 25 23:37:29 RockCruncher kernel: NVRM: Xid (PCI:0000:0c:00): 32, Channel ID 0000001b intr1 00000008 HCE_DBG0 00002390 HCE_DBG1 00000345
Sep 25 23:49:56 RockCruncher kernel: NVRM: Xid (PCI:0000:0c:00): 32, Channel ID 00000010 intr 00040000
Sep 25 23:50:44 RockCruncher kernel: NVRM: Xid (PCI:0000:0c:00): 32, Channel ID 00000010 intr 00040000
Sep 26 00:35:58 RockCruncher kernel: NVRM: Xid (PCI:0000:0c:00): 32, Channel ID 0000001b intr1 00000008 HCE_DBG0 0000238c HCE_DBG1 00000020
Sep 26 00:35:58 RockCruncher kernel: NVRM: Xid (PCI:0000:0c:00): 32, Channel ID 0000001b intr1 00000008 HCE_DBG0 00002390 HCE_DBG1 00000345
Sep 26 01:08:19 RockCruncher kernel: NVRM: Xid (PCI:0000:0c:00): 32, Channel ID 0000001b intr1 00000008 HCE_DBG0 00002390 HCE_DBG1 00000000
Sep 26 01:08:19 RockCruncher kernel: NVRM: Xid (PCI:0000:0c:00): 32, Channel ID 0000001b intr1 00000008 HCE_DBG0 00002394 HCE_DBG1 00000000
Sep 26 01:40:01 RockCruncher kernel: NVRM: Xid (PCI:0000:0c:00): 32, Channel ID 0000001b intr1 00000008 HCE_DBG0 00002390 HCE_DBG1 00000000
Sep 26 01:40:01 RockCruncher kernel: NVRM: Xid (PCI:0000:0c:00): 32, Channel ID 0000001b intr1 00000008 HCE_DBG0 00002394 HCE_DBG1 00000000
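
To get an overview of a log like the one above, the Xid lines can be tallied by Xid number and channel. This is a minimal Python sketch, not an NVIDIA tool: the function name `tally_xids` and the regular expression are assumptions based only on the line format shown in this post.

```python
import re
from collections import Counter

# Matches lines such as:
#   NVRM: Xid (PCI:0000:0c:00): 32, Channel ID 0000001b intr 00040000
#   NVRM: Xid (PCI:0000:0c:00): 69, Class Error: ChId 001b, Class ...
# The pattern is an assumption derived from the log excerpt in this thread.
XID_RE = re.compile(
    r"NVRM: Xid \((?P<pci>PCI:[0-9a-fA-F:.]+)\): (?P<xid>\d+),"
    r".*?(?:Channel ID|ChId) (?P<chan>[0-9a-fA-F]+)"
)


def tally_xids(lines):
    """Count Xid events per (xid number, channel id) pair."""
    counts = Counter()
    for line in lines:
        m = XID_RE.search(line)
        if m:
            key = (int(m.group("xid")), int(m.group("chan"), 16))
            counts[key] += 1
    return counts


def format_tally(counts):
    """Render the tally as one human-readable line per (xid, channel) pair."""
    return [
        f"Xid {xid}  channel 0x{chan:02x}  count {n}"
        for (xid, chan), n in sorted(counts.items())
    ]
```

Feeding it the saved output of the grep command (for example `format_tally(tally_xids(open("nvrm.log")))`, where `nvrm.log` is a hypothetical file name) makes it easy to see whether the errors stay on one channel or move around.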

nvidia-bug-report.log.gz (161 KB)

Looks more like a hardware-related error: corrupt PCI transfers, maybe flaky or overheating system memory, or slot problems. Maybe reseat the card and remove/swap memory.

I was really hoping that would be the answer - I reseated the card in a different PCI slot (from an x16 to an x8) and I moved the memory modules around. The issues went away for a while, but today I got the following:

[98800.039057] NVRM: GPU at PCI:0000:0c:00: GPU-bc43403c-41f2-3d53-37da-dd090bfda690
[98800.039063] NVRM: GPU Board Serial Number:
[98800.039067] NVRM: Xid (PCI:0000:0c:00): 32, Channel ID 00000010 intr1 00000008 HCE_DBG0 00000c00 HCE_DBG1 03bc0000
[98800.039355] NVRM: Xid (PCI:0000:0c:00): 32, Channel ID 00000010 intr1 00000008 HCE_DBG0 00000c04 HCE_DBG1 04100000

[178560.577764] NVRM: Xid (PCI:0000:0c:00): 32, Channel ID 00000010 intr1 00000008 HCE_DBG0 00000c00 HCE_DBG1 00010000
[178560.577989] NVRM: Xid (PCI:0000:0c:00): 32, Channel ID 00000010 intr1 00000008 HCE_DBG0 00000c04 HCE_DBG1 00010000
[269344.241391] NVRM: Xid (PCI:0000:0c:00): 32, Channel ID 00000010 intr 00040000
[349981.172633] NVRM: Xid (PCI:0000:0c:00): 32, Channel ID 00000010 intr 00040000
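
The bracketed numbers in this dmesg output are seconds since boot, so the spacing of the events can be read off directly. A small Python sketch that extracts those timestamps and computes the gap between consecutive Xid reports (the helper names `xid_times` and `gaps_in_hours` are made up for this illustration):

```python
import re

# A dmesg prefix such as "[98800.039067]" is the time in seconds since boot.
TS_RE = re.compile(r"^\[\s*(?P<ts>\d+\.\d+)\]\s+NVRM: Xid ")


def xid_times(lines):
    """Return the boot-relative timestamps (seconds) of all Xid lines."""
    times = []
    for line in lines:
        m = TS_RE.search(line.lstrip())
        if m:
            times.append(float(m.group("ts")))
    return times


def gaps_in_hours(times):
    """Gaps between consecutive events, in hours."""
    return [(b - a) / 3600.0 for a, b in zip(times, times[1:])]
```

For the excerpt above, the events (single or paired) come out roughly 22 to 25 hours apart.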

It could be that my motherboard is bad, or the GPU is bad, or something else. I am not sure if I am in the position to replace any of them, but I would like to replace the most likely to be the problem first.

Really hard to tell. The first task would be to come up with a procedure to quickly reproduce the failure. For example, does running some Unigine demo in benchmark mode trigger the issue? Let it run for 30 minutes and see. Then maybe test the graphics card in an otherwise working system using the same test procedure.

My first runs with unigine-heaven did not produce any errors. Subsequent runs sometimes produce errors (like the last one, which is included below). I have also tried some CUDA code that is meant to measure PCI Express throughput - I was able to get a bunch of Xid 31’s, but that is due to my picking the wrong transfer size afaict. I also tried https://sourceforge.net/projects/cudagpumemtest/ which did give me one Xid 32, but I could not reproduce the issue.

unigine-heaven seems to give me the issue most consistently, but it’s still not very reproducible. I have heard rumors that the motherboard I have has some subtle PCIe timing issue, but only in the context of GPU passthrough. I don’t have another machine handy that I could use to test against, but I may have to dig one up…

Do any of these experiments isolate any part of the system as not being the issue?

Latest batch below:

[37347.095914] NVRM: GPU at PCI:0000:0c:00: GPU-bc43403c-41f2-3d53-37da-dd090bfda690
[37347.095918] NVRM: GPU Board Serial Number:
[37347.095922] NVRM: Xid (PCI:0000:0c:00): 32, Channel ID 00000018 intr1 00000008 HCE_DBG0 0000238c HCE_DBG1 00000200
[37347.096180] NVRM: Xid (PCI:0000:0c:00): 32, Channel ID 00000018 intr1 00000008 HCE_DBG0 00002390 HCE_DBG1 40000000
[37385.630808] NVRM: Xid (PCI:0000:0c:00): 32, Channel ID 00000018 intr1 00000008 HCE_DBG0 000017c8 HCE_DBG1 00000001
[37385.631026] NVRM: Xid (PCI:0000:0c:00): 32, Channel ID 00000018 intr1 00000008 HCE_DBG0 000017cc HCE_DBG1 3eb73ce0
[116930.777702] NVRM: Xid (PCI:0000:0c:00): 32, Channel ID 00000018 intr 00040000
[116966.641763] NVRM: Xid (PCI:0000:0c:00): 32, Channel ID 00000018 intr 00040000

The rumors you heard are probably those about the X399/Threadripper chipset, at least I don’t know about general problems with the X370/Ryzen chipset.

Maybe do another run with Unigine but this time monitor temperatures using sensors and nvidia-smi. Perhaps there’s some overheating involved.

Other than that, the only option left is to get another system in which you can test your GPU.

Maybe also try this. It has a bootable image included
http://mikelab.kiev.ua/index_en.php?page=PROGRAMS/vmt_en

Did you check if you’re affected by the early Ryzen’s bug?
https://github.com/suaefar/ryzen-test

I monitored the read-out on the unigine display - it peaked at ~82C, and I have set up my machine to log the temperature every 5 minutes, which I am hoping will let me see a correlation between temp and Xids.
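
A periodic temperature log of this kind could look like the following Python sketch. It relies on the `nvidia-smi --query-gpu=temperature.gpu --format=csv,noheader,nounits` query; the log file name `gpu_temp.log`, the function names, and the 300-second default are assumptions chosen for the example, not the setup actually used here.

```python
import subprocess
import time
from datetime import datetime

# nvidia-smi query that prints one bare integer (degrees C) per GPU.
QUERY = ["nvidia-smi", "--query-gpu=temperature.gpu",
         "--format=csv,noheader,nounits"]


def parse_temps(output):
    """Turn nvidia-smi output (one integer per line) into a list of ints."""
    return [int(tok) for tok in output.split()]


def read_gpu_temps():
    """Run nvidia-smi and return the temperature of every GPU it reports."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_temps(result.stdout)


def log_forever(path="gpu_temp.log", interval_s=300):
    """Append a timestamped reading every interval_s seconds (5 min default)."""
    while True:
        temps = read_gpu_temps()
        stamp = datetime.now().isoformat(timespec="seconds")
        with open(path, "a") as fh:
            fh.write(f"{stamp} {temps}\n")
        time.sleep(interval_s)
```

Because the readings carry wall-clock timestamps, they can be lined up against the Xid entries in the journal; a shorter interval would catch short temperature spikes that a 5-minute sample might miss.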

I haven’t had a chance to try the (floppy!?) bootable image yet - I am concerned the tech may have bit-rotted, but I need a chance to look at it.

That being said, it’s probably all moot, since I am definitely getting SEGV’s from the ryzen test. I am contacting AMD now to get a replacement. If you can confirm that the CPU issue can cause those Xid’s, I will accept that as the answer (unless AMD gives me some test that shows it isn’t the CPU, in which case it’s on to the motherboard and RAM (again)).

Thank you so much for your help!

Luckily, another user reported back that the strange XIDs he was getting on CUDA reruns were caused by the defective Ryzen CPU. Completely different XIDs but then, that user only had CUDA workloads, no graphics involved.
Thinking about it, your XID 32 means a corrupt push buffer. The chain in which the corruption can occur boils down to video memory, PCIe, system memory, and CPU.
Taking into account that the Ryzen bug is about data corruption, I’d say chances are high.
IIRC, the Ryzen bug can be worked around to a certain degree by disabling SMT (hyperthreading) in the BIOS.
Regardless, without an otherwise working CPU all tests are for naught anyway.

I wanted to thank you again so much for the help.

I am still waiting on AMD to replace the CPU, but I have disabled SMT in the meantime - I am still getting XIDs, but I assume they are just due to the bad CPU. I am posting just to give more data/record it so I can compare with the new CPU whenever it happens to arrive.

Here is the output of dmesg | grep -i 'nv':
[ 5.288993] NVRM: loading NVIDIA UNIX x86_64 Kernel Module 387.12 Thu Sep 28 20:18:48 PDT 2017 (using threaded interrupts)
[ 6.847520] NVRM: Your system is not currently configured to drive a VGA console

These were induced by unigine-heaven (one process was killed and then the following one gave XID 32s - I am not sure if this is a valid test):
[11272.365122] NVRM: GPU at PCI:0000:0c:00: GPU-bc43403c-41f2-3d53-37da-dd090bfda690
[11272.365133] NVRM: GPU Board Serial Number:
[11272.365138] NVRM: Xid (PCI:0000:0c:00): 69, Class Error: ChId 0010, Class 0000c197, Offset 000017c8, Data 00008001, ErrorCode 0000000c
[11273.188788] NVRM: Xid (PCI:0000:0c:00): 69, Class Error: ChId 0010, Class 0000c197, Offset 000017c8, Data 00008001, ErrorCode 0000000c
[12023.372691] NVRM: Xid (PCI:0000:0c:00): 32, Channel ID 00000010 intr1 00000008 HCE_DBG0 00001514 HCE_DBG1 00000000
[14324.607876] NVRM: Xid (PCI:0000:0c:00): 32, Channel ID 00000010 intr1 00000008 HCE_DBG0 00000f1c HCE_DBG1 00000000
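
To compare Xid 69 reports across runs, the hex fields can be pulled out of each line. The sketch below is Python; the name `parse_xid69` is hypothetical and the field layout is taken only from the lines in this thread. It does not attempt to decode what the values mean.

```python
import re

# Field layout copied from the Xid 69 lines posted in this thread.
XID69_RE = re.compile(
    r"Xid \([^)]*\): 69, Class Error: ChId (?P<chid>[0-9a-f]+), "
    r"Class (?P<cls>[0-9a-f]+), Offset (?P<offset>[0-9a-f]+), "
    r"Data (?P<data>[0-9a-f]+), ErrorCode (?P<error>[0-9a-f]+)",
    re.IGNORECASE,
)


def parse_xid69(line):
    """Return the Xid 69 fields as integers, or None if the line differs."""
    m = XID69_RE.search(line)
    if m is None:
        return None
    return {name: int(value, 16) for name, value in m.groupdict().items()}
```

Applied to the Xid 69 lines in this thread, it shows the same Class (0xc197) and ErrorCode (0xc) each time, with different Offset and Data values.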

This one is spontaneous (after an invalid opcode/segfault from chrome):
[161984.226944] NVRM: Xid (PCI:0000:0c:00): 32, Channel ID 00000010 intr 00040000

Another user had the same problem on his Ryzen system, too:
https://devtalk.nvidia.com/default/topic/1008759/
Don’t know if he ever found a solution.
Edit: the thread contains a link to another issue.
Vendors/chipsets involved: MSI B350, Biostar B350, ASRock X370
GPUs: GTX 1080, GTX 1060, GTX 760

For completeness, XIDs with CUDA and Ryzen:
https://devtalk.nvidia.com/default/topic/1024153/linux/issue-with-cuda-8-and-python-on-linux-ubuntu-16-04-03-kernel-13-1-/
Fixed after replacing CPU.

Sorry to bump an old thread, but I want to report that this is still an issue: kernel v4.20 with Nvidia driver 415.25, and also kernel v4.18 with Nvidia driver 410, on both a 970 and a 1070. Ryzen 5 2600 and X370 motherboard.