Xid errors on GTX 1070 @ linux

This is a copy and paste from my original post here https://forums.geforce.com/default/topic/1078396/geforce-drivers/xid-errors-on-gtx-1070-linux/

The problem
What : Screen freezes, Xid codes appear in system logs (kernel, Xorg).
Which codes : 31, 13, 69, 32, 12, 56.
When : Always, sooner or later. Occurs sooner in Steam games.
Why : Driver. On Windows I can play exactly the same games. Sometimes I get “device lost” or “device hung”.
Who : Galax GeForce GTX 1070 OC Mini 8GB GDDR5 256-Bit, S/N 70NSH6DVO5MN .
Where : GNU/Linux .
How : Left 4 Dead 2 in-game, Metro 2033 Redux in any part of the game, CS:GO in-game (fewer errors). Rarely with Unigine Valley.
How many times : For Steam games : always. For the rest : sporadic.

System info
Platform : Ryzen, chipset X370 .
Driver version : 410.73 x86_64 (.run, ubuntu 16) .
Kernel : 4.18.0-2-amd64 (debian X86_64) .
libs : checked.
cuda : 10.0 .

Personal observations
CPU : I had to send my CPU (1800X) in for a warranty replacement, since I thought the issue could be caused by the PCIe controller. My old CPU (batch 21) also had segfault issues. The current batch is 35.
Chipset : I replaced my motherboard because at first I thought the issue was in the PCIe slot or something related. I went from an X370 Taichi to an X370GT7.
GPU : seems to work well, tested on CUDA applications, and Unigine Valley runs fine most of the time.
PSU : not an issue, tested on 2 different PSUs, Strike-X 800W Silver and Corsair AX860i Platinum.

Raw output (from dmesg, for example)
[ 4514.185733] NVRM: GPU at PCI:0000:09:00: GPU-3cafb039-8cf0-4c61-20ff-cc44042e1c48
[ 4514.185737] NVRM: GPU Board Serial Number:
[ 4514.185741] NVRM: Xid (PCI:0000:09:00): 31, Ch 00000030, engmask 00000101, intr 10000000
[ 4515.966830] NVRM: Xid (PCI:0000:09:00): 13, Graphics Exception: MISSING_MACRO_DATA
[ 4515.966838] NVRM: Xid (PCI:0000:09:00): 13, Graphics Exception: ESR 0x404490=0x80000001
[ 4515.966872] NVRM: Xid (PCI:0000:09:00): 13, Graphics Exception: ChID 0030, Class 0000c197, Offset 00002390, Data 00000000
[ 5188.266947] NVRM: Xid (PCI:0000:09:00): 13, Graphics Exception: EXTRA_MACRO_DATA
[ 5188.266958] NVRM: Xid (PCI:0000:09:00): 13, Graphics Exception: ESR 0x404490=0x80000002
[ 5188.267001] NVRM: Xid (PCI:0000:09:00): 13, Graphics Exception: ChID 0030, Class 0000c197, Offset 00001618, Data 00000007
[ 5932.412935] NVRM: Xid (PCI:0000:09:00): 13, Graphics Exception: EXTRA_MACRO_DATA
[ 5932.412944] NVRM: Xid (PCI:0000:09:00): 13, Graphics Exception: ESR 0x404490=0x80000002
[ 5932.412979] NVRM: Xid (PCI:0000:09:00): 13, Graphics Exception: ChID 0030, Class 0000c197, Offset 00002390, Data 00000310
[ 5948.377628] NVRM: Xid (PCI:0000:09:00): 69, Class Error: ChId 0030, Class 0000c197, Offset 00002388, Data 00fcb101, ErrorCode 00000004
[ 6010.341325] warning: process `metro' used the deprecated sysctl system call with 10.1.
[ 6013.618188] NVRM: Xid (PCI:0000:09:00): 32, Channel ID 00000043 intr 00040000
[ 6016.037934] NVRM: Xid (PCI:0000:09:00): 13, Graphics Exception: MISSING_MACRO_DATA
[ 6016.037943] NVRM: Xid (PCI:0000:09:00): 13, Graphics Exception: ESR 0x404490=0x80000001
[ 6016.037978] NVRM: Xid (PCI:0000:09:00): 13, Graphics Exception: ChID 0043, Class 0000c197, Offset 0000342c, Data 00000001
[ 6024.162134] NVRM: Xid (PCI:0000:09:00): 32, Channel ID 00000043 intr 00040000
[ 6028.145277] NVRM: Xid (PCI:0000:09:00): 32, Channel ID 00000043 intr 00040000
[ 6033.169969] NVRM: Xid (PCI:0000:09:00): 32, Channel ID 00000043 intr 00040000
[ 6040.681824] NVRM: Xid (PCI:0000:09:00): 32, Channel ID 00000040 intr 00040000
[ 6045.169454] NVRM: Xid (PCI:0000:09:00): 69, Class Error: ChId 0040, Class 0000c197, Offset 00000754, Data 00000000, ErrorCode 00000004
[ 6046.481502] NVRM: Xid (PCI:0000:09:00): 69, Class Error: ChId 0040, Class 0000c197, Offset 00002388, Data 03ddbc01, ErrorCode 00000004
[ 6046.674237] NVRM: Xid (PCI:0000:09:00): 69, Class Error: ChId 0040, Class 0000c197, Offset 00002380, Data 00000202, ErrorCode 00000004
[ 6046.985544] NVRM: Xid (PCI:0000:09:00): 12, Ch 00000040 Cl 0000c197 Off 00001928 Data 00000001
[ 6084.260382] NVRM: Xid (PCI:0000:09:00): 32, Channel ID 00000040 intr 00040000
[ 6113.019498] NVRM: Xid (PCI:0000:09:00): 69, Class Error: ChId 0040, Class 0000c197, Offset 0000238c, Data 00000002, ErrorCode 00000004
[ 6130.407651] NVRM: Xid (PCI:0000:09:00): 32, Channel ID 00000040 intr 00040000
[ 6130.407786] NVRM: Xid (PCI:0000:09:00): 32, Channel ID 00000040 intr 00040000
[ 6147.743023] NVRM: Xid (PCI:0000:09:00): 32, Channel ID 00000040 intr 00040000
[ 6147.743158] NVRM: Xid (PCI:0000:09:00): 32, Channel ID 00000040 intr 00040000
[ 6148.279135] NVRM: Xid (PCI:0000:09:00): 69, Class Error: ChId 0040, Class 0000c197, Offset 00002040, Data 00016530, ErrorCode 0000000c
[ 6172.322880] NVRM: Xid (PCI:0000:09:00): 32, Channel ID 00000040 intr 00040000
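
A dump like the one above is easier to compare between attempts once it is reduced to a count per Xid code. The following is a minimal sketch, not an official tool: the function name count_xids and the sample variable are made up for illustration, and the short descriptions in XID_NAMES are my recollection of NVIDIA's public Xid documentation, so they should be checked there.

```python
import re
from collections import Counter

# Short descriptions as I recall them from NVIDIA's Xid documentation.
# Treat them as an assumption and verify against the official list.
XID_NAMES = {
    12: "Driver error handling GPU exception",
    13: "Graphics Engine Exception",
    31: "GPU memory page fault",
    32: "Invalid or corrupted push buffer stream",
    56: "Display Engine error",
    69: "Graphics Engine class error",
}

# Matches e.g. "NVRM: Xid (PCI:0000:09:00): 31, Ch 00000030, ..."
XID_RE = re.compile(r"NVRM: Xid \((PCI:[0-9a-fA-F:.]+)\): (\d+),")


def count_xids(log_text):
    """Return a Counter mapping Xid code -> number of log lines carrying it."""
    counts = Counter()
    for line in log_text.splitlines():
        match = XID_RE.search(line)
        if match:
            counts[int(match.group(2))] += 1
    return counts


# A few lines copied from the dmesg output in this post.
sample = """\
[ 4514.185741] NVRM: Xid (PCI:0000:09:00): 31, Ch 00000030, engmask 00000101, intr 10000000
[ 4515.966830] NVRM: Xid (PCI:0000:09:00): 13, Graphics Exception: MISSING_MACRO_DATA
[ 4515.966838] NVRM: Xid (PCI:0000:09:00): 13, Graphics Exception: ESR 0x404490=0x80000001
[ 6013.618188] NVRM: Xid (PCI:0000:09:00): 32, Channel ID 00000043 intr 00040000
"""

for code, n in sorted(count_xids(sample).items()):
    print(code, n, XID_NAMES.get(code, "unknown"))
```

Note that a single Xid 13 event spans several log lines, so the count is per line rather than per event.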

Adding more info:

IOMMU
Enabling IOMMU is not a solution.

“Left 4 Dead 2” : with IOMMU enabled, some of the Xid codes take longer to appear.
“Metro 2033 Redux” : with IOMMU enabled, the Xid codes appear much sooner.

RAM timings
Running at 1600MHz, 2133MHz or 2667MHz, with either XMP or JEDEC timings, results in the same Xid codes.
The chips were already tested and sent in under warranty. They found no problems and returned the chips.

Everything leads me to believe that it is a driver issue.

More raw outputs

IOMMU disabled, 2 different attempts
[ 6693.145473] NVRM: GPU at PCI:0000:09:00: GPU-3cafb039-8cf0-4c61-20ff-cc44042e1c48
[ 6693.145481] NVRM: GPU Board Serial Number:
[ 6693.145487] NVRM: Xid (PCI:0000:09:00): 69, Class Error: ChId 003b, Class 0000c197, Offset 00001614, Data 00000000, ErrorCode 0000000d
[ 6696.212796] NVRM: Xid (PCI:0000:09:00): 56, CMDre 00000001 00000080 00000000 00000005 00000034
[ 6719.437984] NVRM: Xid (PCI:0000:09:00): 56, CMDre 00000001 00000080 00000000 00000005 00000034
[ 6742.542683] NVRM: Xid (PCI:0000:09:00): 56, CMDre 00000001 00000080 00000000 00000005 00000034
[ 7922.685307] NVRM: Xid (PCI:0000:09:00): 32, Channel ID 00000038 intr 00040000
[ 7922.892342] NVRM: Xid (PCI:0000:09:00): 32, Channel ID 00000038 intr 00040000
[ 7923.099539] NVRM: Xid (PCI:0000:09:00): 32, Channel ID 00000038 intr 00040000
[ 7923.475077] NVRM: Xid (PCI:0000:09:00): 32, Channel ID 00000038 intr 00040000
[ 7923.794759] NVRM: Xid (PCI:0000:09:00): 32, Channel ID 00000038 intr 00040000
[ 7928.012294] NVRM: Xid (PCI:0000:09:00): 32, Channel ID 00000038 intr 00040000
[ 7929.984515] NVRM: Xid (PCI:0000:09:00): 32, Channel ID 00000038 intr 00040000
[ 7934.455201] NVRM: Xid (PCI:0000:09:00): 32, Channel ID 00000038 intr 00040000
[ 7938.531087] NVRM: Xid (PCI:0000:09:00): 32, Channel ID 00000038 intr 00040000

IOMMU enabled. Note the wider intervals between failures within the same attempt (the first attempt was on L4D2, up to timestamp 1705)
[ 663.479114] NVRM: Xid (PCI:0000:09:00): 32, Channel ID 00000028 intr 00040000
[ 677.610775] NVRM: Xid (PCI:0000:09:00): 56, CMDre 00000001 00000080 00000000 00000005 00000034
[ 716.549281] NVRM: Xid (PCI:0000:09:00): 32, Channel ID 00000028 intr 00040000
[ 718.634529] NVRM: Xid (PCI:0000:09:00): 56, CMDre 00000001 00000080 00000000 00000005 00000034
[ 741.763267] NVRM: Xid (PCI:0000:09:00): 56, CMDre 00000001 00000080 00000000 00000005 00000034
[ 764.875447] NVRM: Xid (PCI:0000:09:00): 56, CMDre 00000001 00000080 00000000 00000005 00000034
[ 784.997477] NVRM: Xid (PCI:0000:09:00): 56, CMDre 00000001 00000080 00000000 00000005 00000034
[ 1503.091368] NVRM: Xid (PCI:0000:09:00): 69, Class Error: ChId 002b, Class 0000c197, Offset 00000214, Data 00001011, ErrorCode 00000004
[ 1505.156677] NVRM: Xid (PCI:0000:09:00): 56, CMDre 00000001 00000080 00000000 00000005 00000034
[ 1525.240677] NVRM: Xid (PCI:0000:09:00): 56, CMDre 00000001 00000080 00000000 00000005 00000034
[ 1548.308032] NVRM: Xid (PCI:0000:09:00): 56, CMDre 00000001 00000080 00000000 00000005 00000034
[ 1636.692415] NVRM: Xid (PCI:0000:09:00): 69, Class Error: ChId 002b, Class 0000a140, Offset 000001b0, Data 00001001, ErrorCode 00000053
[ 1638.749038] NVRM: Xid (PCI:0000:09:00): 56, CMDre 00000001 00000080 00000000 00000005 00000034
[ 1679.857352] NVRM: Xid (PCI:0000:09:00): 13, Graphics Exception: EXTRA_MACRO_DATA
[ 1679.857362] NVRM: Xid (PCI:0000:09:00): 13, Graphics Exception: ESR 0x404490=0x80000002
[ 1679.857418] NVRM: Xid (PCI:0000:09:00): 13, Graphics Exception: ChID 002b, Class 0000c197, Offset 00002380, Data 00007000
[ 1681.927364] NVRM: Xid (PCI:0000:09:00): 56, CMDre 00000001 00000080 00000000 00000005 00000034
[ 1705.014863] NVRM: Xid (PCI:0000:09:00): 56, CMDre 00000001 00000080 00000000 00000005 00000034
[ 1972.628439] NVRM: Xid (PCI:0000:09:00): 69, Class Error: ChId 002b, Class 0000c197, Offset 00001538, Data 00000002, ErrorCode 0000000c
[ 1974.730907] NVRM: Xid (PCI:0000:09:00): 56, CMDre 00000001 00000080 00000000 00000005 00000034
[ 1997.827641] NVRM: Xid (PCI:0000:09:00): 56, CMDre 00000001 00000080 00000000 00000005 00000034
[ 2020.910600] NVRM: Xid (PCI:0000:09:00): 56, CMDre 00000001 00000080 00000000 00000005 00000034
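
To put numbers on the "interval between failures" in logs like these, the gap between consecutive Xid lines can be computed from the dmesg timestamps. This is only a sketch under the assumption that every line keeps the "[ seconds] NVRM: Xid (...): code," layout shown here; xid_gaps is a hypothetical helper name.

```python
import re

# Captures the dmesg timestamp and the Xid code from one log line.
TS_XID_RE = re.compile(r"\[\s*(\d+\.\d+)\]\s+NVRM: Xid \([^)]*\): (\d+),")


def xid_gaps(log_text):
    """Return (gap_seconds, xid_code) for every Xid line after the first one."""
    events = [(float(m.group(1)), int(m.group(2)))
              for m in TS_XID_RE.finditer(log_text)]
    return [(round(t1 - t0, 3), code)
            for (t0, _), (t1, code) in zip(events, events[1:])]


# First four lines of the IOMMU-enabled run.
sample_log = """\
[ 663.479114] NVRM: Xid (PCI:0000:09:00): 32, Channel ID 00000028 intr 00040000
[ 677.610775] NVRM: Xid (PCI:0000:09:00): 56, CMDre 00000001 00000080 00000000 00000005 00000034
[ 716.549281] NVRM: Xid (PCI:0000:09:00): 32, Channel ID 00000028 intr 00040000
[ 718.634529] NVRM: Xid (PCI:0000:09:00): 56, CMDre 00000001 00000080 00000000 00000005 00000034
"""

for gap, code in xid_gaps(sample_log):
    print(f"Xid {code} arrived {gap} s after the previous Xid line")
```

Running the same function over the IOMMU-disabled and IOMMU-enabled dumps gives two lists of gaps that can be compared directly.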

Also, cuda-memcheck reports nothing.

Though you already replaced the CPU and RMA’d the memory, those Xids look like the Ryzen bug or faulty memory.
Please run the kill-ryzen script to verify that the replacement CPU you were sent actually works:
https://github.com/Oxalin/ryzen-test
To check the memory, don’t use memcheck or the like, those are unreliable. Remove all memory modules but one, check if the issue still appears, then check with the next memory module.

Already tried.

Even though this script is designed for 16GB of RAM, I tried it a few times with my 8GB (dual channel) and no segfault was found. I’m thinking about adapting it to 8GB or compiling the kernel something like 32 times in a row.

The only kind of segfault I have had (just 2-3 times) came from libxul.so while using Firefox and Chromium, and it takes a long time to occur. I’m trying to reproduce it but have had no luck so far.

I did it once with the previous mobo, gonna try again with this one, thanks for the tip.

Well, trying ryzen-test again, I got a build failure:

./kill-ryzen.sh 2 8

[KERN] -- Logs begin at Mon 2018-10-29 18:37:53 -03. --
[KERN] Oct 29 19:06:38 desk kernel: NVRM: Xid (PCI:0000:09:00): 32, Channel ID 00000033 intr 00040000
[KERN] Oct 29 19:06:47 desk kernel: NVRM: Xid (PCI:0000:09:00): 32, Channel ID 00000033 intr 00040000
[KERN] Oct 29 19:06:47 desk kernel: NVRM: Xid (PCI:0000:09:00): 32, Channel ID 00000030 intr 00040000
[KERN] Oct 29 19:07:15 desk kernel: NVRM: Xid (PCI:0000:09:00): 69, Class Error: ChId 0030, Class 0000c197, Offset 000017e4, Data 26a1ffff, ErrorCode 0000000d
[KERN] Oct 29 19:07:15 desk kernel: NVRM: Xid (PCI:0000:09:00): 69, Class Error: ChId 0030, Class 0000c197, Offset 000017e4, Data 0be66fff, ErrorCode 0000000d
[KERN] Oct 29 19:07:35 desk kernel: NVRM: Xid (PCI:0000:09:00): 69, Class Error: ChId 0030, Class 0000c197, Offset 00000804, Data 07400002, ErrorCode 00000004
[KERN] Oct 29 19:07:35 desk kernel: NVRM: Xid (PCI:0000:09:00): 32, Channel ID 00000030 intr 00040000
[KERN] Oct 29 20:18:09 desk kernel: zram: Added device: zram0
[KERN] Oct 29 20:18:09 desk kernel: zram0: detected capacity change from 0 to 68719476736
[KERN] Oct 29 20:18:10 desk kernel: EXT4-fs (zram0): mounted filesystem with ordered data mode. Opts: discard
[loop-0] Mon Oct 29 20:18:55 -03 2018 start 0
[loop-1] Mon Oct 29 20:18:56 -03 2018 start 0
[loop-0] Mon Oct 29 20:22:49 -03 2018 build failed
[loop-0] TIME TO FAIL: 234 s
[loop-1] Mon Oct 29 20:22:49 -03 2018 build failed
[loop-1] TIME TO FAIL: 234 s

Gonna contact AMD again, I’m having serious headaches with this processor. I wish I had my old FX8350 back; it was reliable for 5 years of use.

On the last attempt, the only error I found was insufficient memory, since the build needs more than 8GB of RAM.

A failed build alone does not mean you hit the Ryzen bug. You’ll have to check the build log to see whether a segfault is really the cause; it could have failed because it ran out of memory.
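
One rough way to do that check is to scan the build log for segfault wording versus out-of-memory wording. The sketch below is only an illustration: classify_build_log is a hypothetical helper, and the patterns are assumptions based on messages GCC and the Linux kernel commonly print, so they may miss other phrasings.

```python
import re

# Assumed patterns; compilers and kernels may word these messages differently.
SEGFAULT_RE = re.compile(r"segmentation fault|segfault at", re.IGNORECASE)
OOM_RE = re.compile(
    r"out of memory|cannot allocate memory|virtual memory exhausted",
    re.IGNORECASE,
)


def classify_build_log(log_text):
    """Return 'segfault', 'oom', 'both' or 'unknown' for a build log."""
    has_segfault = bool(SEGFAULT_RE.search(log_text))
    has_oom = bool(OOM_RE.search(log_text))
    if has_segfault and has_oom:
        return "both"
    if has_segfault:
        return "segfault"
    if has_oom:
        return "oom"
    return "unknown"


print(classify_build_log("gcc: internal compiler error: Segmentation fault (program cc1)"))
print(classify_build_log("virtual memory exhausted: Cannot allocate memory"))
```

A result of "oom" or "unknown" says nothing about the CPU; only a segfault that survives with enough free memory points toward the Ryzen bug.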

Forgot to update here but I got errors like
[KERN] Jul 12 13:35:04 strider kernel: bash[11568]: segfault at 60 ip 0000000000435d7e sp 00007fff8106ee00 error 6 in bash[400000+100000]
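
A kernel segfault line like this one can be decoded a bit further. On x86 the "error" value is the page-fault error code, printed in hex: bit 0 is set for a protection violation (clear means the page was not present), bit 1 for a write, bit 2 for user mode. Subtracting the mapping base from the instruction pointer gives the offset inside the binary. The helper name decode_segfault below is made up for this sketch.

```python
import re

SEGV_RE = re.compile(
    r"(?P<comm>[^\s\[]+)\[(?P<pid>\d+)\]: segfault at (?P<addr>[0-9a-f]+) "
    r"ip (?P<ip>[0-9a-f]+) sp (?P<sp>[0-9a-f]+) error (?P<err>[0-9a-f]+)"
    r"(?: in (?P<obj>[^\[\s]+)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\])?"
)


def decode_segfault(line):
    """Parse one kernel 'segfault at' line; return a dict, or None if no match."""
    m = SEGV_RE.search(line)
    if m is None:
        return None
    err = int(m.group("err"), 16)
    info = {
        "comm": m.group("comm"),
        "pid": int(m.group("pid")),
        "fault_addr": int(m.group("addr"), 16),
        "protection_violation": bool(err & 1),  # bit 0 clear: page not present
        "write": bool(err & 2),                 # bit 1 clear: read access
        "user_mode": bool(err & 4),             # bit 2 clear: kernel mode
        "object": m.group("obj"),
        "offset": None,
    }
    if m.group("base") is not None:
        # Offset of the faulting instruction inside the mapped object.
        info["offset"] = int(m.group("ip"), 16) - int(m.group("base"), 16)
    return info


line = ("[KERN] Jul 12 13:35:04 strider kernel: bash[11568]: segfault at 60 "
        "ip 0000000000435d7e sp 00007fff8106ee00 error 6 in bash[400000+100000]")
info = decode_segfault(line)
print(info["comm"], hex(info["fault_addr"]), hex(info["offset"]),
      "write" if info["write"] else "read")
```

For the line above this reads as a user-mode write to a non-present page at address 0x60, at offset 0x35d7e inside bash.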

In the meantime I tested the same GPU in another, borrowed PC and it seems to work without Xid codes. At least I couldn’t find a single one while testing for 3 hours.

I need to investigate this further.

Just something I didn’t notice before: do you use zram by default? That doesn’t go well with the nvidia driver.

Used with and without.

I made some changes to the original script, to allow non-root and non-zram executions.

Ah, ok, that was just for kill-ryzen, so forget about it.

Going to report the same sort of errors.
Kernel v4.20 and Nvidia driver 415.25 - also kernel v4.18 and Nvidia driver 410 on both a 970 and 1070. Ryzen 5 2600 and x370 Mobo

Same on core i5 9600k / 2x8gb ddr4 3000
I have tested drivers 390, 418 and 430…
nvidia-bug-report.log.gz (1.07 MB)

Please open a new thread.
Please run nvidia-bug-report.sh as root and attach the resulting .gz file to your post. Hovering the mouse over an existing post of yours will reveal a paperclip icon.
https://devtalk.nvidia.com/default/topic/1043347/announcements/attaching-files-to-forum-topics-posts/