[SOLVED - RMA] Freeze when gaming, multiple NVRM errors - Driver issues?

Hey,

I have been trying to solve this by myself for over 2 months now, and I’m out of ideas.
I’ve tried so many things that I’ve completely lost count, but for the sake of providing logs and further detail, I’m more than happy to test anything again.

My problem is that after a couple of minutes of playing certain games, my PC freezes. There is no way to switch to a terminal or TTY; a hard reset is required to recover.
The most notable game is Warframe (Proton). The crash happens after only 5 minutes or so.
In EvE Online (Proton), my monitor sometimes turns grey: no GUI, nothing, just grey.

I booted into my old Windows install and tested benchmarks and games there, and everything worked fine.

I connected to my machine over SSH and started playing with journalctl -f running.
At the time of the freeze, I received:

Nov 02 02:58:50 Ceetemus kernel: NVRM: GPU at PCI:0000:01:00: GPU-27f23ee2-fdae-0271-e491-038e6975f972
Nov 02 02:58:50 Ceetemus kernel: NVRM: Xid (PCI:0000:01:00): 8, Channel 00000063

and…

Nov 02 03:00:22 Ceetemus kernel: NVRM: Xid (PCI:0000:01:00): 79, GPU has fallen off the bus.
Nov 02 03:00:22 Ceetemus kernel: NVRM: GPU 0000:01:00.0: GPU has fallen off the bus.
Nov 02 03:00:22 Ceetemus kernel: NVRM: A GPU crash dump has been created. If possible, please run
                                 NVRM: nvidia-bug-report.sh as root to collect this data before
                                 NVRM: the NVIDIA kernel module is unloaded.

I ran nvidia-bug-report.sh at that time and will attach the result.

I went ahead and searched my journal for similar messages, because I wanted to know whether this was the cause of my frequent crashes or a one-time thing:

Sep 13 21:26:54 Ceetemus kernel: NVRM: Xid (PCI:0000:01:00): 69, Class Error: ChId 001b, Class 0000b197, Offset 000007e4, Data a0040eaa, ErrorCode 0000000c
Sep 13 21:32:55 Ceetemus kernel: NVRM: Xid (PCI:0000:01:00): 69, Class Error: ChId 001b, Class 0000b197, Offset 000007a4, Data 2004c004, ErrorCode 0000000c
Sep 13 21:42:01 Ceetemus kernel: NVRM: Xid (PCI:0000:01:00): 69, Class Error: ChId 0033, Class 0000b197, Offset 000007e4, Data a0040eaa, ErrorCode 0000000c
Sep 14 00:39:24 Ceetemus kernel: NVRM: Xid (PCI:0000:01:00): 69, Class Error: ChId 003b, Class 0000b197, Offset 000007e4, Data a0040eaa, ErrorCode 0000000c
Sep 15 17:04:11 Ceetemus kernel: NVRM: Xid (PCI:0000:01:00): 79, GPU has fallen off the bus.
Sep 21 02:44:46 Ceetemus kernel: NVRM: Xid (PCI:0000:01:00): 8, Channel 00000043
Sep 27 23:12:27 Ceetemus kernel: NVRM: Xid (PCI:0000:01:00): 13, Graphics Exception: Class 0x3d8 Subchannel 0x0 Mismatch
Sep 27 23:12:27 Ceetemus kernel: NVRM: Xid (PCI:0000:01:00): 13, Graphics Exception: ESR 0x4041b0=0x3d8
Sep 27 23:12:27 Ceetemus kernel: NVRM: Xid (PCI:0000:01:00): 13, Graphics Exception: ESR 0x404000=0x80000002
Sep 27 23:12:27 Ceetemus kernel: NVRM: Xid (PCI:0000:01:00): 13, Graphics Exception: ChID 003b, Class 0000b197, Offset 00001a2c, Data 00000000
Sep 27 23:12:27 Ceetemus kernel: NVRM: Xid (PCI:0000:01:00): 32, Channel ID 0000003b intr 02000000
Sep 27 23:18:15 Ceetemus kernel: NVRM: Xid (PCI:0000:01:00): 41, CCMDs 0000003b 0000b0b5
Sep 27 23:18:56 Ceetemus kernel: NVRM: Xid (PCI:0000:01:00): 32, Channel ID 0000003b intr 00800000
Sep 27 23:18:56 Ceetemus kernel: NVRM: Xid (PCI:0000:01:00): 32, Channel ID 0000003b intr 00800000
Sep 30 22:44:32 Ceetemus kernel: NVRM: Xid (PCI:0000:02:00): 31, Ch 00000044, intr 10000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_4 faulted @ 0x0_00000000. Fault is of type FAULT_PDE ACCESS_TYPE_READ
Okt 01 21:28:57 Ceetemus kernel: NVRM: Xid (PCI:0000:01:00): 16, Head 00000000 Count 00063192
Okt 01 21:29:05 Ceetemus kernel: NVRM: Xid (PCI:0000:01:00): 16, Head 00000000 Count 00063193
Okt 01 21:29:13 Ceetemus kernel: NVRM: Xid (PCI:0000:01:00): 16, Head 00000000 Count 00063194
Okt 01 21:29:21 Ceetemus kernel: NVRM: Xid (PCI:0000:01:00): 16, Head 00000000 Count 00063195
Okt 20 00:34:54 Ceetemus kernel: NVRM: Xid (PCI:0000:01:00): 8, Channel 0000001b
Okt 21 00:14:26 Ceetemus kernel: NVRM: Xid (PCI:0000:01:00): 8, Channel 00000053
Okt 31 19:18:10 Ceetemus kernel: NVRM: Xid (PCI:0000:01:00): 79, GPU has fallen off the bus.
Okt 31 20:44:33 Ceetemus kernel: NVRM: Xid (PCI:0000:01:00): 16, Head 00000000 Count 000010ea
Okt 31 22:19:01 Ceetemus kernel: NVRM: Xid (PCI:0000:01:00): 8, pid=353, Channel 00000053
Okt 31 22:40:51 Ceetemus kernel: NVRM: Xid (PCI:0000:01:00): 8, Channel 0000004b
Okt 31 23:19:24 Ceetemus kernel: NVRM: Xid (PCI:0000:01:00): 79, GPU has fallen off the bus.
Nov 01 00:29:36 Ceetemus kernel: NVRM: Xid (PCI:0000:01:00): 31, Ch 00000053, intr 10000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_5 faulted @ 0xff_8836a000. Fault is of type FAULT_INFO_TYPE_UNSUPPORTED_KIND ACCESS_TYPE_READ
Nov 01 00:57:35 Ceetemus kernel: NVRM: Xid (PCI:0000:01:00): 8, Channel 00000053
Nov 02 02:58:50 Ceetemus kernel: NVRM: Xid (PCI:0000:01:00): 8, Channel 00000063
Nov 02 03:00:22 Ceetemus kernel: NVRM: Xid (PCI:0000:01:00): 79, GPU has fallen off the bus.

So we are seeing Xid 8, 13, 16, 31, 32, 41, 69, and 79.
According to https://docs.nvidia.com/deploy/pdf/XID_Errors.pdf, all of these errors have “Driver issue” in common.
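
For anyone who wants to pull the same list out of their own journal, here is a minimal sketch in Python. It assumes the NVRM lines look like the ones quoted above; the function name `count_xids` and the regular expression are my own illustration, not part of any NVIDIA tool.

```python
import re
from collections import Counter

# Matches the "NVRM: Xid (PCI:...): <code>," part of a kernel log line.
# Assumption: the Xid number is always followed by a comma, as in the logs above.
XID_RE = re.compile(r"NVRM: Xid \(PCI:[0-9a-fA-F:.]+\): (\d+),")

def count_xids(lines):
    """Return a Counter mapping Xid code -> number of log lines reporting it."""
    counts = Counter()
    for line in lines:
        match = XID_RE.search(line)
        if match:
            counts[int(match.group(1))] += 1
    return counts

# A few lines copied from the journal excerpts in this post.
sample = [
    "Sep 15 17:04:11 Ceetemus kernel: NVRM: Xid (PCI:0000:01:00): 79, GPU has fallen off the bus.",
    "Sep 21 02:44:46 Ceetemus kernel: NVRM: Xid (PCI:0000:01:00): 8, Channel 00000043",
    "Okt 31 22:19:01 Ceetemus kernel: NVRM: Xid (PCI:0000:01:00): 8, pid=353, Channel 00000053",
    # Not an Xid line, so it is skipped:
    "Nov 02 03:00:22 Ceetemus kernel: NVRM: GPU 0000:01:00.0: GPU has fallen off the bus.",
]

counts = count_xids(sample)
print(sorted(counts.items()))  # [(8, 2), (79, 1)]
```

To run this over the real kernel log, one option is to pipe `journalctl -k` into a script that calls `count_xids(sys.stdin)`.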

Here are a few lines before and after today’s crash:

Nov 02 02:58:22 Ceetemus org_kde_powerdevil[959]: powerdevil: Can't contact ck
Nov 02 02:58:47 Ceetemus org_kde_powerdevil[959]: powerdevil: Releasing inhibition with cookie  2007
Nov 02 02:58:47 Ceetemus org_kde_powerdevil[959]: powerdevil: Restoring DPMS features after inhibition release
Nov 02 02:58:47 Ceetemus org_kde_powerdevil[959]: powerdevil: Scheduling inhibition from ":1.15" "My SDL application" with cookie 2008 and reason "Playing a game"
Nov 02 02:58:47 Ceetemus org_kde_powerdevil[959]: powerdevil: Can't contact ck
Nov 02 02:58:50 Ceetemus kernel: NVRM: GPU at PCI:0000:01:00: GPU-27f23ee2-fdae-0271-e491-038e6975f972
Nov 02 02:58:50 Ceetemus kernel: NVRM: Xid (PCI:0000:01:00): 8, Channel 00000063
Nov 02 02:58:52 Ceetemus org_kde_powerdevil[959]: powerdevil: Enforcing inhibition from ":1.15" "My SDL application" with cookie 2008 and reason "Playing a game"
Nov 02 02:58:52 Ceetemus org_kde_powerdevil[959]: powerdevil: Added change screen settings
Nov 02 02:58:52 Ceetemus org_kde_powerdevil[959]: powerdevil: Added interrupt session
Nov 02 02:58:52 Ceetemus org_kde_powerdevil[959]: powerdevil: Disabling DPMS due to inhibition
Nov 02 02:58:52 Ceetemus org_kde_powerdevil[959]: powerdevil: Can't contact ck
Nov 02 03:00:01 Ceetemus CROND[31312]: (root) CMD (timeshift --check --scripted)
Nov 02 03:00:01 Ceetemus CROND[31311]: (root) CMDOUT ((process:31312): GLib-GIO-CRITICAL **: 03:00:01.172: g_file_get_path: assertion 'G_IS_FILE (file)' failed)
Nov 02 03:00:01 Ceetemus CROND[31311]: (root) CMDOUT ()
Nov 02 03:00:01 Ceetemus CROND[31311]: (root) CMDOUT (** (process:31312): CRITICAL **: 03:00:01.172: tee_jee_file_system_path_combine: assertion 'path1 != NULL' failed)
Nov 02 03:00:01 Ceetemus CROND[31311]: (root) CMDOUT ()
Nov 02 03:00:01 Ceetemus CROND[31311]: (root) CMDOUT (** (process:31312): CRITICAL **: 03:00:01.172: tee_jee_file_system_dir_exists: assertion 'dir_path != NULL' failed)
Nov 02 03:00:01 Ceetemus CROND[31311]: (root) CMDOUT (Daily snapshots are enabled)
Nov 02 03:00:01 Ceetemus CROND[31311]: (root) CMDOUT (Last daily snapshot is 6 hours old)
Nov 02 03:00:01 Ceetemus CROND[31311]: (root) CMDOUT (Monthly snapshot are enabled)
Nov 02 03:00:01 Ceetemus CROND[31311]: (root) CMDOUT (Last monthly snapshot is 28 days old)
Nov 02 03:00:01 Ceetemus CROND[31311]: (root) CMDOUT (------------------------------------------------------------------------------)
Nov 02 03:00:01 Ceetemus crontab[31344]: (root) LIST (root)
Nov 02 03:00:01 Ceetemus crontab[31345]: (root) LIST (root)
Nov 02 03:00:22 Ceetemus kernel: NVRM: Xid (PCI:0000:01:00): 79, GPU has fallen off the bus.
Nov 02 03:00:22 Ceetemus kernel: NVRM: GPU 0000:01:00.0: GPU has fallen off the bus.
Nov 02 03:00:22 Ceetemus kernel: NVRM: A GPU crash dump has been created. If possible, please run
                                 NVRM: nvidia-bug-report.sh as root to collect this data before
                                 NVRM: the NVIDIA kernel module is unloaded.
Nov 02 03:00:22 Ceetemus org_kde_powerdevil[959]: powerdevil: Releasing inhibition with cookie  2008
Nov 02 03:00:22 Ceetemus org_kde_powerdevil[959]: powerdevil: Restoring DPMS features after inhibition release
Nov 02 03:00:22 Ceetemus org_kde_powerdevil[959]: powerdevil: Can't contact ck
Nov 02 03:00:35 Ceetemus kernel: nvidia-modeset: ERROR: GPU:0: Failed to query display engine channel state: 0x0000927c:0:0:0x0000000f
Nov 02 03:00:35 Ceetemus kernel: nvidia-modeset: ERROR: GPU:0: Failed to query display engine channel state: 0x0000927c:1:0:0x0000000f
Nov 02 03:00:35 Ceetemus kernel: nvidia-modeset: ERROR: GPU:0: Failed to query display engine channel state: 0x0000927c:2:0:0x0000000f
^C

I have already tried other drivers, including beta and older ones. As I said, this has been going on for 2 months now.

I’ve also tried other distros (Pop!_OS, Manjaro Xfce), with the same issues.
Hardware is fine; everything runs great on Windows.

What do I do?
My system is up to date.

Thank you for your time
-CT
nvidia-bug-report.log.gz (61.3 KB)

I want to add that in certain situations my 980 Ti gets rather loud. It sounds like coil whine, but I gamed on Windows for over a year and NEVER heard coil whine from my GPU.

With glmark2, most of the tests run with --fullscreen produce audible noise.
When I was testing Warframe (Wine), I encountered the same noises (coil whine?), too.

Just wanted to point that out, as I never had that on Windows.

XID 79 plus coil whine points to problems with the PSU. You can ignore the other XIDs; those are just follow-on errors. The Linux driver clocks up more aggressively than the Windows driver, which is why you didn’t encounter this before. A BIOS update would be advisable, too.

Thank you very much for your work on this forum.

I did a BIOS update and tried other cables for the GPU and a different VGA port on my PSU.
Unfortunately, that didn’t fix it.

It’ll be a while before I can get my hands on another PSU to test your first idea. Is there anything else that I can try in the meantime?

Reseating the card in the slot would be worth a try.
You could monitor temperature using
nvidia-smi -q -d TEMPERATURE -l 2 >nvtemp.log
and check the log after a crash.

nvidia-smi -q -d TEMPERATURE -l 2 >nvtemp.log

==============NVSMI LOG==============

Timestamp                           : Thu Nov  7 22:53:20 2019
Driver Version                      : 435.21
CUDA Version                        : 10.1

Attached GPUs                       : 1
GPU 00000000:02:00.0
    Temperature
        GPU Current Temp            : 29 C
        GPU Shutdown Temp           : 97 C
        GPU Slowdown Temp           : 92 C
        GPU Max Operating Temp      : N/A
        Memory Current Temp         : N/A
        Memory Max Operating Temp   : N/A

It’s watercooled (AIO), so it would’ve been a surprise to see it above 40 C.
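
Since the nvidia-smi loop writes one such block every 2 seconds, the log gets long quickly. A minimal sketch for finding the highest reading after a crash, assuming the log uses the "GPU Current Temp : NN C" layout shown above (the helper name `max_gpu_temp` is made up for this example):

```python
import re

# Assumption: each sample in nvtemp.log contains a line such as
# "GPU Current Temp            : 29 C", as in the output above.
TEMP_RE = re.compile(r"GPU Current Temp\s*:\s*(\d+) C")

def max_gpu_temp(log_text):
    """Return the highest 'GPU Current Temp' value in the log, or None if absent."""
    temps = [int(value) for value in TEMP_RE.findall(log_text)]
    return max(temps) if temps else None

# Excerpt in the same format as the nvidia-smi output above.
sample = """\
    Temperature
        GPU Current Temp            : 29 C
        GPU Shutdown Temp           : 97 C
        GPU Slowdown Temp           : 92 C
"""

print(max_gpu_temp(sample))  # 29
```

On a real log this would be called as `max_gpu_temp(open("nvtemp.log").read())`. The shutdown and slowdown lines are ignored on purpose because they are fixed limits, not readings.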

I’ve reseated the card multiple times now and also tried the other PCIe slot.
I’ve also tried another, identical GPU (I used to run 2-way SLI with EVGA 980 Ti Hybrids but have since sold one, because SLI and multiple monitors didn’t work on Linux).

No idea how I’m gonna acquire another PSU for testing, but I’ll figure something out.

I’m very curious how stuff works… Would you be willing to explain why games like Warframe freeze, while I get a ‘grey screen’ (no freeze) in EvE Online, but Rebel Galaxy (2015) works just fine for hours and hours?

That’s pretty much impossible, since the XIDs give hints but not complete explanations. You were getting different XIDs at different times (games), probably with different driver versions.
Looks like Warframe is reliably triggering XID 79, which is the most telling one.
You also got XID 31, which might be an OOM situation with VMEM in Vulkan; that should be fixed in the 440 driver (which introduced other issues, btw). So maybe the game that triggered it works now.
XID 32 can, among other things, point to a problem with system memory; the nvidia driver is very sensitive to that.
So, different games, different problems.

Update:
It’s not every game that causes issues. I got the most consistent crashes with the game “Warframe” by Digital Extremes.
Weirdly enough, at some point I said fuck it and booted into Windows. I now get a complete PC freeze there, too. (I hadn’t tried Windows for a very long time, as I wanted to game exclusively on Linux. Before I made up my mind to switch to Linux, everything was working fine.)
In my original post I said everything was working fine on Windows while I got crashes on Linux. That sounds contradictory, but as I said, not every game causes issues, so maybe I only tried games and benchmarks that would still work now. I can’t tell anymore.

Tested new PSU
Tested new cables

Got my hands on an RX 580 and gamed with absolutely no issues for about 3 weeks, 7 h a day, on Linux (Warframe and others).
Went to my friend’s house and tried my GTX 980 Ti in his PC: the game crashes (Warframe, Windows 10).
Tried his RTX 2070 SUPER in my PC and gamed for about 40 minutes with no issues (Warframe, Windows 10).

The card is going in for RMA.