Elementary OS 5 / 390.77 drivers / GTX 970 - GPU has fallen off the bus

Summary: during gameplay (Steam, XCOM 2) the screen goes black and the card seems to shut down and disconnect from the monitor. The rest of the system keeps working fine.

I have hit this issue repeatedly on several Linux distributions. I am a bit of a distro hopper and have seen the same problem over and over again. Last month it was on Xubuntu; now I see it on elementary OS 5.

I suspected a thermal issue, so I ran the game with the case removed. It still crashed. I then logged nvidia-smi output during gameplay; here is the last reading before the crash:

==============NVSMI LOG==============

Timestamp                           : Fri Oct 19 21:58:21 2018
Driver Version                      : 390.77

Attached GPUs                       : 1
GPU 00000000:01:00.0
    Temperature
        GPU Current Temp            : 77 C
        GPU Shutdown Temp           : 96 C
        GPU Slowdown Temp           : 91 C
        GPU Max Operating Temp      : N/A
        Memory Current Temp         : N/A
        Memory Max Operating Temp   : N/A

=======
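As a rough sketch of how a reading like the one above can be compared against the card's own thresholds, the snippet below parses the temperature fields and computes the headroom to the slowdown temperature. The function name `parse_temps` is hypothetical, and the sample text is copied from the log in this post; it assumes the field labels appear exactly as shown there.

```python
import re

def parse_temps(report: str) -> dict:
    """Collect the numeric 'GPU <kind> Temp : NN C' fields from the report text."""
    temps = {}
    for line in report.splitlines():
        # Fields reported as N/A do not match and are skipped.
        m = re.match(r"\s*GPU (Current|Shutdown|Slowdown) Temp\s*:\s*(\d+) C", line)
        if m:
            temps[m.group(1).lower()] = int(m.group(2))
    return temps

# Temperature section copied from the log above.
sample = """\
    Temperature
        GPU Current Temp            : 77 C
        GPU Shutdown Temp           : 96 C
        GPU Slowdown Temp           : 91 C
        GPU Max Operating Temp      : N/A
"""

temps = parse_temps(sample)
headroom = temps["slowdown"] - temps["current"]
print(f"current={temps['current']} C, headroom to slowdown={headroom} C")
```

For the reading above this gives 14 C of headroom before the slowdown threshold, which is consistent with the crash not being thermal.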

I will also attach the crash report file. I found a new error after the last failure and will attach it as well.
nvidia-bug-report.log.gz (141 KB)
last-crash-details.log (212 KB)
nvtemp.log (28.2 KB)

uname -a
Linux homemachine 4.15.0-36-generic #39-Ubuntu SMP Mon Sep 24 16:19:09 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

This looks like a hardware issue; the temperatures are fine.
Things to try:

  • Reseat the card, try it in another slot, and try it in another system.
  • Remove all but one system memory module, test, then swap in the next module and repeat.
  • Check the PSU.

Generix, did you look at the logs at all?

Replying with a cookie-cutter answer does not really get to the root cause of the issue. The hardware performs just fine under Windows.

If you look at the last-crash-details.log file, you will see that there is clearly a stack trace in the log. Here is a snippet (the lines appear newest first):

Oct 19 21:58:22 homemachine kernel: Fixing recursive fault but reboot is needed!
Oct 19 21:58:22 homemachine kernel: ---[ end trace 48b974302ed95ae3 ]---
Oct 19 21:58:22 homemachine kernel: CR2: 0000000000000000
Oct 19 21:58:22 homemachine kernel: RIP:           (null) RSP: ffffb75c423ffe98
Oct 19 21:58:22 homemachine kernel: Code:  Bad RIP value.
Oct 19 21:58:22 homemachine kernel:  rewind_stack_do_exit+0x17/0x20
Oct 19 21:58:22 homemachine kernel:  ? kthread+0x121/0x140
Oct 19 21:58:22 homemachine kernel:  do_exit+0x2ec/0xb40
Oct 19 21:58:22 homemachine kernel:  ? task_work_run+0x9d/0xc0
Oct 19 21:58:22 homemachine kernel: Call Trace:
Oct 19 21:58:22 homemachine kernel: CR2: 0000000000000000 CR3: 00000001e400a006 CR4: 00000000001606e0
Oct 19 21:58:22 homemachine kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Oct 19 21:58:22 homemachine kernel: FS:  0000000000000000(0000) GS:ffff9b759ed00000(0000) knlGS:0000000000000000
Oct 19 21:58:22 homemachine kernel: R13: ffff9b75858e4f38 R14: ffffffffb914d200 R15: 0000000000000000
Oct 19 21:58:22 homemachine kernel: R10: ffffffffb8c06a80 R11: ffffffffb915380d R12: ffff9b75858e4440
Oct 19 21:58:22 homemachine kernel: RBP: ffffb75c423ffed0 R08: 0000000000000000 R09: 0000000000000000
Oct 19 21:58:22 homemachine kernel: RDX: ffffb75c423ffec0 RSI: 0000000000000000 RDI: ffffb75c423ffec0
Oct 19 21:58:22 homemachine kernel: RAX: 0000000000000000 RBX: ffff9b75858e4440 RCX: ffffb75c423ffec0
Oct 19 21:58:22 homemachine kernel: RSP: 0018:ffffb75c423ffe98 EFLAGS: 00210286
Oct 19 21:58:22 homemachine kernel: RIP: 0010:          (null)
Oct 19 21:58:22 homemachine kernel: Hardware name: MSI MS-7851/Z97I AC (MS-7851), BIOS V4.8 06/02/2015
Oct 19 21:58:22 homemachine kernel: CPU: 4 PID: 971 Comm: irq/32-nvidia Tainted: P      D    OE    4.15.0-36-generic #39-Ubuntu
Oct 19 21:58:22 homemachine kernel:  mac_hid intel_smartconnect wmi acpi_pad sch_fq_codel parport_pc ppdev lp parport ip_tables x_tables autofs4 btrfs xor zstd_compress raid6_pq hid_generic usbhid hid ahci r8169 libahci mii video
Oct 19 21:58:22 homemachine kernel: Modules linked in: ccm rfcomm cmac bnep snd_hda_codec_hdmi binfmt_misc nls_iso8859_1 arc4 nvidia_uvm(POE) nvidia_drm(POE) nvidia_modeset(POE) nvidia(POE) snd_hda_codec_realtek snd_hda_codec_generic snd_hda_intel snd_hda_codec snd_hda_core mxm_wmi snd_hwdep intel_rapl x86_pkg_temp_thermal snd_pcm intel_powerclamp drm_kms_helper coretemp kvm_intel kvm iwlmvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel snd_seq_midi pcbc snd_seq_midi_event mac80211 drm ipmi_devintf aesni_intel ipmi_msghandler joydev fb_sys_fops iwlwifi snd_rawmidi aes_x86_64 input_leds crypto_simd btusb glue_helper snd_seq btrtl cryptd btbcm intel_cstate btintel syscopyarea intel_rapl_perf sysfillrect bluetooth snd_seq_device mei_me sysimgblt lpc_ich mei snd_timer cfg80211 snd ecdh_generic soundcore shpchp
Oct 19 21:58:22 homemachine kernel: Oops: 0010 [#2] SMP PTI
Oct 19 21:58:22 homemachine kernel: PGD 0 P4D 0
Oct 19 21:58:22 homemachine kernel: IP:           (null)
Oct 19 21:58:22 homemachine kernel: BUG: unable to handle kernel NULL pointer dereference at 0000000000000000
Oct 19 21:58:22 homemachine kernel: ---[ end trace 48b974302ed95ae2 ]---
Oct 19 21:58:22 homemachine kernel: CR2: 00002630000002f8
Oct 19 21:58:22 homemachine kernel: RIP: _nv018564rm+0x52/0x130 [nvidia] RSP: ffffb75c423ffd58
Oct 19 21:58:22 homemachine kernel: Code: 00 00 00 48 8b 52 18 4c 89 e3 48 83 c3 08 48 89 55 08 0f 84 da 00 00 00 49 8b 44 24 08 48 85 c0 0f 84 8f 00 00 00 48 89 c1 eb 56 <44> 8b 81 f8 02 00 00 45 8b 4c 24 04 48 8d 4d 08 48 89 ea 4c 89
Oct 19 21:58:22 homemachine kernel:  ? ret_from_fork+0x35/0x40
Oct 19 21:58:22 homemachine kernel:  ? SyS_exit_group+0x14/0x20
Oct 19 21:58:22 homemachine kernel:  ? do_syscall_64+0x73/0x130
Oct 19 21:58:22 homemachine kernel:  ? kthread_create_worker_on_cpu+0x70/0x70
Oct 19 21:58:22 homemachine kernel:  ? irq_thread_check_affinity+0xe0/0xe0
Oct 19 21:58:22 homemachine kernel:  ? kthread+0x121/0x140
Oct 19 21:58:22 homemachine kernel:  ? irq_forced_thread_fn+0x70/0x70
Oct 19 21:58:22 homemachine kernel:  ? irq_thread+0x145/0x1a0
Oct 19 21:58:22 homemachine kernel:  ? irq_thread_fn+0x26/0x60
Oct 19 21:58:22 homemachine kernel:  ? nvidia_isr_kthread_bh+0x11/0x20 [nvidia]
Oct 19 21:58:22 homemachine kernel:  ? nvidia_isr_common_bh+0x3d/0x60 [nvidia]
Oct 19 21:58:22 homemachine kernel:  ? irq_finalize_oneshot.part.40+0xf0/0xf0
Oct 19 21:58:22 homemachine kernel:  ? rm_isr_bh+0x23/0x70 [nvidia]
Oct 19 21:58:22 homemachine kernel:  ? irq_finalize_oneshot.part.40+0xf0/0xf0
Oct 19 21:58:22 homemachine kernel:  ? _nv001174rm+0x10e/0x150 [nvidia]
Oct 19 21:58:22 homemachine kernel:  ? _nv026237rm+0x71/0xa0 [nvidia]
Oct 19 21:58:22 homemachine kernel:  ? _nv007249rm+0x1a8/0x290 [nvidia]
Oct 19 21:58:22 homemachine kernel:  ? _nv022452rm+0xb6b/0x1010 [nvidia]
Oct 19 21:58:22 homemachine kernel: Call Trace:
Oct 19 21:58:22 homemachine kernel: CR2: 00002630000002f8 CR3: 00000001e400a006 CR4: 00000000001606e0
Oct 19 21:58:22 homemachine kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Oct 19 21:58:22 homemachine kernel: FS:  0000000000000000(0000) GS:ffff9b759ed00000(0000) knlGS:0000000000000000
Oct 19 21:58:22 homemachine kernel: R13: ffff9b758bcc0008 R14: 0000000000000000 R15: ffff9b7585654008
Oct 19 21:58:22 homemachine kernel: R10: 000000005bcab5ee R11: ffffffffc15a1ae0 R12: ffff9b7584e55d20
Oct 19 21:58:22 homemachine kernel: RBP: ffff9b7584e55ca0 R08: ffff9b7584e557d0 R09: ffff9b7584e55d20
Oct 19 21:58:22 homemachine kernel: RDX: 0000000000000000 RSI: ffff9b758bcc0008 RDI: ffff9b7585654008
Oct 19 21:58:22 homemachine kernel: RAX: 0000263000000000 RBX: ffff9b7584e55d28 RCX: 0000263000000000
Oct 19 21:58:22 homemachine kernel: RSP: 0018:ffffb75c423ffd58 EFLAGS: 00210246
Oct 19 21:58:22 homemachine kernel: RIP: 0010:_nv018564rm+0x52/0x130 [nvidia]

There is also this:

Oct 19 21:58:22 homemachine kernel: NVRM: Xid (PCI:0000:01:00): 13, Graphics Exception: ESR 0x515648=0xffffffff 0x515650=0xffffffff 0x515644=0xffffffff 0x51564c=0xffffffff
Oct 19 21:58:22 homemachine kernel: NVRM: Xid (PCI:0000:01:00): 13, Graphics SM Global Exception on (GPC 2, TPC 2): Timeout Error
Oct 19 21:58:22 homemachine kernel: NVRM: Xid (PCI:0000:01:00): 13, Graphics SM Global Exception on (GPC 2, TPC 2): ECC DED Error
Oct 19 21:58:22 homemachine kernel: NVRM: Xid (PCI:0000:01:00): 13, Graphics SM Global Exception on (GPC 2, TPC 2): ECC SEC Error
Oct 19 21:58:22 homemachine kernel: NVRM: Xid (PCI:0000:01:00): 13, Graphics SM Global Exception on (GPC 2, TPC 2): Physical Multiple Warp Errors
Oct 19 21:58:22 homemachine kernel: NVRM: Xid (PCI:0000:01:00): 13, Graphics SM Global Exception on (GPC 2, TPC 2): SM to SM fault
Oct 19 21:58:22 homemachine kernel: NVRM: Xid (PCI:0000:01:00): 13, Graphics Exception: ESR 0x515224=0xffffffff 0x515228=0xffffffff 0x51522c=0xffffffff 0x515234=0xffffffff
Oct 19 21:58:22 homemachine kernel: NVRM: Xid (PCI:0000:01:00): 13, NVRM: Graphics TEX Exception on (GPC 2, TPC 2):     TEX NACK / Page Fault
Oct 19 21:58:22 homemachine kernel: NVRM: Xid (PCI:0000:01:00): 13, NVRM: Graphics TEX Exception on (GPC 2, TPC 2):    TEX APERTURE
Oct 19 21:58:22 homemachine kernel: NVRM: Xid (PCI:0000:01:00): 13, NVRM: Graphics TEX Exception on (GPC 2, TPC 2):     TEX LAYOUT
Oct 19 21:58:22 homemachine kernel: NVRM: Xid (PCI:0000:01:00): 13, NVRM: Graphics TEX Exception on (GPC 2, TPC 2):    TEX FORMAT
Oct 19 21:58:22 homemachine kernel: NVRM: Xid (PCI:0000:01:00): 13, Graphics Exception: ESR 0x515224=0xffffffff 0x515228=0xffffffff 0x51522c=0xffffffff 0x515234=0xffffffff
Oct 19 21:58:22 homemachine kernel: NVRM: Xid (PCI:0000:01:00): 13, NVRM: Graphics TEX Exception on (GPC 2, TPC 2):     TEX NACK / Page Fault
Oct 19 21:58:22 homemachine kernel: NVRM: Xid (PCI:0000:01:00): 13, NVRM: Graphics TEX Exception on (GPC 2, TPC 2):    TEX APERTURE
Oct 19 21:58:22 homemachine kernel: NVRM: Xid (PCI:0000:01:00): 13, NVRM: Graphics TEX Exception on (GPC 2, TPC 2):     TEX LAYOUT
Oct 19 21:58:22 homemachine kernel: NVRM: Xid (PCI:0000:01:00): 13, NVRM: Graphics TEX Exception on (GPC 2, TPC 2):    TEX FORMAT
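To get an overview of which Xid codes a kernel log contains, the lines can be tallied by code. The sketch below is only an illustration: `tally_xids` and `XID_RE` are hypothetical names, the pattern assumes the `NVRM: Xid (PCI:...): <code>,` layout seen in the lines above, and the sample is a short excerpt copied from this post rather than the full attached log.

```python
import re
from collections import Counter

# Matches lines such as:
#   NVRM: Xid (PCI:0000:01:00): 13, Graphics Exception: ...
XID_RE = re.compile(r"NVRM: Xid \(PCI:([0-9a-fA-F:.]+)\): (\d+),")

def tally_xids(log_text: str) -> Counter:
    """Count how often each Xid code appears in the given log text."""
    counts = Counter()
    for line in log_text.splitlines():
        m = XID_RE.search(line)
        if m:
            counts[int(m.group(2))] += 1
    return counts

# Excerpt copied from the logs quoted in this thread.
sample = """\
Oct 19 21:58:22 homemachine kernel: NVRM: Xid (PCI:0000:01:00): 13, Graphics SM Global Exception on (GPC 2, TPC 2): Timeout Error
Oct 19 21:58:22 homemachine kernel: NVRM: Xid (PCI:0000:01:00): 13, NVRM: Graphics TEX Exception on (GPC 2, TPC 2):    TEX FORMAT
Oct 19 21:58:22 homemachine kernel: Oops: 0010 [#2] SMP PTI
"""

counts = tally_xids(sample)
print(counts)
print("Xid 79 present:", 79 in counts)
```

Run against a full log, the same tally shows whether the Xid 13 entries are accompanied by other codes.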

I’m sorry if you feel it’s a ‘cookie cutter answer’, but the backtrace and all the XIDs that come before XID 79 are just symptoms: the driver is in the middle of processing, the GPU stops answering, and eventually the driver notices that the GPU is gone (XID 79).
In my experience, XID 79 is a hardware issue.
While the fact that it works on Windows would argue against that, Linux handles things differently, so I stand by my opinion that this is a hardware issue.