Kernel Panics on CentOS7 - Geforce GTX 1080 Ti with Nvidia Driver 384.59

I installed NVIDIA driver 384.59 on two CentOS 7 servers, each with 8 GPUs (GeForce GTX 1080 Ti) and CUDA 8.0.
The nouveau driver is blacklisted on both servers.
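
For anyone who wants to double-check the nouveau part of this setup, here is a minimal sketch in Python. It assumes you pass in the text of a modprobe configuration file and the text of /proc/modules yourself. The function names are made up for illustration and are not part of any NVIDIA tool.

```python
import re


def nouveau_is_blacklisted(modprobe_conf_text: str) -> bool:
    """Return True if a non-comment line reads 'blacklist nouveau'."""
    for line in modprobe_conf_text.splitlines():
        # Drop trailing comments and surrounding whitespace before matching.
        line = line.split("#", 1)[0].strip()
        if re.fullmatch(r"blacklist\s+nouveau", line):
            return True
    return False


def nouveau_is_loaded(proc_modules_text: str) -> bool:
    """Return True if a module named exactly 'nouveau' is listed.

    The first whitespace-separated field of each /proc/modules line
    is the module name.
    """
    return any(
        line.split()[:1] == ["nouveau"]
        for line in proc_modules_text.splitlines()
    )
```

On a real system you would read the files under /etc/modprobe.d/ and /proc/modules and pass their contents in. A blacklist entry that is present while the module is still loaded usually means the initramfs was not rebuilt, but that is an assumption to verify on your own machine.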

When I run the GPU application Relion on either of these servers, the application and the server crash with the following kernel panics.

One of the servers:
[ 1017.193429] BUG: unable to handle kernel NULL pointer dereference at 0000000000000002
[ 1017.201319] IP: [] _nv024943rm+0x21/0x90 [nvidia]
[ 1017.207800] PGD 0
[ 1017.209839] Oops: 0000 [#1] SMP
[ 1017.213106] Modules linked in: nvidia_uvm(POE) 8021q garp mrp xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ipt_REJECT nf_reject_ipv4 tun bridge stp llc ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter rpcsec_gss_krb5 nfsv4 dns_resolver nfsv3 nfs fscache rpcrdma ib_isert iscsi_target_mod ib_iser libiscsi scsi_transport_iscsi ib_srpt target_core_mod ib_srp scsi_transport_srp scsi_tgt ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm mlx4_ib ib_core intel_powerclamp coretemp intel_rapl iosf_mbi kvm_intel kvm irqbypass crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd snd_hda_codec_hdmi iTCO_wdt iTCO_vendor_support snd_hda_intel
[ 1017.285005] snd_hda_codec snd_hda_core snd_hwdep snd_seq snd_seq_device snd_pcm snd_timer snd soundcore sb_edac mei_me pcspkr i2c_i801 edac_core sg lpc_ich mei shpchp ipmi_devintf ipmi_si ipmi_msghandler acpi_power_meter acpi_pad nfsd auth_rpcgss nfs_acl lockd grace sunrpc ip_tables xfs libcrc32c mlx4_en sd_mod crc_t10dif crct10dif_generic nvidia_drm(POE) nvidia_modeset(POE) nvidia(POE) ast drm_kms_helper syscopyarea sysfillrect sysimgblt ttm fb_sys_fops igb mxm_wmi ptp crct10dif_pclmul ahci crct10dif_common pps_core drm mlx4_core crc32c_intel libahci dca i2c_algo_bit libata i2c_core devlink fjes wmi dm_mirror dm_region_hash dm_log dm_mod
[ 1017.341506] CPU: 10 PID: 4301 Comm: relion_refine_m Tainted: P OE ------------ 3.10.0-514.26.2.el7.x86_64 #1
[ 1017.352479] Hardware name: Supermicro SYS-4028GR-TR/X10DRG-O+-CPU, BIOS 2.0b 04/19/2017
[ 1017.360496] task: ffff882fb35f3ec0 ti: ffff882fa6a7c000 task.ti: ffff882fa6a7c000
[ 1017.367992] RIP: 0010:[] [] _nv024943rm+0x21/0x90 [nvidia]
[ 1017.376832] RSP: 0018:ffff882fa6a7f9e8 EFLAGS: 00010202
[ 1017.382153] RAX: ffff882fbee67e30 RBX: ffff882faf578430 RCX: 0000000000000000
[ 1017.389295] RDX: 0000000000000002 RSI: 0000000000897bf7 RDI: ffff882fbee67e48
[ 1017.396438] RBP: ffff882fae252df8 R08: ffff882fbb7a3000 R09: ffff882fae252ec0
[ 1017.403578] R10: 00000000bb7a2a01 R11: ffffea00beede800 R12: ffff882faf578430
[ 1017.410721] R13: ffff882fae252ec0 R14: ffff882fbd695910 R15: ffff882fbd695810
[ 1017.417861] FS: 00002b1491d54700(0000) GS:ffff882fbfc80000(0000) knlGS:0000000000000000
[ 1017.425966] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1017.431719] CR2: 0000000000000002 CR3: 00000000019be000 CR4: 00000000003407e0
[ 1017.438862] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 1017.446004] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 1017.453152] Stack:
[ 1017.455172] ffff882fbd695810 ffffffffa0c63eb0 ffff882faf578430 ffff882fbd695810
[ 1017.462644] ffff882fae252ec0 ffff882fbd695910 ffff882fbc772128 ffffffffa0c637b3
[ 1017.470106] 0000000000000000 00000000c1d00094 ffff882fbd695810 ffff882fbc772128
[ 1017.477568] Call Trace:
[ 1017.480135] [] ? _nv010627rm+0x250/0x2d0 [nvidia]
[ 1017.486603] [] ? _nv010624rm+0x93/0xc0 [nvidia]
[ 1017.493784] [] ? _nv010605rm+0x122/0x200 [nvidia]
[ 1017.501137] [] ? _nv007711rm+0x41/0xd0 [nvidia]
[ 1017.508289] [] ? _nv032610rm+0x69/0xb0 [nvidia]
[ 1017.515629] [] ? _nv007589rm+0x34/0x60 [nvidia]
[ 1017.522947] [] ? _nv007588rm+0x1f7/0x280 [nvidia]
[ 1017.530421] [] ? _nv001153rm+0x62/0xc0 [nvidia]
[ 1017.537723] [] ? rm_free_unused_clients+0xc1/0xf0 [nvidia]
[ 1017.545960] [] ? nvidia_close+0x222/0x3b0 [nvidia]
[ 1017.553495] [] ? nvidia_frontend_close+0x2c/0x50 [nvidia]
[ 1017.561572] [] ? __fput+0xe9/0x260
[ 1017.567636] [] ? ____fput+0xe/0x10
[ 1017.573686] [] ? task_work_run+0xc4/0xe0
[ 1017.580248] [] ? do_exit+0x2d8/0xa40
[ 1017.586461] [] ? poll_select_copy_remaining+0x150/0x150
[ 1017.594304] [] ? do_group_exit+0x3f/0xa0
[ 1017.600832] [] ? get_signal_to_deliver+0x1d0/0x6d0
[ 1017.608208] [] ? do_signal+0x57/0x6c0
[ 1017.614421] [] ? do_notify_resume+0x5f/0xb0
[ 1017.621130] [] ? int_signal+0x12/0x17
[ 1017.627297] Code: c0 eb ac 0f 1f 80 00 00 00 00 48 83 ec 08 48 85 ff 74 53 48 8b 17 31 c9 48 85 d2 75 0e eb 6b 48 89 d1 48 8b 52 10 48 85 d2 74 11 <48> 39 32 77 ef 74 25 48 8b 52 18 48 85 d2 90 75 ef 48 85 c9 74
[ 1017.648939] RIP [] _nv024943rm+0x21/0x90 [nvidia]
[ 1017.656263] RSP
[ 1017.660561] CR2: 0000000000000002

The other server:
[76862.383664] BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
[76862.383806] IP: [] _nv009833rm+0x48/0xc0 [nvidia]
[76862.383808] PGD 5f16b23067 PUD 5f16b22067 PMD 0
[76862.383809] Oops: 0000 [#1] SMP
[76862.383829] Modules linked in: nvidia_uvm(POE) 8021q garp mrp xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ipt_REJECT nf_reject_ipv4 tun bridge stp llc ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter rpcsec_gss_krb5 nfsv4 dns_resolver nfsv3 nfs fscache rpcrdma ib_isert iscsi_target_mod ib_iser libiscsi scsi_transport_iscsi ib_srpt target_core_mod ib_srp scsi_transport_srp scsi_tgt ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm mlx4_ib ib_core intel_powerclamp coretemp intel_rapl iosf_mbi kvm_intel kvm irqbypass crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd snd_hda_codec_hdmi iTCO_wdt iTCO_vendor_support snd_hda_intel
[76862.383846] snd_hda_codec snd_hda_core snd_hwdep snd_seq snd_seq_device snd_pcm snd_timer snd mei_me sb_edac soundcore sg edac_core pcspkr i2c_i801 lpc_ich mei shpchp ipmi_devintf ipmi_si ipmi_msghandler acpi_power_meter acpi_pad nfsd auth_rpcgss nfs_acl lockd grace sunrpc ip_tables xfs libcrc32c mlx4_en sd_mod crc_t10dif crct10dif_generic nvidia_drm(POE) nvidia_modeset(POE) nvidia(POE) igb ast crct10dif_pclmul ptp drm_kms_helper mlx4_core crct10dif_common syscopyarea pps_core mxm_wmi crc32c_intel sysfillrect ttm sysimgblt dca ahci fb_sys_fops devlink i2c_algo_bit libahci drm libata i2c_core fjes wmi dm_mirror dm_region_hash dm_log dm_mod
[76862.383849] CPU: 21 PID: 12235 Comm: nvidia-smi Tainted: P OE ------------ 3.10.0-514.26.2.el7.x86_64 #1
[76862.383850] Hardware name: Supermicro SYS-4028GR-TR/X10DRG-O+-CPU, BIOS 2.0b 04/19/2017
[76862.383850] task: ffff882fb9338000 ti: ffff885f25734000 task.ti: ffff885f25734000
[76862.383922] RIP: 0010:[] [] _nv009833rm+0x48/0xc0 [nvidia]
[76862.383922] RSP: 0018:ffff885f25737970 EFLAGS: 00010292
[76862.383923] RAX: 0000000000000000 RBX: ffff882fa3010008 RCX: ffff885f1bbfdf28
[76862.383923] RDX: ffff885f1bbfdf14 RSI: 000000006e76726d RDI: 0000000000000000
[76862.383924] RBP: ffff885f1bbfdf08 R08: ffff885f1bbfdf18 R09: 0000000000001000
[76862.383924] R10: 0000000000000000 R11: ffffffffa12b1a90 R12: ffff882fa3010008
[76862.383925] R13: 0000000000000000 R14: ffff885f1bbfdf28 R15: ffff885ee9398c08
[76862.383926] FS: 00002b49604e1280(0000) GS:ffff885fbf440000(0000) knlGS:0000000000000000
[76862.383926] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[76862.383927] CR2: 0000000000000008 CR3: 0000005f16b20000 CR4: 00000000003407e0
[76862.383927] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[76862.383928] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[76862.383928] Stack:
[76862.383929] 0000000000001000 0000000000000000 0000000000000000 0000000000000000
[76862.383930] ffff882fa3010008 ffff885ee9f90008 ffff885ee9398c08 ffff882f1b4025f0
[76862.383931] 0000000000000001 ffffffffa12adebe ffff885ee9f90008 ffff882f1b402008
[76862.383931] Call Trace:
[76862.384062] [] ? _nv018821rm+0x7e/0x1b0 [nvidia]
[76862.384191] [] ? _nv018817rm+0x20/0x50 [nvidia]
[76862.384319] [] ? _nv020133rm+0x1b0/0x730 [nvidia]
[76862.384426] [] ? _nv020392rm+0x6e/0xd0 [nvidia]
[76862.384509] [] ? _nv001185rm+0x2d0/0x380 [nvidia]
[76862.384593] [] ? _nv001180rm+0x35b/0x6e0 [nvidia]
[76862.384679] [] ? rm_init_adapter+0x128/0x130 [nvidia]
[76862.384683] [] ? try_to_wake_up+0x2c0/0x340
[76862.384751] [] ? nv_open_device+0x200/0x770 [nvidia]
[76862.384754] [] ? kmem_cache_alloc+0x193/0x1e0
[76862.384821] [] ? nvidia_open+0x14c/0x300 [nvidia]
[76862.384890] [] ? nvidia_frontend_open+0x52/0xb0 [nvidia]
[76862.384893] [] ? chrdev_open+0xa1/0x1e0
[76862.384895] [] ? do_dentry_open+0x1a7/0x2e0
[76862.384898] [] ? security_inode_permission+0x1c/0x30
[76862.384899] [] ? cdev_put+0x30/0x30
[76862.384900] [] ? vfs_open+0x5f/0xe0
[76862.384902] [] ? may_open+0x68/0x110
[76862.384903] [] ? do_last+0x1ed/0x12a0
[76862.384906] [] ? release_pages+0x24e/0x430
[76862.384907] [] ? path_openat+0xc2/0x490
[76862.384908] [] ? user_path_at_empty+0x72/0xc0
[76862.384909] [] ? do_filp_open+0x4b/0xb0
[76862.384911] [] ? __alloc_fd+0xa7/0x130
[76862.384913] [] ? do_sys_open+0xf3/0x1f0
[76862.384914] [] ? SyS_open+0x1e/0x20
[76862.384917] [] ? system_call_fastpath+0x16/0x1b
[76862.384928] Code: 10 48 8d 55 0c 44 89 4d 0c 4c 8b ae 28 07 00 00 4c 8d 45 10 44 8b 8e 80 07 00 00 6a 00 be 6d 72 76 6e 6a 00 4c 89 ef 6a 00 41 51 <41> ff 55 08 48 83 c4 20 85 c0 89 c3 75 26 4d 8b 47 10 45 31 c9
[76862.385000] RIP [] _nv009833rm+0x48/0xc0 [nvidia]
[76862.385000] RSP
[76862.385000] CR2: 0000000000000008

However, the same application runs fine on GTX 1080 cards with CUDA 8.0 and driver 375.39.

I have attached the output logs from nvidia-bug-report.sh.

I would appreciate any help with the above issues.
hpc02-nvidia-bug-report.log.gz (683 KB)
hpc01-nvidia-bug-report.log.gz (630 KB)

I have also experienced a similar crash with driver version 384.59. I use Kaldi (a speech recognition toolkit), which uses CUDA 8 for GPU computation.

I use the same graphics card (GTX 1080 Ti), and the crash also occurs on a Tesla K40m.

With an older driver version there is no problem, so driver version 384.59 is probably buggy.

That’s really good to know.
May I know which older driver version you used, so that I can try the same version and verify?

I have tried driver 375.74 and have had no problems so far.

Please make sure the backtraces you all observed are the same, to confirm that this is exactly the same issue. I see you have multiple GPUs. Does the issue reproduce with only one GPU installed in the server? Is any display connected to a GPU, and if so, how? Are you running Linux in graphical mode, and with which desktop environment (GNOME, KDE, or something else)? Does the issue reproduce without X or a graphical desktop running?

Where can I get Relion, and how do I install and run it? How long does it take to trigger this issue?
Can you attach the crash dump, backtrace, and dmesg output generated at the time of the kernel panic? I think they should be under /var/crash/. Also, does reverting to an earlier driver resolve this issue?
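
To make comparing backtraces less error-prone, here is a rough Python sketch that pulls the identifying fields out of an oops like the ones posted in this thread. The function name and the regular expressions are my own assumptions based only on the log lines shown here; other kernels may print these lines differently.

```python
import re


def summarize_oops(log_text: str) -> dict:
    """Extract fields that identify a kernel oops so two reports can be compared.

    Any field whose line is not found in the text is reported as None.
    """
    patterns = {
        # "BUG: unable to handle kernel NULL pointer dereference at <addr>"
        "fault_address": r"NULL pointer dereference at ([0-9a-f]+)",
        # "IP: [<addr>] symbol+0x..." or "IP: symbol+0x..." (not "RIP:")
        "faulting_symbol": r"\bIP: (?:\[[^\]]*\] )?(\w+)\+0x",
        # "Comm: <process name>"
        "command": r"Comm: (\S+)",
        # Kernel release printed after the taint flags, before " #1"
        "kernel": r"Tainted: .*?(\d+\.\d+\.\d+\S*) #\d+",
    }
    summary = {}
    for key, pattern in patterns.items():
        match = re.search(pattern, log_text)
        summary[key] = match.group(1) if match else None
    return summary


def same_faulting_symbol(first: dict, second: dict) -> bool:
    """Two oopses are only candidates for 'the same bug' if they fault in the same symbol."""
    return (
        first["faulting_symbol"] is not None
        and first["faulting_symbol"] == second["faulting_symbol"]
    )
```

Run against the two oopses in the original post, this reports _nv024943rm (process relion_refine_m) for the first server and _nv009833rm (process nvidia-smi) for the second, so those two already differ from each other. Matching symbols are a hint only, not proof of a common root cause.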

Confirming from my side, I have the same issue (Fedora 26, kernel 4.12.8, driver 384.69). It is triggered by attempting to launch GDM-Wayland. With Wayland disabled, X sessions work normally. Wayland seemed to work before.

Sep 01 17:22:25 lisolet kernel: BUG: unable to handle kernel NULL pointer dereference at 0000000000000088
Sep 01 17:22:25 lisolet kernel: IP: drm_atomic_helper_disable_plane+0x49/0xa0 [drm_kms_helper]
Sep 01 17:22:25 lisolet kernel: PGD 7efbb5067
Sep 01 17:22:25 lisolet kernel: P4D 7efbb5067
Sep 01 17:22:25 lisolet kernel: PUD 0
Sep 01 17:22:25 lisolet kernel:
Sep 01 17:22:25 lisolet kernel: Oops: 0000 [#1] SMP
Sep 01 17:22:25 lisolet kernel: Modules linked in: xt_CHECKSUM ipt_MASQUERADE nf_nat_masquerade_ipv4 tun nf_conntrack_netbios_ns nf_conntrack_broadcast xt_CT ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 xt_conntrack ip_set nfnetlink ebtable_nat ebtable_broute bridg
Sep 01 17:22:25 lisolet kernel: mac80211 snd_hda_codec snd_hda_core snd_hwdep irqbypass snd_seq crct10dif_pclmul snd_seq_device crc32_pclmul snd_pcm eeepc_wmi ghash_clmulni_intel asus_wmi iTCO_wdt sparse_keymap intel_cstate iTCO_vendor_support snd_timer hci_ua
Sep 01 17:22:25 lisolet kernel: CPU: 0 PID: 1142 Comm: gnome-shell Tainted: P OE 4.12.9-300.fc26.x86_64 #1
Sep 01 17:22:25 lisolet kernel: Hardware name: System manufacturer System Product Name/Z170 PRO GAMING/AURA, BIOS 2003 09/19/2016
Sep 01 17:22:25 lisolet kernel: task: ffff9bc7788d2640 task.stack: ffffaeab49fb8000
Sep 01 17:22:25 lisolet kernel: RIP: 0010:drm_atomic_helper_disable_plane+0x49/0xa0 [drm_kms_helper]
Sep 01 17:22:25 lisolet kernel: RSP: 0018:ffffaeab49fbbb98 EFLAGS: 00010282
Sep 01 17:22:25 lisolet kernel: RAX: ffff9bc782a35980 RBX: ffff9bc781376400 RCX: 000000000000001e
Sep 01 17:22:25 lisolet kernel: RDX: 0000000000000000 RSI: 0000000000000010 RDI: ffffffffc03f4895
Sep 01 17:22:25 lisolet kernel: RBP: ffffaeab49fbbbb0 R08: ffff9bc792ac9320 R09: ffff9bc778a55000
Sep 01 17:22:25 lisolet kernel: R10: ffffaeab49fbbac8 R11: 000000000001e548 R12: ffff9bc76d4df408
Sep 01 17:22:25 lisolet kernel: R13: ffffaeab49fbbd08 R14: ffffaeab49fbbd08 R15: ffff9bc78e9d4c00
Sep 01 17:22:25 lisolet kernel: FS: 00007fabe7c58ac0(0000) GS:ffff9bc7b6c00000(0000) knlGS:0000000000000000
Sep 01 17:22:25 lisolet kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Sep 01 17:22:25 lisolet kernel: CR2: 0000000000000088 CR3: 00000007f893a000 CR4: 00000000003406f0
Sep 01 17:22:25 lisolet kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Sep 01 17:22:25 lisolet kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Sep 01 17:22:25 lisolet kernel: Call Trace:
Sep 01 17:22:25 lisolet kernel: __setplane_internal+0x4f/0x260 [drm]
Sep 01 17:22:25 lisolet kernel: ? __enqueue_entity+0x6c/0x70
Sep 01 17:22:25 lisolet kernel: drm_mode_cursor_universal+0xf7/0x1d0 [drm]
Sep 01 17:22:25 lisolet kernel: drm_mode_cursor_common+0x177/0x1e0 [drm]
Sep 01 17:22:25 lisolet kernel: drm_mode_cursor2_ioctl+0xe/0x10 [drm]
Sep 01 17:22:25 lisolet kernel: drm_ioctl+0x213/0x4d0 [drm]
Sep 01 17:22:25 lisolet kernel: ? drm_mode_cursor_ioctl+0x60/0x60 [drm]
Sep 01 17:22:25 lisolet kernel: ? pick_next_task_fair+0x486/0x550
Sep 01 17:22:25 lisolet kernel: ? __switch_to+0x225/0x450
Sep 01 17:22:25 lisolet kernel: do_vfs_ioctl+0xa5/0x600
Sep 01 17:22:25 lisolet kernel: ? __schedule+0x23e/0x860
Sep 01 17:22:25 lisolet kernel: ? SyS_futex+0x13b/0x180
Sep 01 17:22:25 lisolet kernel: SyS_ioctl+0x79/0x90
Sep 01 17:22:25 lisolet kernel: entry_SYSCALL_64_fastpath+0x1a/0xa5
Sep 01 17:22:25 lisolet kernel: RIP: 0033:0x7fabdd3215e7
Sep 01 17:22:25 lisolet kernel: RSP: 002b:00007ffc12fc7328 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
Sep 01 17:22:25 lisolet kernel: RAX: ffffffffffffffda RBX: 00007fabb80090f0 RCX: 00007fabdd3215e7
Sep 01 17:22:25 lisolet kernel: RDX: 00007ffc12fc7360 RSI: 00000000c02464bb RDI: 0000000000000008
Sep 01 17:22:25 lisolet kernel: RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
Sep 01 17:22:25 lisolet kernel: R10: ffffffffffffffa0 R11: 0000000000000246 R12: 0000000000000000
Sep 01 17:22:25 lisolet kernel: R13: 0000562cf886aea0 R14: 0000000000000000 R15: 0000000000000001
Sep 01 17:22:25 lisolet kernel: Code: c0 74 75 4c 89 68 38 48 89 de 48 89 c7 49 89 c4 e8 dd ad e9 ff 48 3d 00 f0 ff ff 77 4d 48 83 78 08 00 74 10 48 8b 93 90 00 00 00 <48> 39 9a 88 00 00 00 74 3a 48 89 df 48 89 c6 e8 43 ff ff ff 85
Sep 01 17:22:25 lisolet kernel: RIP: drm_atomic_helper_disable_plane+0x49/0xa0 [drm_kms_helper] RSP: ffffaeab49fbbb98
Sep 01 17:22:25 lisolet kernel: CR2: 0000000000000088
Sep 01 17:22:25 lisolet kernel: ---[ end trace 07f63b212b44379b ]---

Please see: https://github.com/NVIDIA/egl-wayland/issues/6#issuecomment-325852124
Related to: https://bugs.archlinux.org/task/54980

I have had a similar issue which hangs my system randomly.

  • It occurs with the GTX 1080, but not GTX 980. (These are not “Ti” models.)
  • It occurs with both CUDA 8 and CUDA 9.
  • One of two things happens from time to time: either an MCE or a “GPU has fallen off the bus.”
  • In either case the system is completely hung and I have to cycle power.

My system is in fact the same model as Lohit’s in the original post: a Supermicro SYS-4028GR-TR, equipped with 8 GPUs.

More details:

  • We have four of these servers, and the hangs occur on every one of them.
  • The MCEs occur on different RAM chips every time. So it’s probably not bad RAM.
  • The RAM is underclocked quite a bit.
  • The hangs occur whether or not the system is under load.
  • OS is CentOS 7.3.

The difficulty is that the hangs are so random, with sometimes a week or more passing without a problem.

Here is an example “GPU has fallen off the bus”:

[539165.661708] NVRM: GPU at PCI:0000:86:00: GPU-535512ca-fe64-815e-4890-1dbae65dac5b
[539165.662802] NVRM: GPU Board Serial Number:
[539165.663671] NVRM: Xid (PCI:0000:86:00): 79, GPU has fallen off the bus.
[539165.663671]
[539165.665406] NVRM: GPU at 0000:86:00.0 has fallen off the bus.
[539165.666295] NVRM: GPU is on Board .
[539165.667189] NVRM: A GPU crash dump has been created. If possible, please run
[539165.667189] NVRM: nvidia-bug-report.sh as root to collect this data before
[539165.667189] NVRM: the NVIDIA kernel module is unloaded.
[539187.629529] BUG: unable to handle kernel NULL pointer dereference at 0000000000000190
[539187.630119] IP: [] _nv015669rm+0x1c6/0x2b0 [nvidia]
[539187.630836] PGD 0
[539187.631383] Oops: 0000 [#1] SMP
[539187.631931] Modules linked in: cyvrtop(POE) netconsole xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ipt_REJECT nf_reject_ipv4 tun bridge nvidia_uvm(POE) ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter arc4 md4 nls_utf8 cifs dns_resolver intel_powerclamp snd_hda_codec_hdmi coretemp intel_rapl snd_hda_intel snd_hda_codec iosf_mbi snd_hda_core kvm_intel kvm snd_hwdep snd_seq snd_seq_device ses snd_pcm iTCO_wdt sb_edac snd_timer irqbypass enclosure iTCO_vendor_support ipmi_devintf edac_core snd i2c_i801 scsi_transport_sas ipmi_si pcspkr sg soundcore ipmi_msghandler lpc_ich mei_me mei ioatdma shpchp acpi_power_meter acpi_pad nfsd auth_rpcgss nfs_acl lockd grace ip_tables ext4 mbcache jbd2 sd_mod crc_t10dif crct10dif_generic 8021q garp stp llc mrp mxm_wmi nvidia_drm(POE) nvidia_modeset(POE) crct10dif_pclmul crct10dif_common crc32_pclmul crc32c_intel ghash_clmulni_intel nvidia(POE) aesni_intel lrw gf128mul glue_helper ablk_helper cryptd igb ast dca ptp megaraid_sas ttm pps_core drm_kms_helper i2c_algo_bit ahci syscopyarea sysfillrect libahci sysimgblt fb_sys_fops libata drm i2c_core scsi_transport_iscsi fjes wmi sunrpc dm_mirror dm_region_hash dm_log dm_mod [last unloaded: cyvrtop]
[539187.640382] CPU: 5 PID: 25278 Comm: gpumaind Tainted: P OE ------------ 3.10.0-514.el7.x86_64 #1
[539187.641293] Hardware name: Supermicro SYS-4028GR-TR/X10DRG-O+-CPU, BIOS 2.0 12/28/2015
[539187.642221] task: ffff880ed7a45e20 ti: ffff880edb57c000 task.ti: ffff880edb57c000
[539187.643162] RIP: 0010:[] [] _nv015669rm+0x1c6/0x2b0 [nvidia]
[539187.644259] RSP: 0018:ffff880edb57f9f8 EFLAGS: 00010246
[539187.645228] RAX: 0000000000000000 RBX: ffff880653b4dea0 RCX: 00000001fec0afff
[539187.646216] RDX: 00000001fec0a000 RSI: 0000000000000000 RDI: ffff880ec4754008
[539187.647209] RBP: ffff880653b4de68 R08: 0000000000000000 R09: 0000000000000001
[539187.648209] R10: 0000000002020008 R11: ffffffffa1396b70 R12: ffff880ec4754008
[539187.649219] R13: 0000000000000001 R14: 00000001fec0a000 R15: 0000000000001000
[539187.650236] FS: 00007f278c8d1940(0000) GS:ffff88085ff40000(0000) knlGS:0000000000000000
[539187.651267] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[539187.652306] CR2: 0000000000000190 CR3: 00000000019ba000 CR4: 00000000001407e0
[539187.653358] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[539187.654267] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[539187.655120] Stack:
[539187.655969] 0000000000000000 00000000001fec0a ffff880dd56e1008 ffff880653b4dff8
[539187.656853] 0000000000000000 ffffffffa10f0d30 ffff880dd56e1008 00000000001fec0a
[539187.657742] 0000000000000000 ffff880653b4dff8 ffff880d84b53a08 ffffffffa138c3ed
[539187.658635] Call Trace:
[539187.659622] [] ? _nv010109rm+0xb0/0x270 [nvidia]
[539187.660592] [] ? _nv016656rm+0x60d/0x650 [nvidia]
[539187.661564] [] ? _nv016706rm+0x20/0xc0 [nvidia]
[539187.662520] [] ? rm_gpu_ops_stop_channel+0x120/0x140 [nvidia]
[539187.663456] [] ? nvUvmInterfaceStopChannel+0x33/0x50 [nvidia]
[539187.664365] [] ? uvm_user_channel_stop+0x33/0x40 [nvidia_uvm]
[539187.665269] [] ? uvm_va_space_stop_all_user_channels+0x78/0xb0 [nvidia_uvm]
[539187.666178] [] ? uvm_va_space_destroy+0x74/0x3b0 [nvidia_uvm]
[539187.667083] [] ? uvm_release+0x11/0x20 [nvidia_uvm]
[539187.667982] [] ? __fput+0xe9/0x260
[539187.668872] [] ? ____fput+0xe/0x10
[539187.669756] [] ? task_work_run+0xc4/0xe0
[539187.670634] [] ? do_exit+0x2d8/0xa40
[539187.671501] [] ? drop_futex_key_refs.isra.13+0x35/0x70
[539187.672364] [] ? futex_wait+0x11d/0x280
[539187.673220] [] ? do_group_exit+0x3f/0xa0
[539187.674078] [] ? get_signal_to_deliver+0x1d0/0x6d0
[539187.674924] [] ? do_signal+0x57/0x6c0
[539187.675745] [] ? do_notify_resume+0x5f/0xb0
[539187.676545] [] ? int_signal+0x12/0x17
[539187.677320] Code: 0e 00 00 00 0f 84 d4 fe ff ff 48 8b 83 80 00 00 00 45 31 c0 48 85 c0 0f 85 c2 00 00 00 4c 89 f2 4b 8d 4c 3e ff 4c 89 c6 4c 89 e7 <41> ff 90 90 01 00 00 84 c0 8b 43 08 0f 94 c2 a9 00 00 00 01 0f
[539187.678987] RIP [] _nv015669rm+0x1c6/0x2b0 [nvidia]
[539187.679877] RSP
[539187.680611] CR2: 0000000000000190
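
Since these hangs can be a week or more apart, it may help to scan saved dmesg or netconsole output for Xid lines automatically. Below is a minimal Python sketch. The regular expression is an assumption based on the single Xid line in the log above, and the function name is made up for illustration.

```python
import re

# Matches lines such as:
#   NVRM: Xid (PCI:0000:86:00): 79, GPU has fallen off the bus.
XID_LINE = re.compile(r"NVRM: Xid \(PCI:([0-9a-fA-F:.]+)\): (\d+)(?:, (.*))?")


def find_xid_events(log_text: str) -> list:
    """Return one dict per Xid line with the PCI address, Xid number, and message."""
    events = []
    for line in log_text.splitlines():
        match = XID_LINE.search(line)
        if match:
            events.append(
                {
                    "pci": match.group(1),
                    "xid": int(match.group(2)),
                    "message": (match.group(3) or "").strip(),
                }
            )
    return events
```

Collecting the PCI addresses over several hangs would show whether the same slot keeps dropping off the bus or whether it moves between GPUs.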

And here is an example MCE event. These are in fact more common than the “GPU has fallen off the bus”:

[146465.290615] mce: [Hardware Error]: CPU 12: Machine Check Exception: 5 Bank 17: be200000000c110a
[146465.293218] mce: [Hardware Error]: RIP !INEXACT! 10: 2017-12-03 03:13:19{_raw_spin_lock_irqsave+0x47/0x60}2017-12-03 03:13:19
[146465.295861] mce: [Hardware Error]: TSC 652abd608a510 2017-12-03 03:13:19ADDR 80d00000 2017-12-03 03:13:19MISC d4ffa81603500086 2017-12-03 03:13:19
[146465.298512] mce: [Hardware Error]: PROCESSOR 0:306f2 TIME 1512299597 SOCKET 0 APIC 1 microcode 39
[146465.301181] mce: [Hardware Error]: Run the above through 'mcelog --ascii'
[146465.345658] mce: [Hardware Error]: Some CPUs didn't answer in synchronization
[146465.348307] mce: [Hardware Error]: Machine check: Processor context corrupt
[146465.350999] Kernel panic - not syncing: Fatal machine check on current CPU
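
As a quick sanity check on the status value in the first MCE line (be200000000c110a), here is a small Python sketch that splits it into the architectural flag bits of the Intel IA32_MCi_STATUS register. It only covers the generic bits (63 down to 57) and the two 16-bit code fields; the function name is made up, and interpreting the codes themselves still needs mcelog or the Intel manuals.

```python
def decode_mci_status(status: int) -> dict:
    """Split an IA32_MCi_STATUS value into its architectural flags and code fields."""
    flag_bits = {
        "VAL": 63,    # register contents are valid
        "OVER": 62,   # an earlier error was overwritten
        "UC": 61,     # uncorrected error
        "EN": 60,     # error reporting was enabled
        "MISCV": 59,  # MISC register holds valid data
        "ADDRV": 58,  # ADDR register holds a valid address
        "PCC": 57,    # processor context corrupt
    }
    decoded = {name: bool((status >> bit) & 1) for name, bit in flag_bits.items()}
    decoded["mcacod"] = status & 0xFFFF          # architectural error code, bits 15:0
    decoded["mscod"] = (status >> 16) & 0xFFFF   # model-specific error code, bits 31:16
    return decoded
```

For be200000000c110a this gives VAL, UC, EN, MISCV, ADDRV and PCC set with OVER clear, an architectural code of 0x110a and a model-specific code of 0x000c. The PCC bit agrees with the "Processor context corrupt" line in the log.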