Unable to handle kernel NULL pointer dereference at 0000000000000f28

Hi,

I have a Xubuntu 16.04 based machine with dual 1080 Ti cards, and I’m using driver 410.78 from Ubuntu’s graphics-drivers PPA. From time to time I get a kernel oops on a NULL pointer dereference, with a stack trace that points at the nvidia module:

Mar 10 17:48:04 scbox24 kernel: [258870.511446] BUG: unable to handle kernel NULL pointer dereference at 0000000000000f28
Mar 10 17:48:04 scbox24 kernel: [258870.512031] IP: _nv026973rm+0x1f/0x40 [nvidia]
Mar 10 17:48:04 scbox24 kernel: [258870.512403] PGD 0 P4D 0  
Mar 10 17:48:04 scbox24 kernel: [258870.512766] Oops: 0000 [#1] SMP PTI
Mar 10 17:48:04 scbox24 kernel: [258870.513119] Modules linked in: veth xt_nat ipt_MASQUERADE nf_nat_masquerade_ipv4 nf_conntrack_netlink nfnetlink xfrm_user xfrm_algo iptable_nat nf_nat_ipv4 br_netfilter bridge stp llc wireguard(
Mar 10 17:48:04 scbox24 kernel: [258870.516831]  irqbypass videodev btusb snd_seq btrtl media intel_cstate snd_pcm btbcm cfg80211 snd_seq_device btintel nf_conntrack_ipv4 snd_timer bluetooth nf_defrag_ipv4 intel_rapl_perf eeepc_wm
Mar 10 17:48:04 scbox24 kernel: [258870.521278]  usbhid hid nvidia_drm(POE) nvidia_modeset(POE) crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc nvidia(POE) aesni_intel aes_x86_64 crypto_simd glue_helper cryptd drm_kms_helpe
Mar 10 17:48:04 scbox24 kernel: [258870.523474] CPU: 11 PID: 9119 Comm: python Tainted: P         C OE    4.15.0-46-generic #49~16.04.1-Ubuntu
Mar 10 17:48:04 scbox24 kernel: [258870.524585] Hardware name: System manufacturer System Product Name/ROG STRIX X299-E GAMING, BIOS 1602 11/02/2018
Mar 10 17:48:04 scbox24 kernel: [258870.525877] RIP: 0010:_nv026973rm+0x1f/0x40 [nvidia]
Mar 10 17:48:04 scbox24 kernel: [258870.526446] RSP: 0018:ffffaa70cfa079a0 EFLAGS: 00010286
Mar 10 17:48:04 scbox24 kernel: [258870.527038] RAX: ffff9b2557d7a008 RBX: 0000000000000000 RCX: 0000000000000000
Mar 10 17:48:04 scbox24 kernel: [258870.527638] RDX: 0000000000000100 RSI: 00000000000090f1 RDI: ffff9b24e507c008
Mar 10 17:48:04 scbox24 kernel: [258870.528236] RBP: ffff9b213ea75b70 R08: 0000000000000000 R09: ffff9b213ea75bd0
Mar 10 17:48:04 scbox24 kernel: [258870.528839] R10: 0000000000000000 R11: ffffffffc091ebb0 R12: ffff9b24e507c008
Mar 10 17:48:04 scbox24 kernel: [258870.529453] R13: ffff9b213ea75bf0 R14: ffff9b236c4f7010 R15: ffff9b2557d7a008
Mar 10 17:48:04 scbox24 kernel: [258870.530056] FS:  00007fd9bdb3f700(0000) GS:ffff9b255f6c0000(0000) knlGS:0000000000000000
Mar 10 17:48:04 scbox24 kernel: [258870.530663] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Mar 10 17:48:04 scbox24 kernel: [258870.531298] CR2: 0000000000000f28 CR3: 0000000c82e0a005 CR4: 00000000003606e0
Mar 10 17:48:04 scbox24 kernel: [258870.531915] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Mar 10 17:48:04 scbox24 kernel: [258870.532519] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Mar 10 17:48:04 scbox24 kernel: [258870.533109] Call Trace: 
Mar 10 17:48:04 scbox24 kernel: [258870.533788]  ? _nv000093rm+0x432/0x7d0 [nvidia]
Mar 10 17:48:04 scbox24 kernel: [258870.534457]  ? _nv027022rm+0x247/0x610 [nvidia]
Mar 10 17:48:04 scbox24 kernel: [258870.535137]  ? _nv003652rm+0xd/0x20 [nvidia]
Mar 10 17:48:04 scbox24 kernel: [258870.535791]  ? _nv004258rm+0x15/0x80 [nvidia]
Mar 10 17:48:04 scbox24 kernel: [258870.536419]  ? _nv012022rm+0x194/0x290 [nvidia]
Mar 10 17:48:04 scbox24 kernel: [258870.537029]  ? _nv035079rm+0xf8/0x1a0 [nvidia]
Mar 10 17:48:04 scbox24 kernel: [258870.537631]  ? _nv035078rm+0x1b9/0x2c0 [nvidia]
Mar 10 17:48:04 scbox24 kernel: [258870.538212]  ? _nv033880rm+0x1c/0x30 [nvidia]
Mar 10 17:48:04 scbox24 kernel: [258870.538769]  ? _nv001090rm+0x62/0xc0 [nvidia]
Mar 10 17:48:04 scbox24 kernel: [258870.539337]  ? rm_free_unused_clients+0xc1/0xe0 [nvidia]
Mar 10 17:48:04 scbox24 kernel: [258870.539844]  ? nvidia_close+0x1f3/0x360 [nvidia]
Mar 10 17:48:04 scbox24 kernel: [258870.540338]  ? nvidia_frontend_close+0x2f/0x50 [nvidia]
Mar 10 17:48:04 scbox24 kernel: [258870.540762]  ? __fput+0xea/0x220
Mar 10 17:48:04 scbox24 kernel: [258870.541169]  ? ____fput+0xe/0x10
Mar 10 17:48:04 scbox24 kernel: [258870.541563]  ? task_work_run+0x8a/0xb0
Mar 10 17:48:04 scbox24 kernel: [258870.541944]  ? do_exit+0x2de/0xb50
Mar 10 17:48:04 scbox24 kernel: [258870.542310]  ? do_group_exit+0x43/0xb0
Mar 10 17:48:04 scbox24 kernel: [258870.542664]  ? get_signal+0x296/0x5c0
Mar 10 17:48:04 scbox24 kernel: [258870.543021]  ? do_signal+0x37/0x740
Mar 10 17:48:04 scbox24 kernel: [258870.543358]  ? new_sync_read+0xe2/0x130
Mar 10 17:48:04 scbox24 kernel: [258870.543672]  ? __vfs_read+0x29/0x40
Mar 10 17:48:04 scbox24 kernel: [258870.543975]  ? vfs_read+0x93/0x130
Mar 10 17:48:04 scbox24 kernel: [258870.544281]  ? exit_to_usermode_loop+0x80/0xd0
Mar 10 17:48:04 scbox24 kernel: [258870.544561]  ? do_syscall_64+0xf4/0x130
Mar 10 17:48:04 scbox24 kernel: [258870.544849]  ? entry_SYSCALL_64_after_hwframe+0x3d/0xa2
Mar 10 17:48:04 scbox24 kernel: [258870.545128] Code: 5c 48 21 d0 c3 0f 1f 80 00 00 00 00 41 54 49 89 f4 89 d6 53 48 83 ec 08 49 8b 9c 24 68 1f 00 00 e8 b7 08 00 00 8b 50 78 4c 89 e7 <4c> 8b 9b 28 0f 00 00 48 89 de 48 83 c4 08 5b 
Mar 10 17:48:04 scbox24 kernel: [258870.546172] RIP: _nv026973rm+0x1f/0x40 [nvidia] RSP: ffffaa70cfa079a0
Mar 10 17:48:04 scbox24 kernel: [258870.546481] CR2: 0000000000000f28
Mar 10 17:48:04 scbox24 kernel: [258870.546785] ---[ end trace 3a8f13c0240c393c ]---
Mar 10 17:48:04 scbox24 kernel: [258870.547100] Fixing recursive fault but reboot is needed!
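
As a rough sanity check on the trace above, the faulting instruction bytes can be compared with the register dump. The sketch below is not a general disassembler: it assumes the instruction marked with <..> in the Code: line is a 64-bit `mov` with a 32-bit displacement and no SIB byte, and it only looks at the few lines copied into the `OOPS` string. The helper names (`faulting_displacement`, `register`) are made up for illustration.

```python
import re
from typing import Optional

# Relevant lines copied from the oops above (syslog prefix removed).
OOPS = """\
RAX: ffff9b2557d7a008 RBX: 0000000000000000 RCX: 0000000000000000
CR2: 0000000000000f28 CR3: 0000000c82e0a005 CR4: 00000000003606e0
Code: 5c 48 21 d0 c3 0f 1f 80 00 00 00 00 41 54 49 89 f4 89 d6 53 48 83 ec 08 49 8b 9c 24 68 1f 00 00 e8 b7 08 00 00 8b 50 78 4c 89 e7 <4c> 8b 9b 28 0f 00 00 48 89 de 48 83 c4 08 5b
"""


def faulting_displacement(log: str) -> Optional[int]:
    """Return disp32 if the marked instruction is `mov r64, [reg+disp32]`, else None."""
    # The kernel wraps the first byte of the faulting instruction in <..>.
    m = re.search(r"<([0-9a-f]{2})>((?: [0-9a-f]{2})*)", log)
    if not m:
        return None
    insn = bytes.fromhex(m.group(1) + m.group(2).replace(" ", ""))
    if len(insn) < 7:
        return None
    rex, opcode, modrm = insn[0], insn[1], insn[2]
    is_rex_w = (rex & 0xF8) == 0x48          # REX prefix with W=1 (64-bit operand)
    mod, rm = modrm >> 6, modrm & 0x7
    # Assumption: opcode 0x8b (mov r64, r/m64), mod=10 (disp32), no SIB byte.
    if not (is_rex_w and opcode == 0x8B and mod == 0b10 and rm != 0b100):
        return None
    return int.from_bytes(insn[3:7], "little")


def register(log: str, name: str) -> int:
    """Read a 64-bit register value such as RBX or CR2 from the dump."""
    return int(re.search(rf"\b{name}: ([0-9a-f]{{16}})", log).group(1), 16)


disp = faulting_displacement(OOPS)
rbx = register(OOPS, "RBX")
cr2 = register(OOPS, "CR2")
print(f"RBX={rbx:#x} disp={disp:#x} CR2={cr2:#x} match={rbx + disp == cr2}")
```

For this trace, the marked bytes `4c 8b 9b 28 0f 00 00` decode as a load from `[rbx + 0xf28]`, and RBX is 0, which is consistent with the fault address in CR2. That only says a NULL pointer was used as a structure base inside the driver during `nvidia_close`; it does not say why the pointer was NULL.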

Originally, 410.78 was installed from the official .run installer. I tried upgrading to 410.104, and when that didn’t help I uninstalled it and installed nvidia-410 from ppa:graphics-drivers. The machine still throws this error after about a day, on average, when it is busy running a CUDA-based app.

After this error occurs, the system becomes unusable for the application.

Any ideas?
nvidia-bug-report.log.gz (1.53 MB)
oops.log (60.8 KB)

Please check if this applies:
https://devtalk.nvidia.com/default/topic/1043126/linux/xid-8-in-various-cuda-deep-learning-applications-for-nvidia-gtx-1080-ti/
It shouldn’t crash the driver, though.