Kernel oops every 20 seconds

elFarto · August 30, 2017, 6:42pm

Hi

It seems that after upgrading to the latest driver update from negativo17’s repository, my kernel is oopsing every 20 seconds:

BUG: unable to handle kernel NULL pointer dereference at 0000000000000378
IP: rm_get_device_name+0x9a/0x1b0 [nvidia]
PGD 41abcd067 
P4D 41abcd067 
PUD 0 
Oops: 0000 [#1] SMP
Modules linked in: nvidia_drm(POE+) nvidia_modeset(POE) nvidia(POE) intel_rapl x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm snd_hda_codec_realtek snd_hda_codec_generic irqbypass crct10dif_pclmul snd_hda_intel crc32_pclmul snd_hda_codec eeepc_wmi asus_wmi iTCO_wdt iTCO_vendor_support sparse_keymap rfkill ghash_clmulni_intel drm_kms_helper joydev snd_hda_core snd_hwdep intel_cstate snd_seq intel_uncore snd_seq_device intel_rapl_perf drm i2c_i801 snd_pcm mei_me snd_timer mei snd tpm_infineon soundcore lpc_ich tpm_tis shpchp tpm_tis_core tpm mxm_wmi crc32c_intel r8169 uas hid_pl mii ff_memless usb_storage video wmi
CPU: 3 PID: 877 Comm: cat Tainted: P           OE   4.12.8-300.fc26.x86_64 #1
Hardware name: ASUS All Series/Z87-A, BIOS 2103 08/15/2014
task: ffff9ab6998acb80 task.stack: ffffb64182744000
RIP: 0010:rm_get_device_name+0x9a/0x1b0 [nvidia]
RSP: 0018:ffffb64182747c18 EFLAGS: 00010282
RAX: 0000000000000000 RBX: ffff9ab68f884008 RCX: ffffb64182747c44
RDX: 000000000000002c RSI: 0000000000000000 RDI: ffff9ab68f884008
RBP: ffff9ab69a973000 R08: 0000000000001438 R09: 0000000000000028
R10: ffffb64182747d30 R11: 000000000001e508 R12: 00000000000019da
R13: 0000000000001438 R14: ffff9ab69a970000 R15: ffffb64182747d88
FS:  00007f5a01ab7700(0000) GS:ffff9ab6aecc0000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000378 CR3: 000000041abd0000 CR4: 00000000001406e0
Call Trace:
 ? get_page_from_freelist+0x963/0xb60
 ? kmem_cache_alloc+0x91/0x1b0
 ? nv_procfs_close_unbind_lock+0x171/0x1d0 [nvidia]
 ? nv_procfs_read_gpu_info+0x2f4/0x350 [nvidia]
 ? __kmalloc_node+0x202/0x2c0
 ? kvmalloc_node+0x85/0x90
 ? seq_read+0xc9/0x3f0
 ? proc_reg_read+0x42/0x70
 ? __vfs_read+0x37/0x150
 ? security_file_permission+0x9b/0xc0
 ? vfs_read+0x8e/0x130
 ? SyS_read+0x55/0xc0
 ? do_syscall_64+0x67/0x140
 ? entry_SYSCALL64_slow_path+0x25/0x25
Code: ff f7 ff 48 85 db 74 3f 0f b7 93 3e 0c 00 00 48 8b 83 00 1e 00 00 48 8d 4c 24 2c 48 89 df 48 89 c6 66 89 54 24 1e ba 2c 00 00 00 <ff> 90 78 03 00 00 85 c0 0f 85 c9 00 00 00 8b 44 24 2c 41 89 c5 
RIP: rm_get_device_name+0x9a/0x1b0 [nvidia] RSP: ffffb64182747c18
CR2: 0000000000000378

I can’t even track down what’s causing it to crash. It says PID 877 (or 864 occasionally) and a command of ‘cat’, but I can’t seem to catch what’s starting it. I’ve tried all the kernel versions I’ve got installed (4.12.8, 4.12.5 and 4.11.11) but they all have the same issue.
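
A possible starting point for tracking this down: the trace passes through nv_procfs_read_gpu_info, seq_read and proc_reg_read, which suggests the cat was reading one of the driver's procfs files (assumed here to live under /proc/driver/nvidia). The Python sketch below pulls the PID, command name, faulting symbol and module-tagged frames out of a saved oops. The name summarize_oops and the SAMPLE text are illustrative and not part of any existing tool, and the regular expressions only cover the oops layout shown in this thread.

```python
import re

# Header line, e.g. "CPU: 3 PID: 877 Comm: cat Tainted: ..."
CPU_LINE = re.compile(r"PID:\s*(\d+)\s+Comm:\s*(\S+)")
# Faulting instruction, e.g. "RIP: 0010:rm_get_device_name+0x9a/0x1b0 [nvidia]"
RIP_LINE = re.compile(
    r"RIP:\s*(?:[0-9a-f]{4}:)?(\w+)\+0x[0-9a-f]+/0x[0-9a-f]+(?:\s+\[(\w+)\])?"
)
# Call trace entry, e.g. " ? seq_read+0xc9/0x3f0" (module name is optional)
TRACE_LINE = re.compile(r"\?\s+(\w+)\+0x[0-9a-f]+/0x[0-9a-f]+(?:\s+\[(\w+)\])?")

# Hypothetical sample, trimmed from the oops quoted in this thread.
SAMPLE = """\
CPU: 3 PID: 877 Comm: cat Tainted: P           OE   4.12.8-300.fc26.x86_64 #1
RIP: 0010:rm_get_device_name+0x9a/0x1b0 [nvidia]
Call Trace:
 ? get_page_from_freelist+0x963/0xb60
 ? nv_procfs_close_unbind_lock+0x171/0x1d0 [nvidia]
 ? nv_procfs_read_gpu_info+0x2f4/0x350 [nvidia]
 ? seq_read+0xc9/0x3f0
 ? proc_reg_read+0x42/0x70
"""


def summarize_oops(text):
    """Return PID, command, faulting symbol and frames that name a module."""
    summary = {"pid": None, "comm": None, "rip": None, "module_frames": []}
    for line in text.splitlines():
        m = CPU_LINE.search(line)
        if m:
            summary["pid"] = int(m.group(1))
            summary["comm"] = m.group(2)
            continue
        m = RIP_LINE.search(line)
        if m:
            # Keep the first RIP line; the oops repeats it near the end.
            if summary["rip"] is None:
                summary["rip"] = m.group(1)
            continue
        m = TRACE_LINE.search(line)
        if m and m.group(2):
            # Only record frames tagged with a module, e.g. [nvidia].
            summary["module_frames"].append((m.group(1), m.group(2)))
    return summary


if __name__ == "__main__":
    print(summarize_oops(SAMPLE))
```

This only summarizes the report. It does not identify the parent process that launched cat, which would still need to be caught separately.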

I’m not entirely sure it’s the kernel driver that’s causing it, as that didn’t update yesterday, only the user space stuff did.

I ran the nvidia-bug-report script, but that caused a crash too:

Aug 30 19:28:42 Thor systemd-coredump[6670]: Process 6667 (nvidia-settings) of user 0 dumped core.
                                             
                                             Stack trace of thread 6667:
                                             #0  0x00007fdf8e96269b __GI_raise (libc.so.6)
                                             #1  0x00007fdf8e9644a0 __GI_abort (libc.so.6)
                                             #2  0x00007fdf8e95ad5a __assert_fail_base (libc.so.6)
                                             #3  0x00007fdf8e95add2 __GI___assert_fail (libc.so.6)
                                             #4  0x000000000041a72d NvCtrlNvmlGetValidAttributeValues (nvidia-settings)
                                             #5  0x00000000004140cb NvCtrlGetValidDisplayAttributeValues (nvidia-settings)
                                             #6  0x000000000040db1e nv_process_assignments_and_queries (nvidia-settings)
                                             #7  0x0000000000406b75 main (nvidia-settings)
                                             #8  0x00007fdf8e94c50a __libc_start_main (libc.so.6)
                                             #9  0x0000000000406e1a _start (nvidia-settings)

I’ve attached what it did manage to produce.

Regards
elFarto
nvidia-bug-report.log.gz (71.8 KB)

generix · August 30, 2017, 7:03pm

It looks like the driver is incorrectly installed; X is using the Mesa GLX module. Purge and reinstall the NVIDIA drivers.

elFarto · August 30, 2017, 8:19pm

Reinstalling the glvnd packages seems to have resolved the issue, though I’m not convinced that was the problem.

I don’t see any mention of mesa in the log file, what led you to think it was that?

Regards
elFarto

generix · August 31, 2017, 8:38am

[    60.166] (II) Module glx: vendor="X.Org Foundation"
[    60.166] 	compiled for 1.19.3, module version = 1.0.0
[    60.166] 	ABI class: X.Org Server Extension, version 10.0

The vendor should be NVIDIA.
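
For anyone wanting to repeat this check, here is a minimal Python sketch that reads the vendor string of each module from X server log text. It assumes the log uses the `(II) Module <name>: vendor="..."` layout quoted above. The function names and the SAMPLE text are illustrative, and where the log file lives (for example /var/log/Xorg.0.log) is an assumption that varies by distribution.

```python
import re

# Matches lines such as: [    60.166] (II) Module glx: vendor="X.Org Foundation"
MODULE_VENDOR = re.compile(r'\(II\) Module (\S+): vendor="([^"]*)"')

# Hypothetical sample, copied from the log excerpt in this thread.
SAMPLE = """\
[    60.166] (II) Module glx: vendor="X.Org Foundation"
[    60.166] \tcompiled for 1.19.3, module version = 1.0.0
[    60.166] \tABI class: X.Org Server Extension, version 10.0
"""


def module_vendors(log_text):
    """Map each loaded X module name to the vendor string found in the log."""
    vendors = {}
    for line in log_text.splitlines():
        m = MODULE_VENDOR.search(line)
        if m:
            # If a module is reported more than once, the last entry wins.
            vendors[m.group(1)] = m.group(2)
    return vendors


def glx_is_nvidia(log_text):
    """True if a glx module was loaded and its vendor string mentions NVIDIA."""
    vendor = module_vendors(log_text).get("glx")
    return vendor is not None and "nvidia" in vendor.lower()


if __name__ == "__main__":
    print(module_vendors(SAMPLE))
    print(glx_is_nvidia(SAMPLE))
```

To run it against a real log, read the file into a string and pass it to glx_is_nvidia. It is worth confirming first that the log belongs to the current X session, since a stale log would give a misleading answer.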

elFarto · September 3, 2017, 7:22am

Hi

Even after upgrading to 4.12.9 and 384.69 I’m still getting this oops, now every second. I’m not sure the X logs in my previous bug report are valid, as they don’t seem to have been updated recently.

Edit: After completely removing and reinstalling the driver, it’s working fine again. Let’s hope it stays that way.

Sep  3 08:04:25 Thor kernel: NVRM: loading NVIDIA UNIX x86_64 Kernel Module  384.69  Wed Aug 16 19:34:54 PDT 2017 (using threaded interrupts)
Sep  3 08:04:25 Thor systemd-udevd[704]: Process '/usr/bin/bash -c '/usr/bin/mknod -Z -m 666 /dev/nvidiactl c $(grep nvidia-frontend /proc/devices | cut -d \  -f 1) 255' failed with exit code 1.
Sep  3 08:04:25 Thor kernel: nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms  384.69  Wed Aug 16 19:39:44 PDT 2017
Sep  3 08:04:25 Thor kernel: [drm] [nvidia-drm] [GPU ID 0x00000100] Loading driver
Sep  3 08:04:25 Thor kernel: BUG: unable to handle kernel NULL pointer dereference at 0000000000000378
Sep  3 08:04:25 Thor kernel: IP: rm_get_device_name+0x9a/0x1a0 [nvidia]
Sep  3 08:04:25 Thor kernel: PGD 0 
Sep  3 08:04:25 Thor kernel: P4D 0 
Sep  3 08:04:25 Thor kernel: 
Sep  3 08:04:25 Thor kernel: Oops: 0000 [#1] SMP
Sep  3 08:04:25 Thor kernel: Modules linked in: nvidia_drm(POE+) nvidia_modeset(POE) nvidia(POE) intel_rapl x86_pkg_temp_thermal intel_powerclamp coretemp eeepc_wmi asus_wmi sparse_keymap snd_hda_codec_realtek rfkill snd_hda_codec_generic kvm_intel kvm snd_hda_intel snd_hda_codec snd_hda_core snd_hwdep snd_seq irqbypass iTCO_wdt drm_kms_helper iTCO_vendor_support snd_seq_device crct10dif_pclmul crc32_pclmul drm snd_pcm ghash_clmulni_intel intel_cstate snd_timer intel_uncore snd mei_me intel_rapl_perf joydev i2c_i801 mei lpc_ich soundcore shpchp tpm_infineon tpm_tis tpm_tis_core tpm mxm_wmi crc32c_intel r8169 uas usb_storage mii hid_pl ff_memless wmi video
Sep  3 08:04:25 Thor kernel: CPU: 6 PID: 867 Comm: cat Tainted: P           OE   4.12.9-300.fc26.x86_64 #1
Sep  3 08:04:25 Thor kernel: Hardware name: ASUS All Series/Z87-A, BIOS 2103 08/15/2014
Sep  3 08:04:25 Thor kernel: task: ffff9c9fd6b30000 task.stack: ffffbe56027d4000
Sep  3 08:04:25 Thor kernel: RIP: 0010:rm_get_device_name+0x9a/0x1a0 [nvidia]
Sep  3 08:04:25 Thor kernel: RSP: 0018:ffffbe56027d7c18 EFLAGS: 00010282
Sep  3 08:04:25 Thor kernel: RAX: 0000000000000000 RBX: ffff9c9fda29c008 RCX: ffffbe56027d7c44
Sep  3 08:04:25 Thor kernel: RDX: 000000000000002c RSI: 0000000000000000 RDI: ffff9c9fda29c008
Sep  3 08:04:25 Thor kernel: RBP: ffff9c9fdaade000 R08: 0000000000001438 R09: 0000000000000028
Sep  3 08:04:25 Thor kernel: R10: ffffbe56027d7d30 R11: ffff9c9fde002c00 R12: 00000000000019da
Sep  3 08:04:25 Thor kernel: R13: 0000000000001438 R14: ffff9c9fdaadb000 R15: ffffbe56027d7d88
Sep  3 08:04:25 Thor kernel: FS:  00007ff67fff4700(0000) GS:ffff9c9feed80000(0000) knlGS:0000000000000000
Sep  3 08:04:25 Thor kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Sep  3 08:04:25 Thor kernel: CR2: 0000000000000378 CR3: 0000000419ed5000 CR4: 00000000001406e0
Sep  3 08:04:25 Thor kernel: Call Trace:
Sep  3 08:04:25 Thor kernel: ? get_page_from_freelist+0x963/0xb60
Sep  3 08:04:25 Thor kernel: ? kmem_cache_alloc+0x130/0x1b0
Sep  3 08:04:25 Thor kernel: ? nv_procfs_read_gpu_info+0x41/0x350 [nvidia]
Sep  3 08:04:25 Thor kernel: ? nv_procfs_read_gpu_info+0x2f4/0x350 [nvidia]
Sep  3 08:04:25 Thor kernel: ? __kmalloc_node+0x202/0x2c0
Sep  3 08:04:25 Thor kernel: ? kvmalloc_node+0x85/0x90
Sep  3 08:04:25 Thor kernel: ? seq_read+0xc9/0x3f0
Sep  3 08:04:25 Thor kernel: ? proc_reg_read+0x42/0x70
Sep  3 08:04:25 Thor kernel: ? __vfs_read+0x37/0x150
Sep  3 08:04:25 Thor kernel: ? security_file_permission+0x9b/0xc0
Sep  3 08:04:25 Thor kernel: ? vfs_read+0x8e/0x130
Sep  3 08:04:25 Thor kernel: ? SyS_read+0x55/0xc0
Sep  3 08:04:25 Thor kernel: ? do_syscall_64+0x67/0x140
Sep  3 08:04:25 Thor kernel: ? entry_SYSCALL64_slow_path+0x25/0x25
Sep  3 08:04:25 Thor kernel: Code: 18 f8 ff 48 85 db 74 3f 0f b7 93 56 0c 00 00 48 8b 83 18 1e 00 00 48 8d 4c 24 2c 48 89 df 48 89 c6 66 89 54 24 1e ba 2c 00 00 00 <ff> 90 78 03 00 00 85 c0 0f 85 c9 00 00 00 8b 44 24 2c 41 89 c5 
Sep  3 08:04:25 Thor kernel: RIP: rm_get_device_name+0x9a/0x1a0 [nvidia] RSP: ffffbe56027d7c18
Sep  3 08:04:25 Thor kernel: CR2: 0000000000000378

Regards
Stephen

elFarto · October 4, 2017, 5:10pm

Edit: OK, false alarm. I’m still getting occasional oopses, but the once-a-second oops is actually an issue in abrt, which replays the same error over and over.

BUG: unable to handle kernel NULL pointer dereference at 0000000000000378
IP: rm_get_device_name+0x9a/0x1a0 [nvidia]
PGD 0 
P4D 0 
Oops: 0000 [#1] SMP
Modules linked in: nvidia_drm(POE+) snd_hda_codec_generic intel_rapl nvidia_modeset(POE) acpi_cpufreq(-) nvidia_uvm(POE) x86_pkg_temp_thermal intel_powerclamp snd_hda_intel nvidia(POE) coretemp kvm_intel snd_hda_codec kvm snd_hda_core snd_hwdep snd_seq snd_seq_device snd_pcm irqbypass iTCO_wdt crct10dif_pclmul iTCO_vendor_support eeepc_wmi crc32_pclmul drm_kms_helper asus_wmi sparse_keymap ghash_clmulni_intel rfkill snd_timer intel_cstate intel_uncore drm joydev mei_me intel_rapl_perf snd i2c_i801 soundcore mei lpc_ich shpchp tpm_infineon tpm_tis tpm_tis_core tpm mxm_wmi crc32c_intel r8169 mii uas usb_storage hid_pl ff_memless video wmi
CPU: 5 PID: 864 Comm: cat Tainted: P           OE   4.12.14-300.fc26.x86_64 #1
Hardware name: ASUS All Series/Z87-A, BIOS 2103 08/15/2014
task: ffff93c759000000 task.stack: ffff9fb8c2148000
RIP: 0010:rm_get_device_name+0x9a/0x1a0 [nvidia]
RSP: 0018:ffff9fb8c214bc18 EFLAGS: 00010282
RAX: 0000000000000000 RBX: ffff93c75ab48008 RCX: ffff9fb8c214bc44
RDX: 000000000000002c RSI: 0000000000000000 RDI: ffff93c75ab48008
RBP: ffff93c75b17b000 R08: 0000000000001438 R09: 0000000000000028
R10: ffff9fb8c214bd30 R11: 000000000001e548 R12: 00000000000019da
R13: 0000000000001438 R14: ffff93c75b178000 R15: ffff9fb8c214bd88
FS:  00007f034c38c700(0000) GS:ffff93c76ed40000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000378 CR3: 000000041999f000 CR4: 00000000001406e0
Call Trace:
 ? get_page_from_freelist+0x963/0xb60
 ? kmem_cache_alloc+0xb1/0x1b0
 ? nv_procfs_close_unbind_lock+0x131/0x1d0 [nvidia]
 ? nv_procfs_read_gpu_info+0x2f4/0x350 [nvidia]
 ? __kmalloc_node+0x202/0x2c0
 ? kvmalloc_node+0x85/0x90
 ? seq_read+0xc9/0x3f0
 ? proc_reg_read+0x42/0x70
 ? __vfs_read+0x37/0x150
 ? security_file_permission+0x9b/0xc0
 ? vfs_read+0x8e/0x130
 ? SyS_read+0x55/0xc0
 ? do_syscall_64+0x67/0x140
 ? entry_SYSCALL64_slow_path+0x25/0x25
Code: 18 f8 ff 48 85 db 74 3f 0f b7 93 56 0c 00 00 48 8b 83 18 1e 00 00 48 8d 4c 24 2c 48 89 df 48 89 c6 66 89 54 24 1e ba 2c 00 00 00 <ff> 90 78 03 00 00 85 c0 0f 85 c9 00 00 00 8b 44 24 2c 41 89 c5 
RIP: rm_get_device_name+0x9a/0x1a0 [nvidia] RSP: ffff9fb8c214bc18
CR2: 0000000000000378

Oct 04 18:08:24 Thor abrt-notification[8908]: System encountered a non-fatal error in get_page_from_freelist()

Oct 04 18:08:25 Thor abrt-notification[8928]: System encountered a non-fatal error in ??()

Oct 04 18:08:26 Thor abrt-notification[8937]: System encountered a non-fatal error in get_page_from_freelist()

Oct 04 18:08:27 Thor abrt-notification[8957]: System encountered a non-fatal error in ??()

Oct 04 18:08:29 Thor abrt-notification[8966]: System encountered a non-fatal error in get_page_from_freelist()

Oct 04 18:08:30 Thor abrt-notification[8986]: System encountered a non-fatal error in ??()

Oct 04 18:08:31 Thor abrt-notification[8995]: System encountered a non-fatal error in get_page_from_freelist()

Oct 04 18:08:32 Thor abrt-notification[9015]: System encountered a non-fatal error in ??()

Oct 04 18:08:34 Thor abrt-notification[9024]: System encountered a non-fatal error in get_page_from_freelist()
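
One rough way to tell a replayed report from a fresh crash is to compare a few fields that would normally change between independent oopses. The Python sketch below assumes each report has been saved as a separate block of text and uses the faulting symbol, PID and stack pointer from the oops as a signature. The function names are hypothetical, and this is only a heuristic, since two genuine crashes could in principle share all three values.

```python
import re
from collections import Counter

# Closing line of an oops, e.g.
# "RIP: rm_get_device_name+0x9a/0x1a0 [nvidia] RSP: ffff9fb8c214bc18"
FINAL_RIP = re.compile(
    r"RIP:\s+(\w+)\+0x[0-9a-f]+/0x[0-9a-f]+.*RSP:\s+([0-9a-f]{16})"
)
# Header line, e.g. "CPU: 5 PID: 864 Comm: cat Tainted: ..."
PID_LINE = re.compile(r"PID:\s*(\d+)\s+Comm:")


def oops_signature(report):
    """Return (faulting symbol, PID, stack pointer) for one oops report."""
    symbol = pid = rsp = None
    for line in report.splitlines():
        m = PID_LINE.search(line)
        if m:
            pid = int(m.group(1))
        m = FINAL_RIP.search(line)
        if m:
            symbol, rsp = m.group(1), m.group(2)
    return (symbol, pid, rsp)


def count_distinct(reports):
    """Count how often each signature appears across a list of reports."""
    return Counter(oops_signature(r) for r in reports)


if __name__ == "__main__":
    # Hypothetical input: the same report seen twice, plus one different oops.
    first = (
        "CPU: 5 PID: 864 Comm: cat Tainted: P OE 4.12.14-300.fc26.x86_64 #1\n"
        "RIP: rm_get_device_name+0x9a/0x1a0 [nvidia] RSP: ffff9fb8c214bc18\n"
    )
    second = (
        "CPU: 4 PID: 86 Comm: kworker/4:1 Tainted: P D OE 4.12.14-300.fc26.x86_64 #1\n"
        "RIP: _nv006723rm+0x89/0xe0 [nvidia] RSP: ffffa82201bb7d30\n"
    )
    for signature, count in count_distinct([first, first, second]).items():
        print(count, signature)
```

A signature that shows up many times while the kernel log holds only one matching oops would be consistent with a replay rather than repeated crashes.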

Edit: Here’s another oops that’s showing up a lot:

BUG: unable to handle kernel paging request at ffffa82202717cb8
IP: _nv006723rm+0x89/0xe0 [nvidia]
PGD 41e124067 
P4D 41e124067 
PUD 41e125067 
PMD 417d2e067 
PTE 0
Oops: 0000 [#2] SMP
Modules linked in: snd_hda_codec_hdmi nvidia_drm(POE+) nvidia_modeset(POE) nvidia(POE) snd_hda_codec_realtek snd_hda_codec_generic intel_rapl snd_hda_intel x86_pkg_temp_thermal intel_powerclamp coretemp snd_hda_codec snd_hda_core snd_hwdep eeepc_wmi iTCO_wdt kvm_intel snd_seq asus_wmi iTCO_vendor_support snd_seq_device sparse_keymap rfkill kvm snd_pcm drm_kms_helper drm irqbypass mei_me snd_timer crct10dif_pclmul crc32_pclmul joydev ghash_clmulni_intel intel_cstate intel_uncore snd lpc_ich i2c_i801 intel_rapl_perf soundcore mei shpchp tpm_infineon tpm_tis tpm_tis_core tpm r8169 mxm_wmi crc32c_intel mii uas hid_pl usb_storage ff_memless wmi video
CPU: 4 PID: 86 Comm: kworker/4:1 Tainted: P      D    OE   4.12.14-300.fc26.x86_64 #1
Hardware name: ASUS All Series/Z87-A, BIOS 2103 08/15/2014
Workqueue: events os_execute_work_item [nvidia]
task: ffff95f65bf4a5c0 task.stack: ffffa82201bb4000
RIP: 0010:_nv006723rm+0x89/0xe0 [nvidia]
RSP: 0018:ffffa82201bb7d30 EFLAGS: 00010082
RAX: ffffa82202717c98 RBX: ffffa82202307900 RCX: ffffa82201bb7de0
RDX: ffffa82202307900 RSI: ffffa82202377c98 RDI: ffffffffc10bd8b8
RBP: ffff95f64e782ff8 R08: 000000000001e593 R09: ffff95f64e780000
R10: ffffa82201bb7de8 R11: 000000000001e548 R12: ffffa82201bb7de0
R13: ffffffffc10bd8b8 R14: ffffa82201bb7e50 R15: ffff95f65c370cc0
FS:  0000000000000000(0000) GS:ffff95f66ed00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: ffffa82202717cb8 CR3: 0000000417239000 CR4: 00000000001406e0
Call Trace:
 ? _nv006722rm+0x6a/0x90 [nvidia]
 ? _nv024818rm+0x13/0x50 [nvidia]
 ? _nv034129rm+0x11b/0x1d0 [nvidia]
 ? rm_execute_work_item+0x41/0xc0 [nvidia]
 ? os_alloc_mem+0xc1/0xe0 [nvidia]
 ? os_execute_work_item+0x46/0x70 [nvidia]
 ? process_one_work+0x193/0x3c0
 ? worker_thread+0x4a/0x3a0
 ? kthread+0x125/0x140
 ? process_one_work+0x3c0/0x3c0
 ? kthread_park+0x60/0x60
 ? ret_from_fork+0x25/0x30
Code: 66 90 4c 39 63 10 74 4b 48 8b 73 08 c6 43 20 00 4c 89 ef c6 46 20 01 e8 c6 fe ff ff eb 94 0f 1f 40 00 48 8b 46 18 48 85 c0 74 09 <80> 78 20 00 0f 1f 00 75 aa 4c 39 63 18 74 2d 48 8b 73 08 c6 43 
RIP: _nv006723rm+0x89/0xe0 [nvidia] RSP: ffffa82201bb7d30
CR2: ffffa82202717cb8

Regards
elFarto

generix · October 6, 2017, 12:28pm

Please provide another nvidia-bug-report with oopses visible.