Hi NVIDIA devs,
Over the past couple of months I’ve been experiencing kernel crashes (with crash dumps) using NVIDIA driver 430.* on RHEL 7.7.
A colleague investigated, and the dumps point to _nv002453kms (_nv002454kms in the more recent dumps) in nvidia_modeset as the source of the crashes.
Here’s what we’ve found:
RIP lines extracted from the last few crashes:
127.0.0.1-2019-09-14-10:16:11/vmcore-dmesg.txt:[658571.742767] RIP: 0010:[<ffffffffc0dac600>] [<ffffffffc0dac600>] _nv002453kms+0x60/0x100 [nvidia_modeset]
127.0.0.1-2019-09-15-11:41:55/vmcore-dmesg.txt:[91039.312415] RIP: 0010:[<ffffffffc10d0600>] [<ffffffffc10d0600>] _nv002453kms+0x60/0x100 [nvidia_modeset]
127.0.0.1-2019-09-24-14:28:31/vmcore-dmesg.txt:[349453.025883] RIP: 0010:[<ffffffffc13fa620>] [<ffffffffc13fa620>] _nv002454kms+0x60/0x100 [nvidia_modeset]
127.0.0.1-2019-10-03-11:54:19/vmcore-dmesg.txt:[228015.083548] RIP: 0010:[<ffffffffc127c620>] [<ffffffffc127c620>] _nv002454kms+0x60/0x100 [nvidia_modeset]
127.0.0.1-2019-10-10-22:14:54/vmcore-dmesg.txt:[512664.217091] RIP: 0010:[<ffffffffc0e8c620>] [<ffffffffc0e8c620>] _nv002454kms+0x60/0x100 [nvidia_modeset]
127.0.0.1-2019-10-11-07:54:43/vmcore-dmesg.txt:[34474.174799] RIP: 0010:[<ffffffffb1e56b2b>] [<ffffffffb1e56b2b>] path_init+0x33b/0x3f0
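For anyone wanting to summarize a larger set of dumps, here is a minimal Python sketch that tallies the faulting symbol from RIP lines like the ones above. The function names (parse_rip, tally) and the "kernel" label for lines without a module tag are my own placeholders, and the regular expression only assumes the line layout shown in this post.

```python
import re
from collections import Counter

# Matches "symbol+0xOFF/0xSIZE" at the end of a RIP line, optionally
# followed by a "[module]" tag, as in the vmcore-dmesg.txt lines above.
RIP_RE = re.compile(
    r"(?P<symbol>\S+)\+0x(?P<offset>[0-9a-f]+)/0x(?P<size>[0-9a-f]+)"
    r"(?:\s+\[(?P<module>\w+)\])?\s*$"
)


def parse_rip(line):
    """Return (module, symbol, offset) for a RIP line, or None if no match."""
    m = RIP_RE.search(line.strip())
    if m is None:
        return None
    # "kernel" is a placeholder label for lines that carry no [module] tag.
    module = m.group("module") or "kernel"
    return (module, m.group("symbol"), int(m.group("offset"), 16))


def tally(lines):
    """Count how often each (module, symbol, offset) triple appears."""
    return Counter(p for p in map(parse_rip, lines) if p is not None)
```

Feeding it the output of a grep for "RIP:" across the vmcore-dmesg.txt files gives one count per faulting location, which makes it easy to see that every nvidia_modeset crash listed here is at offset 0x60 of the same 0x100-byte function.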
Details from the 2019-10-10 crash:
127.0.0.1-2019-10-10-22:14:54/vmcore-dmesg.txt:[512664.217091] RIP: 0010:[<ffffffffc0e8c620>] [<ffffffffc0e8c620>] _nv002454kms+0x60/0x100 [nvidia_modeset]
crash> sys
CPUS: 32
DATE: Thu Oct 10 22:14:39 2019
UPTIME: 5 days, 22:44:29
LOAD AVERAGE: 3.23, 4.97, 7.91
TASKS: 3132
NODENAME: daltigoth
RELEASE: 3.10.0-1062.4.1.el7.x86_64
VERSION: #1 SMP Wed Sep 25 15:03:35 EDT 2019
MACHINE: x86_64 (2100 Mhz)
MEMORY: 255.5 GB
PANIC: "BUG: unable to handle kernel paging request at 0000000000006f80"
* The panic task was "X". Its backtrace is as follows:
crash> bt
PID: 167023 TASK: ffff8e3c85a25230 CPU: 5 COMMAND: "X"
#0 [ffff8e24786bf830] machine_kexec at ffffffffb62657e4
#1 [ffff8e24786bf890] __crash_kexec at ffffffffb6320a72
#2 [ffff8e24786bf960] crash_kexec at ffffffffb6320b60
#3 [ffff8e24786bf978] oops_end at ffffffffb6983798
#4 [ffff8e24786bf9a0] no_context at ffffffffb6274bb4
#5 [ffff8e24786bf9f0] __bad_area_nosemaphore at ffffffffb6274e82
#6 [ffff8e24786bfa40] bad_area_nosemaphore at ffffffffb6274fa4
#7 [ffff8e24786bfa50] __do_page_fault at ffffffffb6986750
#8 [ffff8e24786bfac0] do_page_fault at ffffffffb6986975
#9 [ffff8e24786bfaf0] page_fault at ffffffffb6982778
[exception RIP: _nv002454kms+96]
RIP: ffffffffc0e8c620 RSP: ffff8e24786bfba0 RFLAGS: 00010202
RAX: 0000000000000004 RBX: 0000000000006f80 RCX: 0000000000000004
RDX: ffff8e3c52b92318 RSI: 0000000000006f80 RDI: ffff8e3c52b93008
RBP: 0000000000000000 R8: 0000000000000400 R9: 0000000000000000
R10: 0000000000000004 R11: ffff8e24786bf5d6 R12: 0000000000006f80
R13: 0000000000006f80 R14: ffff8e3c52b93008 R15: 0000000000000001
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
#10 [ffff8e24786bfbd8] _nv002597kms at ffffffffc0e78cea [nvidia_modeset]
#11 [ffff8e24786bfdb8] nvKmsIoctl at ffffffffc0e49886 [nvidia_modeset]
#12 [ffff8e24786bfe08] nvkms_ioctl_common at ffffffffc0e46012 [nvidia_modeset]
#13 [ffff8e24786bfe38] nvkms_ioctl at ffffffffc0e46113 [nvidia_modeset]
#14 [ffff8e24786bfe70] nvidia_frontend_unlocked_ioctl at ffffffffc1fe3083 [nvidia]
#15 [ffff8e24786bfe80] do_vfs_ioctl at ffffffffb645e270
#16 [ffff8e24786bff00] sys_ioctl at ffffffffb645e511
#17 [ffff8e24786bff50] system_call_fastpath at ffffffffb698bede
RIP: 00007f79f31b12f7 RSP: 00007ffde7619ca0 RFLAGS: 00003206
RAX: 0000000000000010 RBX: 00007ffde761fde0 RCX: 0000000000000000
RDX: 00007ffde7619d20 RSI: 00000000c0106d00 RDI: 0000000000000011
RBP: 0000000000000011 R8: 0000000000000000 R9: 0000000000000001
R10: 0000000000000000 R11: 0000000000003246 R12: 0000565303418ec0
R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
ORIG_RAX: 0000000000000010 CS: 0033 SS: 002b
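One detail in the exception frame above is that the panic address (0x6f80) also shows up in several registers. The following is a minimal Python sketch for checking that mechanically; registers_matching is a hypothetical helper name, and the regular expression assumes the 16-digit "NAME: value" layout that the crash utility prints here. It only reports which registers hold the address and makes no claim about why.

```python
import re

# Matches "NAME: <16 hex digits>" pairs as printed in the bt register dump.
# Shorter fields such as RFLAGS, CS and SS are skipped on purpose.
REG_RE = re.compile(r"\b([A-Z][A-Z0-9_]*):\s+([0-9a-f]{16})\b")


def registers_matching(dump, fault_addr):
    """Return the names of registers whose value equals fault_addr."""
    return [
        name
        for name, value in REG_RE.findall(dump)
        if int(value, 16) == fault_addr
    ]
```

Passing it the kernel-side register block from the backtrace together with 0x6f80 lists the registers that carry the faulting address at the time of the oops.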
- The kernel ring buffer indicates that the crash occurred in the “_nv002454kms” function of “nvidia_modeset”:
#>> egrep "nvidia_modeset" lsmod
nvidia_modeset 1112578 9 nvidia_drm
nvidia 19040853 387 nvidia_modeset,nvidia_uvm
#>> egrep "nvidia_modeset" proc/modules
nvidia_modeset 1112578 9 nvidia_drm, Live 0xffffffffc1e6c000 (POE)
nvidia 19040853 387 nvidia_modeset,nvidia_uvm, Live 0xffffffffc0b7a000 (POE)
#>> less sos_commands/kernel/modinfo_ALL_MODULES
filename: /lib/modules/3.10.0-1062.4.1.el7.x86_64/weak-updates/nvidia/nvidia-modeset.ko
version: 430.50
supported: external
license: NVIDIA
retpoline: Y
rhelversion: 7.7
srcversion: 0812C6DC2E101D617DA9F87
depends: nvidia
vermagic: 3.10.0-1062.3.2.el7.x86_64 SMP mod_unload modversions