RHEL 7.7 + 430.50: random kernel panics in _nv002453kms (nvidia_modeset)

Hi NVidia devs,

For the past couple of months I’ve been experiencing kernel panics (captured by kdump) with the NVidia 430.* driver on RHEL 7.7.
After a colleague investigated, we found that _nv002453kms in nvidia_modeset is where the crashes originate.
Here’s what we’ve found:

RIP extracts from the last few crash dumps:

127.0.0.1-2019-09-14-10:16:11/vmcore-dmesg.txt:[658571.742767] RIP: 0010:[<ffffffffc0dac600>]  [<ffffffffc0dac600>] _nv002453kms+0x60/0x100 [nvidia_modeset]
127.0.0.1-2019-09-15-11:41:55/vmcore-dmesg.txt:[91039.312415] RIP: 0010:[<ffffffffc10d0600>]  [<ffffffffc10d0600>] _nv002453kms+0x60/0x100 [nvidia_modeset]
127.0.0.1-2019-09-24-14:28:31/vmcore-dmesg.txt:[349453.025883] RIP: 0010:[<ffffffffc13fa620>]  [<ffffffffc13fa620>] _nv002454kms+0x60/0x100 [nvidia_modeset]
127.0.0.1-2019-10-03-11:54:19/vmcore-dmesg.txt:[228015.083548] RIP: 0010:[<ffffffffc127c620>]  [<ffffffffc127c620>] _nv002454kms+0x60/0x100 [nvidia_modeset]
127.0.0.1-2019-10-10-22:14:54/vmcore-dmesg.txt:[512664.217091] RIP: 0010:[<ffffffffc0e8c620>]  [<ffffffffc0e8c620>] _nv002454kms+0x60/0x100 [nvidia_modeset]
127.0.0.1-2019-10-11-07:54:43/vmcore-dmesg.txt:[34474.174799] RIP: 0010:[<ffffffffb1e56b2b>]  [<ffffffffb1e56b2b>] path_init+0x33b/0x3f0
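
For reference, these lines were pulled from the kdump-generated vmcore-dmesg.txt files roughly as follows; a minimal sketch, assuming the default kdump dump directory /var/crash (adjust the path if your dumps land elsewhere):

#>> cd /var/crash
#>> grep "RIP: 0010" */vmcore-dmesg.txt
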
crash> sys
        CPUS: 32
        DATE: Thu Oct 10 22:14:39 2019
      UPTIME: 5 days, 22:44:29
LOAD AVERAGE: 3.23, 4.97, 7.91
       TASKS: 3132
    NODENAME: daltigoth
     RELEASE: 3.10.0-1062.4.1.el7.x86_64
     VERSION: #1 SMP Wed Sep 25 15:03:35 EDT 2019
     MACHINE: x86_64  (2100 Mhz)
      MEMORY: 255.5 GB
       PANIC: "BUG: unable to handle kernel paging request at 0000000000006f80"


* The panic task was "X". Its backtrace was as follows:

crash> bt
PID: 167023  TASK: ffff8e3c85a25230  CPU: 5   COMMAND: "X"
 #0 [ffff8e24786bf830] machine_kexec at ffffffffb62657e4
 #1 [ffff8e24786bf890] __crash_kexec at ffffffffb6320a72
 #2 [ffff8e24786bf960] crash_kexec at ffffffffb6320b60
 #3 [ffff8e24786bf978] oops_end at ffffffffb6983798
 #4 [ffff8e24786bf9a0] no_context at ffffffffb6274bb4
 #5 [ffff8e24786bf9f0] __bad_area_nosemaphore at ffffffffb6274e82
 #6 [ffff8e24786bfa40] bad_area_nosemaphore at ffffffffb6274fa4
 #7 [ffff8e24786bfa50] __do_page_fault at ffffffffb6986750
 #8 [ffff8e24786bfac0] do_page_fault at ffffffffb6986975
 #9 [ffff8e24786bfaf0] page_fault at ffffffffb6982778
    [exception RIP: _nv002454kms+96]
    RIP: ffffffffc0e8c620  RSP: ffff8e24786bfba0  RFLAGS: 00010202
    RAX: 0000000000000004  RBX: 0000000000006f80  RCX: 0000000000000004
    RDX: ffff8e3c52b92318  RSI: 0000000000006f80  RDI: ffff8e3c52b93008
    RBP: 0000000000000000   R8: 0000000000000400   R9: 0000000000000000
    R10: 0000000000000004  R11: ffff8e24786bf5d6  R12: 0000000000006f80
    R13: 0000000000006f80  R14: ffff8e3c52b93008  R15: 0000000000000001
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
#10 [ffff8e24786bfbd8] _nv002597kms at ffffffffc0e78cea [nvidia_modeset]
#11 [ffff8e24786bfdb8] nvKmsIoctl at ffffffffc0e49886 [nvidia_modeset]
#12 [ffff8e24786bfe08] nvkms_ioctl_common at ffffffffc0e46012 [nvidia_modeset]
#13 [ffff8e24786bfe38] nvkms_ioctl at ffffffffc0e46113 [nvidia_modeset]
#14 [ffff8e24786bfe70] nvidia_frontend_unlocked_ioctl at ffffffffc1fe3083 [nvidia]
#15 [ffff8e24786bfe80] do_vfs_ioctl at ffffffffb645e270
#16 [ffff8e24786bff00] sys_ioctl at ffffffffb645e511
#17 [ffff8e24786bff50] system_call_fastpath at ffffffffb698bede
    RIP: 00007f79f31b12f7  RSP: 00007ffde7619ca0  RFLAGS: 00003206
    RAX: 0000000000000010  RBX: 00007ffde761fde0  RCX: 0000000000000000
    RDX: 00007ffde7619d20  RSI: 00000000c0106d00  RDI: 0000000000000011
    RBP: 0000000000000011   R8: 0000000000000000   R9: 0000000000000001
    R10: 0000000000000000  R11: 0000000000003246  R12: 0000565303418ec0
    R13: 0000000000000000  R14: 0000000000000000  R15: 0000000000000000
    ORIG_RAX: 0000000000000010  CS: 0033  SS: 002b
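
Worth noting: RBX, RSI, R12 and R13 all hold 0000000000006f80, the exact address from the paging-request message, so this looks like the blob dereferencing a bogus (near-NULL) pointer at some offset. To look at the faulting instruction from the same crash session, something along these lines should work; only a sketch: the NVidia modules ship no debuginfo, so the .ko has to be loaded by hand (path taken from the modinfo extract further down) and only raw disassembly is available:

crash> sym ffffffffc0e8c620
crash> mod -s nvidia_modeset /lib/modules/3.10.0-1062.4.1.el7.x86_64/weak-updates/nvidia/nvidia-modeset.ko
crash> dis -r ffffffffc0e8c620
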
* The kernel ring buffer indicates the crash occurred in the "_nv002454kms" function of "nvidia_modeset". The corresponding module details (from the sosreport):
#>> egrep "nvidia_modeset" lsmod 
nvidia_modeset       1112578  9 nvidia_drm
nvidia              19040853  387 nvidia_modeset,nvidia_uvm


#>> egrep "nvidia_modeset" proc/modules 
nvidia_modeset 1112578 9 nvidia_drm, Live 0xffffffffc1e6c000 (POE)
nvidia 19040853 387 nvidia_modeset,nvidia_uvm, Live 0xffffffffc0b7a000 (POE)


#>> less sos_commands/kernel/modinfo_ALL_MODULES 
filename:       /lib/modules/3.10.0-1062.4.1.el7.x86_64/weak-updates/nvidia/nvidia-modeset.ko
version:        430.50
supported:      external
license:        NVIDIA
retpoline:      Y
rhelversion:    7.7
srcversion:     0812C6DC2E101D617DA9F87
depends:        nvidia
vermagic:       3.10.0-1062.3.2.el7.x86_64 SMP mod_unload modversions
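
One detail from the modinfo output: the module was built against 3.10.0-1062.3.2.el7 (vermagic) but runs on 3.10.0-1062.4.1.el7 via weak-updates, which is how kABI-tracking kmod packages are normally wired up. To double-check which .ko the running kernel actually resolves, a quick sketch (standard RHEL paths assumed):

#>> modinfo -F version nvidia-modeset
#>> ls -l /lib/modules/$(uname -r)/weak-updates/nvidia/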

We’ve also been seeing this on both laptops and workstations.

Common factors:
OS: RHEL 7.7
Kernel: 3.10.0-1062.4.1.el7.x86_64
Cards: Quadro P2000 (rev a1) / Quadro M1200 Mobile (rev a2)
Drivers tried: 430.40 / 440.36 / 440.44 / 440.59

Crash cause: BUG: unable to handle kernel paging request at {0000000000006c00} (nvidia_modeset)

Please let me know how to escalate this.

@CSC-i, are you a RHEL customer? In any case, please do open a service request. I’ve been getting nothing but blissful ignorance from the NVidia forums.

I work for Red Hat, but since this is my personal homelab I didn’t want to escalate it through TSANet, as it isn’t a customer-impacting issue.
If you are a RHEL customer, please do open an SR with NVidia and reference my (multiple) posts so that we can escalate this to NVidia through TSANet.

On a side note: not a single crash since I moved to 440.64, but it’s too early to say.

Thanks, Vinceto.

I’ll try 440.64 on one of the failing machines, and if it continues to exhibit the same issues I’ll open an SR with NVidia.

Hi, if you are able to open an SR with NVidia, that would be the best option. I tried raising awareness here (Red Hat), but since I’m only a consultant and didn’t have a customer being hit by the issue, it went nowhere.
If you also have a support contract with Red Hat, please open an SR there as well and reference the NVidia SR so that Red Hat can work with NVidia engineering too.
Thank you,
Vincent

Hello,

We are having a similar problem and I’m wondering if you ever got this resolved?

Thank you.