RHEL 7.7 + 430.50: random kernel panics in _nv002453kms (nvidia_modeset)

Hi NVidia devs,

For the past couple of months I’ve been experiencing kernel panics (captured by kdump) with the NVidia 430.* driver on RHEL 7.7.
After a colleague investigated, we found that _nv002453kms in nvidia_modeset is where the crashes originate.
Here’s what we’ve found:

RIP extracts from the last few crash dumps:

127.0.0.1-2019-09-14-10:16:11/vmcore-dmesg.txt:[658571.742767] RIP: 0010:[<ffffffffc0dac600>]  [<ffffffffc0dac600>] _nv002453kms+0x60/0x100 [nvidia_modeset]
127.0.0.1-2019-09-15-11:41:55/vmcore-dmesg.txt:[91039.312415] RIP: 0010:[<ffffffffc10d0600>]  [<ffffffffc10d0600>] _nv002453kms+0x60/0x100 [nvidia_modeset]
127.0.0.1-2019-09-24-14:28:31/vmcore-dmesg.txt:[349453.025883] RIP: 0010:[<ffffffffc13fa620>]  [<ffffffffc13fa620>] _nv002454kms+0x60/0x100 [nvidia_modeset]
127.0.0.1-2019-10-03-11:54:19/vmcore-dmesg.txt:[228015.083548] RIP: 0010:[<ffffffffc127c620>]  [<ffffffffc127c620>] _nv002454kms+0x60/0x100 [nvidia_modeset]
127.0.0.1-2019-10-10-22:14:54/vmcore-dmesg.txt:[512664.217091] RIP: 0010:[<ffffffffc0e8c620>]  [<ffffffffc0e8c620>] _nv002454kms+0x60/0x100 [nvidia_modeset]
127.0.0.1-2019-10-11-07:54:43/vmcore-dmesg.txt:[34474.174799] RIP: 0010:[<ffffffffb1e56b2b>]  [<ffffffffb1e56b2b>] path_init+0x33b/0x3f0
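
For reference, these lines were pulled from the kdump-generated vmcore-dmesg.txt files roughly as follows; a minimal sketch, assuming the default kdump dump directory /var/crash (adjust the path if your dumps land elsewhere):

#>> cd /var/crash
#>> grep "RIP: 0010" */vmcore-dmesg.txt
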
crash> sys
        CPUS: 32
        DATE: Thu Oct 10 22:14:39 2019
      UPTIME: 5 days, 22:44:29
LOAD AVERAGE: 3.23, 4.97, 7.91
       TASKS: 3132
    NODENAME: daltigoth
     RELEASE: 3.10.0-1062.4.1.el7.x86_64
     VERSION: #1 SMP Wed Sep 25 15:03:35 EDT 2019
     MACHINE: x86_64  (2100 Mhz)
      MEMORY: 255.5 GB
       PANIC: "BUG: unable to handle kernel paging request at 0000000000006f80"


* The panic task was "X". Its backtrace was as follows:

crash> bt
PID: 167023  TASK: ffff8e3c85a25230  CPU: 5   COMMAND: "X"
 #0 [ffff8e24786bf830] machine_kexec at ffffffffb62657e4
 #1 [ffff8e24786bf890] __crash_kexec at ffffffffb6320a72
 #2 [ffff8e24786bf960] crash_kexec at ffffffffb6320b60
 #3 [ffff8e24786bf978] oops_end at ffffffffb6983798
 #4 [ffff8e24786bf9a0] no_context at ffffffffb6274bb4
 #5 [ffff8e24786bf9f0] __bad_area_nosemaphore at ffffffffb6274e82
 #6 [ffff8e24786bfa40] bad_area_nosemaphore at ffffffffb6274fa4
 #7 [ffff8e24786bfa50] __do_page_fault at ffffffffb6986750
 #8 [ffff8e24786bfac0] do_page_fault at ffffffffb6986975
 #9 [ffff8e24786bfaf0] page_fault at ffffffffb6982778
    [exception RIP: _nv002454kms+96]
    RIP: ffffffffc0e8c620  RSP: ffff8e24786bfba0  RFLAGS: 00010202
    RAX: 0000000000000004  RBX: 0000000000006f80  RCX: 0000000000000004
    RDX: ffff8e3c52b92318  RSI: 0000000000006f80  RDI: ffff8e3c52b93008
    RBP: 0000000000000000   R8: 0000000000000400   R9: 0000000000000000
    R10: 0000000000000004  R11: ffff8e24786bf5d6  R12: 0000000000006f80
    R13: 0000000000006f80  R14: ffff8e3c52b93008  R15: 0000000000000001
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
#10 [ffff8e24786bfbd8] _nv002597kms at ffffffffc0e78cea [nvidia_modeset]
#11 [ffff8e24786bfdb8] nvKmsIoctl at ffffffffc0e49886 [nvidia_modeset]
#12 [ffff8e24786bfe08] nvkms_ioctl_common at ffffffffc0e46012 [nvidia_modeset]
#13 [ffff8e24786bfe38] nvkms_ioctl at ffffffffc0e46113 [nvidia_modeset]
#14 [ffff8e24786bfe70] nvidia_frontend_unlocked_ioctl at ffffffffc1fe3083 [nvidia]
#15 [ffff8e24786bfe80] do_vfs_ioctl at ffffffffb645e270
#16 [ffff8e24786bff00] sys_ioctl at ffffffffb645e511
#17 [ffff8e24786bff50] system_call_fastpath at ffffffffb698bede
    RIP: 00007f79f31b12f7  RSP: 00007ffde7619ca0  RFLAGS: 00003206
    RAX: 0000000000000010  RBX: 00007ffde761fde0  RCX: 0000000000000000
    RDX: 00007ffde7619d20  RSI: 00000000c0106d00  RDI: 0000000000000011
    RBP: 0000000000000011   R8: 0000000000000000   R9: 0000000000000001
    R10: 0000000000000000  R11: 0000000000003246  R12: 0000565303418ec0
    R13: 0000000000000000  R14: 0000000000000000  R15: 0000000000000000
    ORIG_RAX: 0000000000000010  CS: 0033  SS: 002b
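
Worth noting: RBX, RSI, R12 and R13 all hold 0000000000006f80, the exact address from the paging-request message, so this looks like the blob dereferencing a bogus (near-NULL) pointer at some offset. To look at the faulting instruction from the same crash session, something along these lines should work; only a sketch: the NVidia modules ship no debuginfo, so the .ko has to be loaded by hand (path taken from the modinfo extract further down) and only raw disassembly is available:

crash> sym ffffffffc0e8c620
crash> mod -s nvidia_modeset /lib/modules/3.10.0-1062.4.1.el7.x86_64/weak-updates/nvidia/nvidia-modeset.ko
crash> dis -r ffffffffc0e8c620
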
* The kernel ring buffer indicates the crash occurred in the "_nv002454kms" function of "nvidia_modeset". The corresponding module details (from the sosreport):
#>> egrep "nvidia_modeset" lsmod 
nvidia_modeset       1112578  9 nvidia_drm
nvidia              19040853  387 nvidia_modeset,nvidia_uvm


#>> egrep "nvidia_modeset" proc/modules 
nvidia_modeset 1112578 9 nvidia_drm, Live 0xffffffffc1e6c000 (POE)
nvidia 19040853 387 nvidia_modeset,nvidia_uvm, Live 0xffffffffc0b7a000 (POE)


#>> less sos_commands/kernel/modinfo_ALL_MODULES 
filename:       /lib/modules/3.10.0-1062.4.1.el7.x86_64/weak-updates/nvidia/nvidia-modeset.ko
version:        430.50
supported:      external
license:        NVIDIA
retpoline:      Y
rhelversion:    7.7
srcversion:     0812C6DC2E101D617DA9F87
depends:        nvidia
vermagic:       3.10.0-1062.3.2.el7.x86_64 SMP mod_unload modversions
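
One detail from the modinfo output: the module was built against 3.10.0-1062.3.2.el7 (vermagic) but runs on 3.10.0-1062.4.1.el7 via weak-updates, which is how kABI-tracking kmod packages are normally wired up. To double-check which .ko the running kernel actually resolves, a quick sketch (standard RHEL paths assumed):

#>> modinfo -F version nvidia-modeset
#>> ls -l /lib/modules/$(uname -r)/weak-updates/nvidia/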

We’ve also been seeing this on both laptops and workstations.

Common factors:
OS: RHEL 7.7
Kernel: 3.10.0-1062.4.1.el7.x86_64
Cards: Quadro P2000 (rev a1) / Quadro M1200 Mobile (rev a2)
Drivers tried: 430.40 / 440.36 / 440.44 / 440.59

Crash cause: BUG: unable to handle kernel paging request at {0000000000006c00} (nvidia_modeset)

Please let me know how to escalate this.

@CSC-i, are you a RHEL customer? In any case, please do open a service request. I’ve been getting nothing but blissful ignorance from the NVidia forums.

I work for Red Hat, but since this is my personal homelab I didn’t want to escalate it through TSANet, as it isn’t a customer-impacting issue.
If you are a RHEL customer, please do open an SR with NVidia and reference my (multiple) posts so that we can escalate this to NVidia through TSANet.

On a side note: not a single crash since I moved to 440.64, but it’s too early to say.

Thanks, Vinceto.

I’ll try 440.64 on one of the failing machines, and if it continues to exhibit the same issues I’ll open an SR with NVidia.

Hi, if you are able to open an SR with NVidia, that would be the best option. I tried raising awareness here (Red Hat), but since I’m only a consultant and didn’t have a customer being hit by the issue, it went nowhere.
If you also have a support contract with Red Hat, please open an SR there as well and reference the NVidia SR so that Red Hat can work with NVidia engineering too.
Thank you,
Vincent

Hello,

We are having a similar problem and I’m wondering if you ever got this resolved?

Thank you.