Kernel panic due to a NULL pointer dereference at 0000000000002b20

We installed NVIDIA driver 460.32.03 on RHEL 7.7 (kernel-3.10.0-1062.18.1.el7.x86_64).
The GPUs are 4x NVIDIA Tesla V100 SXM2 32GB.
The system crashes with a kernel panic due to a NULL pointer dereference at 0000000000002b20.

The contents of vmcore-dmesg.txt are as follows (please check the attached file for details).

[2883648.914647] BUG: unable to handle kernel paging request at 0000000000002b20
[2883648.914676] IP: [] _nv036002rm+0x4/0x70 [nvidia]
[2883648.914962] PGD 0
[2883648.914971] Oops: 0000 [#1] SMP

The following is the vmcore analysis result.

crash> bt
PID: 0 TASK: ffff8922d330b150 CPU: 2 COMMAND: "swapper/2"
#0 [ffff89417ee839f0] machine_kexec at ffffffff97665b34
#1 [ffff89417ee83a50] __crash_kexec at ffffffff97722592
#2 [ffff89417ee83b20] crash_kexec at ffffffff97722680
#3 [ffff89417ee83b38] oops_end at ffffffff97d85798
#4 [ffff89417ee83b60] no_context at ffffffff97675bb4
#5 [ffff89417ee83bb0] __bad_area_nosemaphore at ffffffff97675e82
#6 [ffff89417ee83c00] bad_area_nosemaphore at ffffffff97675fa4
#7 [ffff89417ee83c10] __do_page_fault at ffffffff97d88750
#8 [ffff89417ee83c80] do_page_fault at ffffffff97d88975
#9 [ffff89417ee83cb0] page_fault at ffffffff97d84778
[exception RIP: _nv036002rm+4]
RIP: ffffffffc111c664 RSP: ffff89417ee83d68 RFLAGS: 00010092
RAX: ffff8926fb04eb28 RBX: 0000000000000000 RCX: 0000000000000000
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000002b20
RBP: ffff894160e1af00 R8: 0000000000000000 R9: ffff89417ee93900
R10: 0000000000000004 R11: 0000000000000005 R12: ffff8926fb04eb28
R13: 0000000000000000 R14: 00000000ec8fff5e R15: 0000000000000001
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
#10 [ffff89417ee83d70] _nv035997rm at ffffffffc111cd3c [nvidia]
#11 [ffff89417ee83da0] _nv009219rm at ffffffffc0cee761 [nvidia]
#12 [ffff89417ee83dd0] _nv036101rm at ffffffffc0cef57c [nvidia]
#13 [ffff89417ee83df0] _nv032953rm at ffffffffc0d38883 [nvidia]
#14 [ffff89417ee83e20] rm_run_rc_callback at ffffffffc15784e6 [nvidia]
#15 [ffff89417ee83e40] nvidia_rc_timer_callback at ffffffffc0cadfdc [nvidia]
#16 [ffff89417ee83e58] nv_timer_callback_typed_data at ffffffffc0cad47d [nvidia]
#17 [ffff89417ee83e68] call_timer_fn at ffffffff976ac488
#18 [ffff89417ee83ea0] run_timer_softirq at ffffffff976ae8ed
#19 [ffff89417ee83f18] __do_softirq at ffffffff976a5435
#20 [ffff89417ee83f88] call_softirq at ffffffff97d9142c
#21 [ffff89417ee83fa0] do_softirq at ffffffff9762f715
#22 [ffff89417ee83fc0] irq_exit at ffffffff976a57b5
#23 [ffff89417ee83fd8] smp_apic_timer_interrupt at ffffffff97d929d8
#24 [ffff89417ee83ff0] apic_timer_interrupt at ffffffff97d8eefa
— —
#25 [ffff8922d331fdb8] apic_timer_interrupt at ffffffff97d8eefa
[exception RIP: cpuidle_enter_state+87]
RIP: ffffffff97bc1c27 RSP: ffff8922d331fe60 RFLAGS: 00000206
RAX: 000a3ea9cb57577e RBX: 0000000000015960 RCX: 0000000000000018
RDX: 0000000225c17d03 RSI: ffff8922d331ffd8 RDI: 000a3ea9cb57577e
RBP: ffff8922d331fe88 R8: 00000000000003dc R9: 000000000000001c
R10: 000000000000013b R11: 7fffffffffffffff R12: 0000000000000001
R13: ffffffff976ca42d R14: ffff8922d331fe28 R15: 0000000000000087
ORIG_RAX: ffffffffffffff10 CS: 0010 SS: 0018
#26 [ffff8922d331fe90] cpuidle_idle_call at ffffffff97bc1d7e
#27 [ffff8922d331fed0] arch_cpu_idle at ffffffff97637c6e
#28 [ffff8922d331fee0] cpu_startup_entry at ffffffff977017da
#29 [ffff8922d331ff28] start_secondary at ffffffff9765a0c7
#30 [ffff8922d331ff50] start_cpu at ffffffff976000d5
crash> mod -t
NAME TAINTS
nvidia_drm POE
overlay T
nvidia POE
nvidia_uvm OE
nvidia_modeset POE

crash> dis -rl ffffffffc111c664
0xffffffffc111c660 <_nv036002rm>: sub $0x8,%rsp
0xffffffffc111c664 <_nv036002rm+4>: mov (%rdi),%r10
crash> dis -rl ffffffffc111cd3c
0xffffffffc111cd10 <_nv035997rm>: push %r14
0xffffffffc111cd12 <_nv035997rm+2>: push %r13
0xffffffffc111cd14 <_nv035997rm+4>: mov %edx,%r14d
0xffffffffc111cd17 <_nv035997rm+7>: push %r12
0xffffffffc111cd19 <_nv035997rm+9>: push %rbx
0xffffffffc111cd1a <_nv035997rm+10>: mov %rdi,%r12
0xffffffffc111cd1d <_nv035997rm+13>: mov %rcx,%r8
0xffffffffc111cd20 <_nv035997rm+16>: xor %edx,%edx
0xffffffffc111cd22 <_nv035997rm+18>: mov %esi,%ecx
0xffffffffc111cd24 <_nv035997rm+20>: sub $0x8,%rsp
0xffffffffc111cd28 <_nv035997rm+24>: mov (%rdi),%rbx
0xffffffffc111cd2b <_nv035997rm+27>: mov %esi,%r13d
0xffffffffc111cd2e <_nv035997rm+30>: xor %esi,%esi
0xffffffffc111cd30 <_nv035997rm+32>: lea 0x2b20(%rbx),%rdi
0xffffffffc111cd37 <_nv035997rm+39>: callq 0xffffffffc111c660 <_nv036002rm>
0xffffffffc111cd3c <_nv035997rm+44>: test %r14d,%eax

crash> rd 0x2b20
rd: invalid kernel virtual address: 58 type: "mm_struct pgd"

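It looks like the %rdi passed to _nv036002rm is invalid, which causes the kernel crash. In _nv035997rm, %rbx is loaded from the first argument (mov (%rdi),%rbx) and %rdi is then set to 0x2b20(%rbx); since RBX is 0 in the register dump, the pointer read there was NULL and %rdi became 0x2b20.
We could not trace where that NULL pointer comes from.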
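The register dump and disassembly above suggest where 0x2b20 comes from, and the following minimal Python sketch replays that arithmetic as a sanity check. The register values and the 0x2b20 displacement are copied from the crash output; the names `REGS`, `lea`, and `explain` are illustrative only and are not part of crash or the driver. It assumes RBX still holds the value loaded by `mov (%rdi),%rbx` in _nv035997rm, which seems reasonable because _nv036002rm only executes `sub $0x8,%rsp` before faulting.

```python
# Register values copied from the exception frame in the bt output.
REGS = {
    "RBX": 0x0000000000000000,
    "RDI": 0x0000000000002B20,
    "R12": 0xFFFF8926FB04EB28,  # first argument of _nv035997rm (mov %rdi,%r12)
}

# Displacement from: lea 0x2b20(%rbx),%rdi
FIELD_OFFSET = 0x2B20

# Address from: "BUG: unable to handle kernel paging request at ..."
FAULT_ADDR = 0x0000000000002B20


def lea(base: int, displacement: int) -> int:
    """Emulate `lea disp(%base),%dst` with 64-bit wraparound."""
    return (base + displacement) & 0xFFFFFFFFFFFFFFFF


def explain(regs: dict) -> str:
    """Check whether the dump matches a NULL base pointer plus field offset."""
    computed = lea(regs["RBX"], FIELD_OFFSET)
    if regs["RBX"] == 0 and computed == regs["RDI"] == FAULT_ADDR:
        return "RBX is NULL, so RDI = NULL + 0x2b20 and the load from (%rdi) faults"
    return "register values do not match the NULL-base hypothesis"


if __name__ == "__main__":
    print(hex(lea(REGS["RBX"], FIELD_OFFSET)))  # 0x2b20
    print(explain(REGS))
```

If this reading is right, the value to chase is the pointer stored at the address in R12 (ffff8926fb04eb28), not RDI itself.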

The stack traces of the nvidia-smi processes are as follows.

crash> ps | grep nvidia-smi
100249 100247 27 ffff8fad01048000 UN 0.0 17964 3968 nvidia-smi
100251 100248 1 ffff8fb978bfc1c0 RU 0.0 17964 3968 nvidia-smi
100262 100261 38 ffff8fb8ba21c1c0 UN 0.0 17964 3932 nvidia-smi
crash> bt 100251
PID: 100251 TASK: ffff8fb978bfc1c0 CPU: 1 COMMAND: "nvidia-smi"
#0 [ffff8fb77ee48e48] crash_nmi_callback at ffffffff8d458017
#1 [ffff8fb77ee48e58] nmi_handle at ffffffff8db8593c
#2 [ffff8fb77ee48eb0] do_nmi at ffffffff8db85b5d
#3 [ffff8fb77ee48ef0] end_repeat_nmi at ffffffff8db84d9c
[exception RIP: native_queued_spin_lock_slowpath+462]
RIP: ffffffff8d51772e RSP: ffff8fb8c2feb730 RFLAGS: 00000002
RAX: 0000000000000001 RBX: 0000000000000282 RCX: 0000000000000001
RDX: 0000000000000101 RSI: 0000000000000001 RDI: ffff8fd77bcdce28
RBP: ffff8fb8c2feb730 R8: 0000000000000101 R9: ffff8fb77d3d6180
R10: ffff8f98ffc07500 R11: 000000000000c6e0 R12: 0000000000000000
R13: 0000000000000001 R14: 0000000000000200 R15: ffff8f9882deabbc
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
— —
#4 [ffff8fb8c2feb730] native_queued_spin_lock_slowpath at ffffffff8d51772e
#5 [ffff8fb8c2feb738] queued_spin_lock_slowpath at ffffffff8db754ee
#6 [ffff8fb8c2feb748] _raw_spin_lock_irqsave at ffffffff8db83ba7
#7 [ffff8fb8c2feb760] os_acquire_spinlock at ffffffffc2cc6fd2 [nvidia]
#8 [ffff8fb8c2feb778] _nv035354rm at ffffffffc3426aac [nvidia]
#9 [ffff8fb8c2feb788] _nv009219rm at ffffffffc2cf70d9 [nvidia]
#10 [ffff8fb8c2feb7b8] _nv036101rm at ffffffffc2cf857c [nvidia]
#11 [ffff8fb8c2feb7d8] _nv036309rm at ffffffffc2d61c6a [nvidia]
#12 [ffff8fb8c2feb808] _nv002414rm at ffffffffc34383de [nvidia]
#13 [ffff8fb8c2feb838] _nv003464rm at ffffffffc34385ca [nvidia]
#14 [ffff8fb8c2feb868] _nv036073rm at ffffffffc2d60af1 [nvidia]
#15 [ffff8fb8c2feb888] _nv037876rm at ffffffffc3427e89 [nvidia]
#16 [ffff8fb8c2feb8b8] _nv037878rm at ffffffffc342811d [nvidia]
#17 [ffff8fb8c2feb8e8] _nv036171rm at ffffffffc2d60955 [nvidia]
#18 [ffff8fb8c2feb918] _nv036170rm at ffffffffc2d60811 [nvidia]
#19 [ffff8fb8c2feb948] _nv024237rm at ffffffffc31b93ac [nvidia]
#20 [ffff8fb8c2feb978] _nv021544rm at ffffffffc311669d [nvidia]
#21 [ffff8fb8c2feb9a8] _nv021541rm at ffffffffc3113604 [nvidia]
#22 [ffff8fb8c2feb9d8] _nv021993rm at ffffffffc31159f9 [nvidia]
#23 [ffff8fb8c2feba08] _nv022246rm at ffffffffc2d23671 [nvidia]
#24 [ffff8fb8c2feba68] rm_init_adapter at ffffffffc357f985 [nvidia]
#25 [ffff8fb8c2febb28] nv_open_device at ffffffffc2cb7d01 [nvidia]
#26 [ffff8fb8c2febb90] nvidia_open at ffffffffc2cb85bb [nvidia]
#27 [ffff8fb8c2febbf0] nvidia_frontend_open at ffffffffc2cb6388 [nvidia]
#28 [ffff8fb8c2febc18] chrdev_open at ffffffff8d6501e5
#29 [ffff8fb8c2febc60] do_dentry_open at ffffffff8d648356
#30 [ffff8fb8c2febca8] vfs_open at ffffffff8d64849a
#31 [ffff8fb8c2febcd0] do_last at ffffffff8d6591c6
#32 [ffff8fb8c2febd70] path_openat at ffffffff8d65c05d
#33 [ffff8fb8c2febe08] do_filp_open at ffffffff8d65d9cd
#34 [ffff8fb8c2febee0] do_sys_open at ffffffff8d649924
#35 [ffff8fb8c2febf40] sys_open at ffffffff8d649a3e
#36 [ffff8fb8c2febf50] system_call_fastpath at ffffffff8db8dede
RIP: 00007f9cd089aee0 RSP: 00007fffc76db158 RFLAGS: 00000206
RAX: 0000000000000002 RBX: 00007f9cd00aee88 RCX: ffffffffffffffff
RDX: 0000000000000000 RSI: 0000000000080002 RDI: 00007fffc76db280
RBP: 00007fffc76db280 R8: 0000000000000000 R9: 00007f9cd03072bd
R10: 00007f9cd0cb4740 R11: 0000000000000246 R12: 0000000000000001
R13: 00007fffc76db38c R14: 00007f9cd00aee40 R15: 0000000000000001
ORIG_RAX: 0000000000000002 CS: 0033 SS: 002b
crash> bt 100249
PID: 100249 TASK: ffff8fad01048000 CPU: 27 COMMAND: "nvidia-smi"
#0 [ffff8fbad759fa08] __schedule at ffffffff8db80d4a
#1 [ffff8fbad759fa98] schedule at ffffffff8db811f9
#2 [ffff8fbad759faa8] schedule_timeout at ffffffff8db7ed01
#3 [ffff8fbad759fb50] __down_common at ffffffff8db80607
#4 [ffff8fbad759fbc0] __down at ffffffff8db8067e
#5 [ffff8fbad759fbd0] down at ffffffff8d4cc2c1
#6 [ffff8fbad759fbf0] nvidia_frontend_open at ffffffffc2cb6353 [nvidia]
#7 [ffff8fbad759fc18] chrdev_open at ffffffff8d6501e5
#8 [ffff8fbad759fc60] do_dentry_open at ffffffff8d648356
#9 [ffff8fbad759fca8] vfs_open at ffffffff8d64849a
#10 [ffff8fbad759fcd0] do_last at ffffffff8d6591c6
#11 [ffff8fbad759fd70] path_openat at ffffffff8d65c05d
#12 [ffff8fbad759fe08] do_filp_open at ffffffff8d65d9cd
#13 [ffff8fbad759fee0] do_sys_open at ffffffff8d649924
#14 [ffff8fbad759ff40] sys_open at ffffffff8d649a3e
#15 [ffff8fbad759ff50] system_call_fastpath at ffffffff8db8dede
RIP: 00007f1fdac78ee0 RSP: 00007ffe43693698 RFLAGS: 00000206
RAX: 0000000000000002 RBX: 00007f1fda48ce88 RCX: ffffffffffffffff
RDX: 0000000000000000 RSI: 0000000000080002 RDI: 00007ffe436937c0
RBP: 00007ffe436937c0 R8: 0000000000000000 R9: 00007f1fda6e52bd
R10: 00007f1fdb092740 R11: 0000000000000246 R12: 0000000000000001
R13: 00007ffe436938cc R14: 00007f1fda48ce40 R15: 0000000000000001
ORIG_RAX: 0000000000000002 CS: 0033 SS: 002b
crash> bt 100262
PID: 100262 TASK: ffff8fb8ba21c1c0 CPU: 38 COMMAND: "nvidia-smi"
#0 [ffff8fca47c4ba08] __schedule at ffffffff8db80d4a
#1 [ffff8fca47c4ba98] schedule at ffffffff8db811f9
#2 [ffff8fca47c4baa8] schedule_timeout at ffffffff8db7ed01
#3 [ffff8fca47c4bb50] __down_common at ffffffff8db80607
#4 [ffff8fca47c4bbc0] __down at ffffffff8db8067e
#5 [ffff8fca47c4bbd0] down at ffffffff8d4cc2c1
#6 [ffff8fca47c4bbf0] nvidia_frontend_open at ffffffffc2cb6353 [nvidia]
#7 [ffff8fca47c4bc18] chrdev_open at ffffffff8d6501e5
#8 [ffff8fca47c4bc60] do_dentry_open at ffffffff8d648356
#9 [ffff8fca47c4bca8] vfs_open at ffffffff8d64849a
#10 [ffff8fca47c4bcd0] do_last at ffffffff8d6591c6
#11 [ffff8fca47c4bd70] path_openat at ffffffff8d65c05d
#12 [ffff8fca47c4be08] do_filp_open at ffffffff8d65d9cd
#13 [ffff8fca47c4bee0] do_sys_open at ffffffff8d649924
#14 [ffff8fca47c4bf40] sys_open at ffffffff8d649a3e
#15 [ffff8fca47c4bf50] system_call_fastpath at ffffffff8db8dede
RIP: 00007fe91214fee0 RSP: 00007ffde1eca5c8 RFLAGS: 00000206
RAX: 0000000000000002 RBX: 00007fe911963e20 RCX: ffffffffffffffff
RDX: 0000000000000001 RSI: 0000000000000002 RDI: 00007ffde1eca7e0
RBP: 00007ffde1eca7cc R8: 0000000000000000 R9: 00000000000001b6
R10: 00007fe912569740 R11: 0000000000000246 R12: 00007ffde1eca7e0
R13: 00007ffde1ecaa30 R14: 00007ffde1eca7cc R15: 00007ffde1eca920
ORIG_RAX: 0000000000000002 CS: 0033 SS: 002b
crash>

Does anyone have any idea what would cause this kernel panic?
Do you have any ideas for resolving this issue?
Can you please investigate nvidia.ko from the call trace?

I found a report in the developer forum with the same exception RIP (_nv036002rm+4) and the same message,
"BUG: unable to handle kernel paging request at 0000000000002b20".
Does our panic have the same cause?

Could you let us know the driver version in which this issue is fixed?

We see exactly the same problem on our platform with NVIDIA driver 460.32.03 installed on RHEL 7.9 (kernel-3.10.0-1160.41.1.el7.x86_64).
The GPUs are 8x NVIDIA Tesla M60 8GB.
The system crashes with a kernel panic due to a NULL pointer dereference at 0000000000002b20.
Any idea how to fix it?