Kernel panic due to a NULL pointer dereference at 0000000000002b20

We installed NVIDIA driver 460.32.03 on RHEL 7.7 (kernel-3.10.0-1062.18.1.el7.x86_64).
The GPUs are 4x NVIDIA Tesla V100 SXM2 32GB.
The system crashes with a kernel panic due to a NULL pointer dereference at 0000000000002b20.

The contents of vmcore-dmesg.txt are as follows (please check the attached file for details).

[2883648.914647] BUG: unable to handle kernel paging request at 0000000000002b20
[2883648.914676] IP: [] _nv036002rm+0x4/0x70 [nvidia]
[2883648.914962] PGD 0
[2883648.914971] Oops: 0000 [#1] SMP

The following is the vmcore analysis result.

crash> bt
PID: 0 TASK: ffff8922d330b150 CPU: 2 COMMAND: "swapper/2"
#0 [ffff89417ee839f0] machine_kexec at ffffffff97665b34
#1 [ffff89417ee83a50] __crash_kexec at ffffffff97722592
#2 [ffff89417ee83b20] crash_kexec at ffffffff97722680
#3 [ffff89417ee83b38] oops_end at ffffffff97d85798
#4 [ffff89417ee83b60] no_context at ffffffff97675bb4
#5 [ffff89417ee83bb0] __bad_area_nosemaphore at ffffffff97675e82
#6 [ffff89417ee83c00] bad_area_nosemaphore at ffffffff97675fa4
#7 [ffff89417ee83c10] __do_page_fault at ffffffff97d88750
#8 [ffff89417ee83c80] do_page_fault at ffffffff97d88975
#9 [ffff89417ee83cb0] page_fault at ffffffff97d84778
[exception RIP: _nv036002rm+4]
RIP: ffffffffc111c664 RSP: ffff89417ee83d68 RFLAGS: 00010092
RAX: ffff8926fb04eb28 RBX: 0000000000000000 RCX: 0000000000000000
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000002b20
RBP: ffff894160e1af00 R8: 0000000000000000 R9: ffff89417ee93900
R10: 0000000000000004 R11: 0000000000000005 R12: ffff8926fb04eb28
R13: 0000000000000000 R14: 00000000ec8fff5e R15: 0000000000000001
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
#10 [ffff89417ee83d70] _nv035997rm at ffffffffc111cd3c [nvidia]
#11 [ffff89417ee83da0] _nv009219rm at ffffffffc0cee761 [nvidia]
#12 [ffff89417ee83dd0] _nv036101rm at ffffffffc0cef57c [nvidia]
#13 [ffff89417ee83df0] _nv032953rm at ffffffffc0d38883 [nvidia]
#14 [ffff89417ee83e20] rm_run_rc_callback at ffffffffc15784e6 [nvidia]
#15 [ffff89417ee83e40] nvidia_rc_timer_callback at ffffffffc0cadfdc [nvidia]
#16 [ffff89417ee83e58] nv_timer_callback_typed_data at ffffffffc0cad47d [nvidia]
#17 [ffff89417ee83e68] call_timer_fn at ffffffff976ac488
#18 [ffff89417ee83ea0] run_timer_softirq at ffffffff976ae8ed
#19 [ffff89417ee83f18] __do_softirq at ffffffff976a5435
#20 [ffff89417ee83f88] call_softirq at ffffffff97d9142c
#21 [ffff89417ee83fa0] do_softirq at ffffffff9762f715
#22 [ffff89417ee83fc0] irq_exit at ffffffff976a57b5
#23 [ffff89417ee83fd8] smp_apic_timer_interrupt at ffffffff97d929d8
#24 [ffff89417ee83ff0] apic_timer_interrupt at ffffffff97d8eefa
— —
#25 [ffff8922d331fdb8] apic_timer_interrupt at ffffffff97d8eefa
[exception RIP: cpuidle_enter_state+87]
RIP: ffffffff97bc1c27 RSP: ffff8922d331fe60 RFLAGS: 00000206
RAX: 000a3ea9cb57577e RBX: 0000000000015960 RCX: 0000000000000018
RDX: 0000000225c17d03 RSI: ffff8922d331ffd8 RDI: 000a3ea9cb57577e
RBP: ffff8922d331fe88 R8: 00000000000003dc R9: 000000000000001c
R10: 000000000000013b R11: 7fffffffffffffff R12: 0000000000000001
R13: ffffffff976ca42d R14: ffff8922d331fe28 R15: 0000000000000087
ORIG_RAX: ffffffffffffff10 CS: 0010 SS: 0018
#26 [ffff8922d331fe90] cpuidle_idle_call at ffffffff97bc1d7e
#27 [ffff8922d331fed0] arch_cpu_idle at ffffffff97637c6e
#28 [ffff8922d331fee0] cpu_startup_entry at ffffffff977017da
#29 [ffff8922d331ff28] start_secondary at ffffffff9765a0c7
#30 [ffff8922d331ff50] start_cpu at ffffffff976000d5
crash> mod -t
NAME TAINTS
nvidia_drm POE
overlay T
nvidia POE
nvidia_uvm OE
nvidia_modeset POE

crash> dis -rl ffffffffc111c664
0xffffffffc111c660 <_nv036002rm>: sub $0x8,%rsp
0xffffffffc111c664 <_nv036002rm+4>: mov (%rdi),%r10
crash> dis -rl ffffffffc111cd3c
0xffffffffc111cd10 <_nv035997rm>: push %r14
0xffffffffc111cd12 <_nv035997rm+2>: push %r13
0xffffffffc111cd14 <_nv035997rm+4>: mov %edx,%r14d
0xffffffffc111cd17 <_nv035997rm+7>: push %r12
0xffffffffc111cd19 <_nv035997rm+9>: push %rbx
0xffffffffc111cd1a <_nv035997rm+10>: mov %rdi,%r12
0xffffffffc111cd1d <_nv035997rm+13>: mov %rcx,%r8
0xffffffffc111cd20 <_nv035997rm+16>: xor %edx,%edx
0xffffffffc111cd22 <_nv035997rm+18>: mov %esi,%ecx
0xffffffffc111cd24 <_nv035997rm+20>: sub $0x8,%rsp
0xffffffffc111cd28 <_nv035997rm+24>: mov (%rdi),%rbx
0xffffffffc111cd2b <_nv035997rm+27>: mov %esi,%r13d
0xffffffffc111cd2e <_nv035997rm+30>: xor %esi,%esi
0xffffffffc111cd30 <_nv035997rm+32>: lea 0x2b20(%rbx),%rdi
0xffffffffc111cd37 <_nv035997rm+39>: callq 0xffffffffc111c660 <_nv036002rm>
0xffffffffc111cd3c <_nv035997rm+44>: test %r14d,%eax

crash> rd 0x2b20
rd: invalid kernel virtual address: 58 type: "mm_struct pgd"

It looks like a corrupted value is passed in %RDI, which causes the kernel crash.
We could not trace where the corrupted RDI value comes from.
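
For readers following the disassembly, the faulting RDI can be reproduced from the register dump. This is a minimal Python sketch, assuming that the `lea 0x2b20(%rbx),%rdi` in `_nv035997rm` produced the RDI seen at the fault, and that RBX in the exception frame still holds the value loaded by `mov (%rdi),%rbx` (nothing at `_nv036002rm+4` has modified it yet). The helper name `lea` and the variable names are illustrative only.

```python
# Values copied from the exception frame in the backtrace above.
rbx = 0x0000000000000000           # loaded by: mov (%rdi),%rbx  in _nv035997rm
rdi_at_fault = 0x0000000000002b20  # RDI at _nv036002rm+4
disp = 0x2b20                      # displacement in: lea 0x2b20(%rbx),%rdi

def lea(base: int, displacement: int) -> int:
    """Effective address of `lea disp(base)`, wrapped to 64 bits."""
    return (base + displacement) & 0xFFFFFFFFFFFFFFFF

computed_rdi = lea(rbx, disp)
print(hex(computed_rdi))             # address handed to _nv036002rm
print(computed_rdi == rdi_at_fault)  # does it match the faulting RDI?
```

If the assumption holds, RDI may be a NULL base pointer plus the field offset 0x2b20 rather than a random value. That would suggest the pointer stored at the address in R12 (ffff8926fb04eb28, the first argument of `_nv035997rm`, saved by `mov %rdi,%r12`) was NULL when the timer callback ran.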

The stack trace of nvidia-smi is as follows.

crash> ps | grep nvidia-smi
100249 100247 27 ffff8fad01048000 UN 0.0 17964 3968 nvidia-smi

100251 100248 1 ffff8fb978bfc1c0 RU 0.0 17964 3968 nvidia-smi
100262 100261 38 ffff8fb8ba21c1c0 UN 0.0 17964 3932 nvidia-smi
crash> bt 100251
PID: 100251 TASK: ffff8fb978bfc1c0 CPU: 1 COMMAND: "nvidia-smi"
#0 [ffff8fb77ee48e48] crash_nmi_callback at ffffffff8d458017
#1 [ffff8fb77ee48e58] nmi_handle at ffffffff8db8593c
#2 [ffff8fb77ee48eb0] do_nmi at ffffffff8db85b5d
#3 [ffff8fb77ee48ef0] end_repeat_nmi at ffffffff8db84d9c
[exception RIP: native_queued_spin_lock_slowpath+462]
RIP: ffffffff8d51772e RSP: ffff8fb8c2feb730 RFLAGS: 00000002
RAX: 0000000000000001 RBX: 0000000000000282 RCX: 0000000000000001
RDX: 0000000000000101 RSI: 0000000000000001 RDI: ffff8fd77bcdce28
RBP: ffff8fb8c2feb730 R8: 0000000000000101 R9: ffff8fb77d3d6180
R10: ffff8f98ffc07500 R11: 000000000000c6e0 R12: 0000000000000000
R13: 0000000000000001 R14: 0000000000000200 R15: ffff8f9882deabbc
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
— —
#4 [ffff8fb8c2feb730] native_queued_spin_lock_slowpath at ffffffff8d51772e
#5 [ffff8fb8c2feb738] queued_spin_lock_slowpath at ffffffff8db754ee
#6 [ffff8fb8c2feb748] _raw_spin_lock_irqsave at ffffffff8db83ba7
#7 [ffff8fb8c2feb760] os_acquire_spinlock at ffffffffc2cc6fd2 [nvidia]
#8 [ffff8fb8c2feb778] _nv035354rm at ffffffffc3426aac [nvidia]
#9 [ffff8fb8c2feb788] _nv009219rm at ffffffffc2cf70d9 [nvidia]
#10 [ffff8fb8c2feb7b8] _nv036101rm at ffffffffc2cf857c [nvidia]
#11 [ffff8fb8c2feb7d8] _nv036309rm at ffffffffc2d61c6a [nvidia]
#12 [ffff8fb8c2feb808] _nv002414rm at ffffffffc34383de [nvidia]
#13 [ffff8fb8c2feb838] _nv003464rm at ffffffffc34385ca [nvidia]
#14 [ffff8fb8c2feb868] _nv036073rm at ffffffffc2d60af1 [nvidia]
#15 [ffff8fb8c2feb888] _nv037876rm at ffffffffc3427e89 [nvidia]
#16 [ffff8fb8c2feb8b8] _nv037878rm at ffffffffc342811d [nvidia]
#17 [ffff8fb8c2feb8e8] _nv036171rm at ffffffffc2d60955 [nvidia]
#18 [ffff8fb8c2feb918] _nv036170rm at ffffffffc2d60811 [nvidia]
#19 [ffff8fb8c2feb948] _nv024237rm at ffffffffc31b93ac [nvidia]
#20 [ffff8fb8c2feb978] _nv021544rm at ffffffffc311669d [nvidia]
#21 [ffff8fb8c2feb9a8] _nv021541rm at ffffffffc3113604 [nvidia]
#22 [ffff8fb8c2feb9d8] _nv021993rm at ffffffffc31159f9 [nvidia]
#23 [ffff8fb8c2feba08] _nv022246rm at ffffffffc2d23671 [nvidia]
#24 [ffff8fb8c2feba68] rm_init_adapter at ffffffffc357f985 [nvidia]
#25 [ffff8fb8c2febb28] nv_open_device at ffffffffc2cb7d01 [nvidia]
#26 [ffff8fb8c2febb90] nvidia_open at ffffffffc2cb85bb [nvidia]
#27 [ffff8fb8c2febbf0] nvidia_frontend_open at ffffffffc2cb6388 [nvidia]
#28 [ffff8fb8c2febc18] chrdev_open at ffffffff8d6501e5
#29 [ffff8fb8c2febc60] do_dentry_open at ffffffff8d648356
#30 [ffff8fb8c2febca8] vfs_open at ffffffff8d64849a
#31 [ffff8fb8c2febcd0] do_last at ffffffff8d6591c6
#32 [ffff8fb8c2febd70] path_openat at ffffffff8d65c05d
#33 [ffff8fb8c2febe08] do_filp_open at ffffffff8d65d9cd
#34 [ffff8fb8c2febee0] do_sys_open at ffffffff8d649924
#35 [ffff8fb8c2febf40] sys_open at ffffffff8d649a3e
#36 [ffff8fb8c2febf50] system_call_fastpath at ffffffff8db8dede
RIP: 00007f9cd089aee0 RSP: 00007fffc76db158 RFLAGS: 00000206
RAX: 0000000000000002 RBX: 00007f9cd00aee88 RCX: ffffffffffffffff
RDX: 0000000000000000 RSI: 0000000000080002 RDI: 00007fffc76db280
RBP: 00007fffc76db280 R8: 0000000000000000 R9: 00007f9cd03072bd
R10: 00007f9cd0cb4740 R11: 0000000000000246 R12: 0000000000000001
R13: 00007fffc76db38c R14: 00007f9cd00aee40 R15: 0000000000000001
ORIG_RAX: 0000000000000002 CS: 0033 SS: 002b
crash> bt 100249
PID: 100249 TASK: ffff8fad01048000 CPU: 27 COMMAND: "nvidia-smi"
#0 [ffff8fbad759fa08] __schedule at ffffffff8db80d4a
#1 [ffff8fbad759fa98] schedule at ffffffff8db811f9
#2 [ffff8fbad759faa8] schedule_timeout at ffffffff8db7ed01
#3 [ffff8fbad759fb50] __down_common at ffffffff8db80607
#4 [ffff8fbad759fbc0] __down at ffffffff8db8067e
#5 [ffff8fbad759fbd0] down at ffffffff8d4cc2c1
#6 [ffff8fbad759fbf0] nvidia_frontend_open at ffffffffc2cb6353 [nvidia]
#7 [ffff8fbad759fc18] chrdev_open at ffffffff8d6501e5
#8 [ffff8fbad759fc60] do_dentry_open at ffffffff8d648356
#9 [ffff8fbad759fca8] vfs_open at ffffffff8d64849a
#10 [ffff8fbad759fcd0] do_last at ffffffff8d6591c6
#11 [ffff8fbad759fd70] path_openat at ffffffff8d65c05d
#12 [ffff8fbad759fe08] do_filp_open at ffffffff8d65d9cd
#13 [ffff8fbad759fee0] do_sys_open at ffffffff8d649924
#14 [ffff8fbad759ff40] sys_open at ffffffff8d649a3e
#15 [ffff8fbad759ff50] system_call_fastpath at ffffffff8db8dede
RIP: 00007f1fdac78ee0 RSP: 00007ffe43693698 RFLAGS: 00000206
RAX: 0000000000000002 RBX: 00007f1fda48ce88 RCX: ffffffffffffffff
RDX: 0000000000000000 RSI: 0000000000080002 RDI: 00007ffe436937c0
RBP: 00007ffe436937c0 R8: 0000000000000000 R9: 00007f1fda6e52bd
R10: 00007f1fdb092740 R11: 0000000000000246 R12: 0000000000000001
R13: 00007ffe436938cc R14: 00007f1fda48ce40 R15: 0000000000000001
ORIG_RAX: 0000000000000002 CS: 0033 SS: 002b
crash> bt 100262
PID: 100262 TASK: ffff8fb8ba21c1c0 CPU: 38 COMMAND: "nvidia-smi"
#0 [ffff8fca47c4ba08] __schedule at ffffffff8db80d4a
#1 [ffff8fca47c4ba98] schedule at ffffffff8db811f9
#2 [ffff8fca47c4baa8] schedule_timeout at ffffffff8db7ed01
#3 [ffff8fca47c4bb50] __down_common at ffffffff8db80607
#4 [ffff8fca47c4bbc0] __down at ffffffff8db8067e
#5 [ffff8fca47c4bbd0] down at ffffffff8d4cc2c1
#6 [ffff8fca47c4bbf0] nvidia_frontend_open at ffffffffc2cb6353 [nvidia]
#7 [ffff8fca47c4bc18] chrdev_open at ffffffff8d6501e5
#8 [ffff8fca47c4bc60] do_dentry_open at ffffffff8d648356
#9 [ffff8fca47c4bca8] vfs_open at ffffffff8d64849a
#10 [ffff8fca47c4bcd0] do_last at ffffffff8d6591c6
#11 [ffff8fca47c4bd70] path_openat at ffffffff8d65c05d
#12 [ffff8fca47c4be08] do_filp_open at ffffffff8d65d9cd
#13 [ffff8fca47c4bee0] do_sys_open at ffffffff8d649924
#14 [ffff8fca47c4bf40] sys_open at ffffffff8d649a3e
#15 [ffff8fca47c4bf50] system_call_fastpath at ffffffff8db8dede
RIP: 00007fe91214fee0 RSP: 00007ffde1eca5c8 RFLAGS: 00000206
RAX: 0000000000000002 RBX: 00007fe911963e20 RCX: ffffffffffffffff
RDX: 0000000000000001 RSI: 0000000000000002 RDI: 00007ffde1eca7e0
RBP: 00007ffde1eca7cc R8: 0000000000000000 R9: 00000000000001b6
R10: 00007fe912569740 R11: 0000000000000246 R12: 00007ffde1eca7e0
R13: 00007ffde1ecaa30 R14: 00007ffde1eca7cc R15: 00007ffde1eca920
ORIG_RAX: 0000000000000002 CS: 0033 SS: 002b
crash>

Does anyone have any idea what would cause this kernel panic?
Do you have any ideas for resolving this issue?
Can you please investigate nvidia.ko from the call trace?

I found a report in the developer forum with the same exception RIP (_nv036002rm+4) and the same message,
"BUG: unable to handle kernel paging request at 0000000000002b20".
Does our panic have the same cause?
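
To compare two reports like this, it can help to reduce each oops to a small signature. The following is a rough Python sketch, not an official tool: the regular expressions and the function name `oops_signature` are my own assumptions, written against the two message formats a kernel oops commonly uses for the fault address.

```python
import re

# Fault address: covers the two message styles,
# "... at 0000000000002b20" and "..., address: 00000000000000b1".
BUG_RE = re.compile(r"BUG: .*?(?:\bat|address:)\s+([0-9a-f]{16})")
# Faulting instruction: "symbol+0xOFFSET/0xSIZE [module]".
IP_RE = re.compile(r"(\w+)\+0x([0-9a-f]+)/0x([0-9a-f]+) \[(\w+)\]")

def oops_signature(text):
    """Return (fault_address, symbol, offset, size, module), or None."""
    bug = BUG_RE.search(text)
    ip = IP_RE.search(text)
    if bug is None or ip is None:
        return None
    symbol, offset, size, module = ip.groups()
    return (int(bug.group(1), 16), symbol,
            int(offset, 16), int(size, 16), module)

# Lines quoted from the vmcore-dmesg.txt excerpt above.
first_report = (
    "BUG: unable to handle kernel paging request at 0000000000002b20\n"
    "IP: [] _nv036002rm+0x4/0x70 [nvidia]\n"
)
print(oops_signature(first_report))
```

The `_nvNNNNNNrm` names are obfuscated and may differ between driver builds, so the offset, function size, and fault address are probably the more useful fields to compare across reports.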

Could you let us know which driver version fixes this issue?

We have exactly the same problem on our platform with NVIDIA driver 460.32.03 installed on RHEL 7.9 (kernel-3.10.0-1160.41.1.el7.x86_64).
The GPUs are 8x NVIDIA Tesla M60 8GB.
The system crashes with a kernel panic due to a NULL pointer dereference at 0000000000002b20.
Any idea how to fix it?

We have much the same problem on several servers.
We have updated the drivers a few times in the meantime, but the problem persists.
Driver version: 470.74
CentOS 7.9.2009 (3.10.0-1160.42.2.el7.x86_64)
GPUs:
NVIDIA RTX 2080 Ti x8. (Other compute nodes have different GPUs and show the same results.)
The symptoms started a few months ago (we update the drivers regularly).

vmcore-dmesg.txt:

[ 4390.140350] BUG: unable to handle kernel NULL pointer dereference at 00000000000000b1
[ 4390.140392] IP: [<ffffffffc2e229f9>] _nv031699rm+0x79/0x940 [nvidia]
[ 4390.140704] PGD 8000002eb74b2067 PUD 1da8038067 PMD 0
[ 4390.140726] Oops: 0000 [#1] SMP
[ 4390.140740] Modules linked in: nvidia_uvm(OE) squashfs loop overlay(T) nvidia_drm(POE) nvidia_modeset(POE) nvidia(POE) rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache bonding sunrpc iTCO_wdt iTCO_vendor_support skx_edac vfat intel_powerclamp fat coretemp intel_rapl snd_hda_codec_hdmi iosf_mbi kvm_intel kvm irqbypass crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd snd_hda_intel pcspkr snd_hda_codec snd_seq snd_hda_core snd_seq_device snd_hwdep snd_pcm snd_timer snd soundcore sg lpc_ich joydev i2c_i801 mei_me mei ipmi_si ipmi_devintf ipmi_msghandler dm_multipath acpi_power_meter acpi_pad ip_tables xfs libcrc32c sd_mod crc_t10dif crct10dif_generic ast i2c_algo_bit drm_kms_helper i40e ttm ahci crct10dif_pclmul syscopyarea sysfillrect crct10dif_common
[ 4390.141060]  sysimgblt fb_sys_fops crc32c_intel libahci drm libata ptp pps_core drm_panel_orientation_quirks wmi nfit libnvdimm dm_mirror dm_region_hash dm_log dm_mod [last unloaded: nvidia]
[ 4390.141134] CPU: 2 PID: 54583 Comm: python3 Kdump: loaded Tainted: P           OE  ------------ T 3.10.0-1160.42.2.el7.x86_64 #1
[ 4390.141169] Hardware name: Supermicro SYS-4029GP-TRT/X11DPG-OT-CPU, BIOS 3.0c 04/09/2019
[ 4390.141195] task: ffff907f714d5280 ti: ffff907e4f354000 task.ti: ffff907e4f354000
[ 4390.141219] RIP: 0010:[<ffffffffc2e229f9>]  [<ffffffffc2e229f9>] _nv031699rm+0x79/0x940 [nvidia]
[ 4390.141528] RSP: 0018:ffff907e4f3578c0  EFLAGS: 00010202
[ 4390.141546] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000006
[ 4390.141569] RDX: ffff90a5fbdec008 RSI: ffff90af7d174008 RDI: ffff90a34e1c4008
[ 4390.141592] RBP: ffff90af66b52d80 R08: ffff90af7a60a000 R09: 0000000180040003
[ 4390.141614] R10: 0000000000000001 R11: ffff90af7a60a000 R12: ffff90af66b52dc8
[ 4390.141636] R13: 0000000000000003 R14: ffff90af7d174008 R15: 0000000000000001
[ 4390.141659] FS:  00007f4228140740(0000) GS:ffff907f7fe80000(0000) knlGS:0000000000000000
[ 4390.141684] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 4390.141703] CR2: 00000000000000b1 CR3: 000000211b44e000 CR4: 00000000007607e0
[ 4390.141726] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 4390.141748] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 4390.141770] PKRU: 55555554
[ 4390.141780] Call Trace:
[ 4390.142055]  [<ffffffffc2e1e262>] ? _nv031813rm+0x82/0x270 [nvidia]
[ 4390.142295]  [<ffffffffc2e1e567>] ? _nv031846rm+0x17/0x30 [nvidia]
[ 4390.142544]  [<ffffffffc2cf9e80>] ? _nv022821rm+0xc0/0x1b0 [nvidia]
[ 4390.142789]  [<ffffffffc2cf7b4b>] ? _nv022826rm+0x11b/0x230 [nvidia]
[ 4390.143033]  [<ffffffffc2cf7c41>] ? _nv022826rm+0x211/0x230 [nvidia]
[ 4390.143276]  [<ffffffffc2cf7a30>] ? _nv022828rm+0x310/0x310 [nvidia]
[ 4390.143409]  [<ffffffffc28d9b5d>] ? _nv023498rm+0x32d/0x470 [nvidia]
[ 4390.143540]  [<ffffffffc28d9b34>] ? _nv023498rm+0x304/0x470 [nvidia]
[ 4390.143676]  [<ffffffffc290eaea>] ? _nv000722rm+0x32a/0x680 [nvidia]
[ 4390.143858]  [<ffffffffc32415a2>] ? _nv000715rm+0x1802/0x23d0 [nvidia]
[ 4390.144040]  [<ffffffffc3239c45>] ? rm_init_adapter+0xc5/0xe0 [nvidia]
[ 4390.144151]  [<ffffffffc2892741>] ? nv_open_device+0x281/0x860 [nvidia]
[ 4390.144262]  [<ffffffffc2892ffb>] ? nvidia_open+0x2db/0x540 [nvidia]
[ 4390.144376]  [<ffffffffc28a6038>] ? nvidia_frontend_open+0x58/0xb0 [nvidia]
[ 4390.145222]  [<ffffffffb5053be5>] ? chrdev_open+0xb5/0x1b0
[ 4390.146038]  [<ffffffffb504bc92>] ? do_dentry_open+0x1e2/0x2d0
[ 4390.146848]  [<ffffffffb5108fd2>] ? security_inode_permission+0x22/0x30
[ 4390.147649]  [<ffffffffb5053b30>] ? cdev_put+0x30/0x30
[ 4390.148433]  [<ffffffffb504be1a>] ? vfs_open+0x5a/0xb0
[ 4390.149206]  [<ffffffffb505a323>] ? may_open+0xa3/0x120
[ 4390.149963]  [<ffffffffb505e206>] ? do_last+0x1f6/0x1340
[ 4390.150719]  [<ffffffffb50293a6>] ? kmem_cache_alloc_trace+0x1d6/0x200
[ 4390.151461]  [<ffffffffb505f41d>] ? path_openat+0xcd/0x5a0
[ 4390.152187]  [<ffffffffb5061542>] ? user_path_at_empty+0x72/0xc0
[ 4390.152891]  [<ffffffffb504b11a>] ? __check_object_size+0x1ca/0x250
[ 4390.153570]  [<ffffffffb506166d>] ? do_filp_open+0x4d/0xb0
[ 4390.154228]  [<ffffffffb506f767>] ? __alloc_fd+0x47/0x170
[ 4390.154870]  [<ffffffffb504d404>] ? do_sys_open+0x124/0x220
[ 4390.155486]  [<ffffffffb504d51e>] ? SyS_open+0x1e/0x20
[ 4390.156081]  [<ffffffffb5595f92>] ? system_call_fastpath+0x25/0x2a

I can also provide the vmcore file if needed.

We seem to have the same issue:

Ubuntu 20.04.3 LTS (GNU/Linux 5.4.0-91-generic x86_64)
Driver Version: 470.94 and 495.46 (similar errors with both versions)

GPUs: 8x RTX 2080 Ti

[361211.683423] BUG: kernel NULL pointer dereference, address: 00000000000000b1
[361211.683776] #PF: supervisor read access in kernel mode
[361211.684109] #PF: error_code(0x0000) - not-present page
[361211.684431] PGD 0 P4D 0
[361211.684846] Oops: 0000 [#1] SMP PTI
[361211.685247] CPU: 23 PID: 290403 Comm: python Tainted: P           OE     5.4.0-91-generic #102-Ubuntu
[361211.685679] Hardware name: Supermicro SYS-4029GP-TRT/X11DPG-OT-CPU, BIOS 3.0c 04/09/2019
[361211.686346] RIP: 0010:_nv030807rm+0x79/0x940 [nvidia]
[361211.686710] Code: 07 00 00 41 bf 01 00 00 00 4c 8d 65 48 31 db 44 89 7d 10 66 0f 1f 44 00 00 41 f6 c5 01 0f 84 8d 00 00 00 49 8b 86 20 18 00 00 <80> b8 b1 00 00 00 00 74 12 b8 01 00 00 00 89 d9 d3 e0 41 85 86 84
[361211.687492] RSP: 0000:ffffa7900ee138f0 EFLAGS: 00010202
[361211.687894] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000005
[361211.688307] RDX: ffff96820f7c4008 RSI: ffff9683d538c008 RDI: ffff9682ab17c008
[361211.688805] RBP: ffff9683c7c25db0 R08: ffff9683df8f01c0 R09: ffff966be0406680
[361211.689328] R10: 0000000000000000 R11: 0000000000000001 R12: ffff9683c7c25df8
[361211.689820] R13: 0000000000000003 R14: ffff9683d538c008 R15: 0000000000000001
[361211.690247] FS:  00007fde6c83b340(0000) GS:ffff9683df8c0000(0000) knlGS:0000000000000000
[361211.690682] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[361211.691121] CR2: 00000000000000b1 CR3: 000000133a2f2004 CR4: 00000000007606e0
[361211.691572] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[361211.692025] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[361211.692496] PKRU: 55555554
[361211.693058] Call Trace:
[361211.693909]  ? _nv030918rm+0x79/0x260 [nvidia]
[361211.694582]  ? _nv030952rm+0x10/0x20 [nvidia]
[361211.695229]  ? _nv022367rm+0xd5/0x1a0 [nvidia]
[361211.695882]  ? _nv022372rm+0x17a/0x250 [nvidia]
[361211.696570]  ? _nv022372rm+0x109/0x250 [nvidia]
[361211.697247]  ? _nv022959rm+0x2fa/0x410 [nvidia]
[361211.697878]  ? _nv022959rm+0x2c6/0x410 [nvidia]
[361211.698434]  ? _nv000716rm+0x33a/0x680 [nvidia]
[361211.699044]  ? _nv000709rm+0x1520/0x2160 [nvidia]
[361211.699657]  ? rm_init_adapter+0xc5/0xe0 [nvidia]
[361211.700205]  ? nv_open_device+0x511/0x920 [nvidia]
[361211.700819]  ? nvidia_open+0x2d8/0x580 [nvidia]
[361211.701471]  ? nvidia_frontend_open+0x58/0xa0 [nvidia]
[361211.701969]  ? chrdev_open+0xd3/0x1c0
[361211.702428]  ? cdev_default_release+0x20/0x20
[361211.702893]  ? do_dentry_open+0x143/0x3a0
[361211.703364]  ? vfs_open+0x2d/0x30
[361211.703836]  ? do_last+0x194/0x900
[361211.704302]  ? path_openat+0x8d/0x290
[361211.704853]  ? do_filp_open+0x91/0x100
[361211.705433]  ? __alloc_fd+0x46/0x150
[361211.705952]  ? do_sys_open+0x17e/0x290
[361211.706407]  ? __x64_sys_openat+0x20/0x30
[361211.706857]  ? do_syscall_64+0x57/0x190
[361211.707301]  ? entry_SYSCALL_64_after_hwframe+0x44/0xa9
[361211.707735] Modules linked in: xt_CHECKSUM xt_MASQUERADE xt_conntrack ipt_REJECT nf_reject_ipv4 xt_tcpudp ip6table_mangle ip6table_nat iptable_mangle iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 nf_tables nfnetlink ip6table_filter ip6_tables iptable_filter bpfilter bridge stp llc dm_multipath scsi_dh_rdac scsi_dh_emc scsi_dh_alua intel_rapl_msr intel_rapl_common ipmi_ssif isst_if_common skx_edac nfit snd_hda_codec_hdmi x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm rapl intel_cstate joydev input_leds ucsi_ccg typec_ucsi typec snd_hda_intel snd_intel_dspcfg snd_hda_codec snd_hda_core snd_hwdep snd_pcm snd_timer snd soundcore ipmi_si mei_me ioatdma ipmi_devintf dca mei ipmi_msghandler mac_hid acpi_pad acpi_power_meter nvidia_uvm(OE) sch_fq_codel msr ip_tables x_tables autofs4 btrfs zstd_compress raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear nvidia_drm(POE) nvidia_modeset(POE) ast
[361211.707772]  drm_vram_helper i2c_algo_bit nvidia(POE) ttm drm_kms_helper crct10dif_pclmul syscopyarea hid_generic crc32_pclmul sysfillrect sysimgblt fb_sys_fops ghash_clmulni_intel aesni_intel crypto_simd usbhid cryptd i40e i2c_nvidia_gpu drm glue_helper hid lpc_ich i2c_i801 ahci libahci wmi
[361211.713797] CR2: 00000000000000b1
[361211.714321] ---[ end trace 05456c2561cee086 ]---
[361211.780543] RIP: 0010:_nv030807rm+0x79/0x940 [nvidia]
[361211.781210] Code: 07 00 00 41 bf 01 00 00 00 4c 8d 65 48 31 db 44 89 7d 10 66 0f 1f 44 00 00 41 f6 c5 01 0f 84 8d 00 00 00 49 8b 86 20 18 00 00 <80> b8 b1 00 00 00 00 74 12 b8 01 00 00 00 89 d9 d3 e0 41 85 86 84
[361211.782393] RSP: 0000:ffffa7900ee138f0 EFLAGS: 00010202
[361211.782943] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000005
[361211.783494] RDX: ffff96820f7c4008 RSI: ffff9683d538c008 RDI: ffff9682ab17c008
[361211.784051] RBP: ffff9683c7c25db0 R08: ffff9683df8f01c0 R09: ffff966be0406680
[361211.784665] R10: 0000000000000000 R11: 0000000000000001 R12: ffff9683c7c25df8
[361211.785362] R13: 0000000000000003 R14: ffff9683d538c008 R15: 0000000000000001
[361211.785988] FS:  00007fde6c83b340(0000) GS:ffff9683df8c0000(0000) knlGS:0000000000000000
[361211.786559] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[361211.787134] CR2: 00000000000000b1 CR3: 000000133a2f2004 CR4: 00000000007606e0
[361211.787719] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[361211.788309] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[361211.789022] PKRU: 55555554
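
The `Code:` line in this oops can be decoded by hand to see which instruction faulted. Below is a small Python sketch that does so for this one encoding only. It assumes the byte marked `<80>` is the first byte of the faulting instruction (the kernel marks the byte at RIP), and the helper names `faulting_bytes` and `decode_cmpb_disp32` are made up for illustration. It is not a general x86 disassembler.

```python
# Hex bytes from the "Code:" line above; <80> marks the byte at RIP.
CODE = ("07 00 00 41 bf 01 00 00 00 4c 8d 65 48 31 db 44 89 7d 10 66 0f 1f 44 "
        "00 00 41 f6 c5 01 0f 84 8d 00 00 00 49 8b 86 20 18 00 00 <80> b8 b1 "
        "00 00 00 00 74 12 b8 01 00 00 00 89 d9 d3 e0 41 85 86 84")

REGS = ["rax", "rcx", "rdx", "rbx", "rsp", "rbp", "rsi", "rdi"]

def faulting_bytes(code_line: str) -> bytes:
    """Return the bytes starting at the <xx> marker of a 'Code:' line."""
    tokens = code_line.split()
    start = next(i for i, t in enumerate(tokens) if t.startswith("<"))
    return bytes(int(t.strip("<>"), 16) for t in tokens[start:])

def decode_cmpb_disp32(insn: bytes):
    """Decode opcode 80 /7 with ModRM mod=10: cmpb $imm8, disp32(%reg).

    Only this single encoding is handled; anything else raises ValueError.
    """
    opcode, modrm = insn[0], insn[1]
    mod, reg, rm = modrm >> 6, (modrm >> 3) & 7, modrm & 7
    if opcode != 0x80 or mod != 0b10 or reg != 7 or rm == 0b100:
        raise ValueError("not a cmpb $imm8, disp32(%reg) encoding")
    disp = int.from_bytes(insn[2:6], "little", signed=True)
    return rm, disp, insn[6]

rm, disp, imm = decode_cmpb_disp32(faulting_bytes(CODE))
rax = 0x0  # RAX from the register dump above
print(f"cmpb ${imm:#x},{disp:#x}(%{REGS[rm]})")
print(hex(rax + disp))  # compare with CR2 in the dump
```

Read this way, the faulting instruction is a one-byte compare at offset 0xb1 from RAX, and RAX is 0 in the register dump. The seven bytes just before the marker (`49 8b 86 20 18 00 00`) appear to encode `mov 0x1820(%r14),%rax`, so the NULL value was likely loaded from offset 0x1820 of the structure R14 points to. What that structure and field are cannot be determined from the trace alone.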

We are facing a similar problem. Has it been solved? Thanks!

Has anyone been able to resolve this problem? We’re having the same issue here. Thanks.