CentOS 7 & 8 driver 470.57.02 BUG: unable to handle kernel NULL pointer dereference at 00000000000000b1

We are seeing intermittent but frequent crashes (for example, 13 on one system in July) with driver 470.57.02 on CentOS 7 (kernel 3.10.0-1160.31.1.el7.x86_64) systems with both RTX 8000s and RTX 5000s, and on CentOS 8 (kernel 4.18.0-305.12.1.el8_4.x86_64) systems with A6000 GPUs. These are server boxes with 8 or 10 GPUs installed. I will upload a bug report from an EL7 and an EL8 system, along with the associated vmcore-dmesg.txt files. I'll just paste the EL7 oops here. We are stuck; any help would be very much appreciated.

Thanks,
Steve

Stephen Nadas | System Architect | nadas@bu.edu
College of Arts and Sciences | Department of Computer Science
Boston University | (617) 358-8450 | he/him/his

[526138.289501] nvidia 0000:1e:00.0: irq 151 for MSI/MSI-X
[526139.607104] nvidia 0000:1f:00.0: irq 152 for MSI/MSI-X
[526140.424254] nvidia 0000:20:00.0: irq 153 for MSI/MSI-X
[526141.582669] BUG: unable to handle kernel NULL pointer dereference at 00000000000000b1
[526141.582694] IP: [] _nv031680rm+0x79/0x940 [nvidia]
[526141.582898] PGD 5f7ca51067 PUD 5d48ae8067 PMD 0
[526141.582910] Oops: 0000 [#1] SMP
[526141.582918] Modules linked in: nfsv3 nfs_acl nfs fscache xt_CHECKSUM ipt_MASQUERADE nf_nat_masquerade_ipv4 tun devlink ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 ipt_REJECT nf$
[526141.583132] cryptd snd_hda_codec_hdmi pcspkr snd_hda_intel snd_hda_codec snd_hda_core snd_seq snd_hwdep snd_seq_device snd_pcm snd_timer joydev snd soundcore sg mei_me lpc_i$
[526141.583307] CPU: 3 PID: 92586 Comm: nvidia-smi Kdump: loaded Tainted: P OE ------------ 3.10.0-1160.31.1.el7.x86_64 #1
[526141.583329] Hardware name: ASUSTeK COMPUTER INC. ESC8000 G4/Z11PG-D24 Series, BIOS 6701 10/28/2020
[526141.583346] task: ffff8c207d47d280 ti: ffff8c20815c0000 task.ti: ffff8c20815c0000
[526141.583360] RIP: 0010:[] [] _nv031680rm+0x79/0x940 [nvidia]
[526141.583562] RSP: 0018:ffff8c20815c38d0 EFLAGS: 00010202
[526141.583572] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000007
[526141.583586] RDX: ffff8c20ab0e8008 RSI: ffff8c207d8cc008 RDI: ffff8c2079b84008
[526141.583600] RBP: ffff8c2087e2ad80 R08: 000000000001f1c0 R09: ffffffffc0c5122e
[526141.583613] R10: ffff8c20bf0df1c0 R11: ffffd14f3d124600 R12: ffff8c2087e2adc8
[526141.583627] R13: 0000000000000001 R14: ffff8c207d8cc008 R15: 0000000000000001
[526141.583642] FS: 00007f3d89308740(0000) GS:ffff8c20bf0c0000(0000) knlGS:0000000000000000
[526141.583657] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[526141.583667] CR2: 00000000000000b1 CR3: 0000005f65028000 CR4: 00000000007607e0
[526141.583681] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[526141.583693] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[526141.583706] PKRU: 55555554
[526141.583710] Call Trace:
[526141.583874] [] ? _nv031794rm+0x82/0x270 [nvidia]
[526141.584044] [] ? _nv031827rm+0x17/0x30 [nvidia]
[526141.584213] [] ? _nv031830rm+0x70/0x70 [nvidia]
[526141.584385] [] ? _nv022802rm+0xc0/0x1b0 [nvidia]
[526141.584554] [] ? _nv022807rm+0x11b/0x230 [nvidia]
[526141.584727] [] ? _nv022807rm+0x211/0x230 [nvidia]
[526141.584896] [] ? _nv022809rm+0x310/0x310 [nvidia]
[526141.584986] [] ? _nv023479rm+0x32d/0x470 [nvidia]
[526141.585075] [] ? _nv023479rm+0x304/0x470 [nvidia]
[526141.585169] [] ? _nv000721rm+0x32a/0x680 [nvidia]
[526141.585291] [] ? _nv000714rm+0x1735/0x22e0 [nvidia]
[526141.585413] [] ? rm_init_adapter+0xc5/0xe0 [nvidia]
[526141.586055] [] ? nv_open_device+0x281/0x860 [nvidia]
[526141.586685] [] ? nvidia_open+0x2db/0x540 [nvidia]
[526141.587308] [] ? nvidia_frontend_open+0x58/0xb0 [nvidia]
[526141.587867] [] ? chrdev_open+0xb5/0x1b0
[526141.588414] [] ? do_dentry_open+0x1e2/0x2d0
[526141.588954] [] ? security_inode_permission+0x22/0x30
[526141.589489] [] ? cdev_put+0x30/0x30
[526141.590022] [] ? vfs_open+0x5a/0xb0
[526141.590536] [] ? may_open+0xa3/0x120
[526141.591033] [] ? do_last+0x1f6/0x1340
[526141.591514] [] ? kmem_cache_alloc_trace+0x1d6/0x200
[526141.592018] [] ? path_openat+0xcd/0x5a0
[526141.592497] [] ? user_path_at_empty+0x72/0xc0
[526141.592971] [] ? __check_object_size+0x1ca/0x250
[526141.593393] [] ? do_filp_open+0x4d/0xb0
[526141.593801] [] ? __alloc_fd+0x47/0x170
[526141.594191] [] ? do_sys_open+0x124/0x220
[526141.594566] [] ? SyS_open+0x1e/0x20
[526141.594925] [] ? system_call_fastpath+0x25/0x2a
[526141.595270] Code: a7 07 00 00 41 bf 01 00 00 00 4c 8d 65 48 31 db 44 89 7d 10 66 0f 1f 44 00 00 41 f6 c5 01 0f 84 90 00 00 00 49 8b 86 20 1a 00 00 <80> b8 b1 00 00 00 00 74 1$
[526141.596044] RIP [] _nv031680rm+0x79/0x940 [nvidia]
[526141.596554] RSP
[526141.596895] CR2: 00000000000000b1
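
For anyone triaging this, here is a minimal sketch (my own helper script, not part of any NVIDIA or kernel tooling) that pulls the key fields out of the oops and hand-decodes the faulting instruction marked with <..> on the Code: line. The function names summarize_oops and decode_faulting_cmp are hypothetical, and the decoder is deliberately limited to the single encoding seen in this trace (opcode 80 /7 with a 32-bit displacement). If I read the bytes correctly, the marked instruction is cmpb $0x0,0xb1(%rax), and with RAX = 0 that gives exactly the CR2 value of 0xb1. The preceding bytes 49 8b 86 20 1a 00 00 look like mov 0x1a20(%r14),%rax, so the pointer loaded just before the fault appears to have been NULL.

```python
import re

# Excerpt of the EL7 oops pasted above (only the lines the sketch needs).
OOPS = """\
[526141.583307] CPU: 3 PID: 92586 Comm: nvidia-smi Kdump: loaded Tainted: P OE ------------ 3.10.0-1160.31.1.el7.x86_64 #1
[526141.583572] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000007
[526141.583667] CR2: 00000000000000b1 CR3: 0000005f65028000 CR4: 00000000007607e0
[526141.595270] Code: a7 07 00 00 41 bf 01 00 00 00 4c 8d 65 48 31 db 44 89 7d 10 66 0f 1f 44 00 00 41 f6 c5 01 0f 84 90 00 00 00 49 8b 86 20 1a 00 00 <80> b8 b1 00 00 00 00 74 1$
"""


def summarize_oops(text):
    """Extract process name, PID, RAX and CR2 from an oops excerpt."""
    def grab(pattern):
        match = re.search(pattern, text)
        if match is None:
            raise ValueError(f"pattern not found: {pattern}")
        return match.group(1)

    return {
        "comm": grab(r"Comm: (\S+)"),
        "pid": int(grab(r"PID: (\d+)")),
        "rax": int(grab(r"RAX: ([0-9a-f]+)"), 16),
        "cr2": int(grab(r"CR2: ([0-9a-f]+)"), 16),
    }


def decode_faulting_cmp(text):
    """Decode the instruction at the <..> marker on the Code: line.

    Handles only 'cmpb $imm8, disp32(%reg)' (opcode 0x80, ModRM mod=10,
    reg=/7, no SIB byte, no REX prefix). Returns (rm, disp32, imm8).
    """
    code = re.search(r"Code: (.*)", text).group(1)
    tail = code[code.index("<"):].replace("<", "").replace(">", "").split()
    opcode, modrm = int(tail[0], 16), int(tail[1], 16)
    mod, reg, rm = modrm >> 6, (modrm >> 3) & 7, modrm & 7
    if opcode != 0x80 or mod != 2 or reg != 7 or rm == 4:
        raise ValueError("not the cmpb $imm8, disp32(%reg) form handled here")
    # disp32 is stored little-endian right after the ModRM byte.
    disp = int.from_bytes(bytes(int(b, 16) for b in tail[2:6]), "little")
    imm = int(tail[6], 16)
    return rm, disp, imm


if __name__ == "__main__":
    info = summarize_oops(OOPS)
    rm, disp, imm = decode_faulting_cmp(OOPS)
    print(f"process {info['comm']} (pid {info['pid']})")
    print(f"faulting access: base register #{rm} + {disp:#x}, compared with {imm:#x}")
    print(f"RAX + disp == CR2: {info['rax'] + disp == info['cr2']}")
```

Register number 0 in the ModRM byte is RAX when there is no REX prefix, which matches the RAX value of zero in the register dump. The sketch does not replace running crash on the vmcore; it only confirms that the oops is a field read at offset 0xb1 through a NULL pointer during rm_init_adapter, triggered by nvidia-smi opening the device.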

centos7-peterchin8-info (9.7 KB)
peterchin8-2021-08-29-074011-nvidia-bug-report-log.gz (4.4 MB)
peterchin8-2021-08-29-074011-vmcore-dmesg.txt (631.4 KB)
centos8-ivcgpu9-info (15.0 KB)
ivcgpu9-2021-09-01-023324-nvidia-bug-report-log.gz (5.9 MB)
ivcgpu9-2021-09-01-023324-vmcore-dmesg.txt (149.6 KB)
ivcgpu9-2021-09-01-013324-vmcore-dmesg.txt (159.8 KB)