CentOS 8: Kernel oops when running nvidia-smi or nvidia-persistenced on Nvidia A2

Kernel oops looks like this:

[root@localhost ~]# nvidia-persistenced --verbose
[   27.162294] BUG: unable to handle kernel NULL pointer dereference at 0000000000000618
[   27.163702] PGD 0 P4D 0
[   27.164179] Oops: 0000 [#1] SMP NOPTI
[   27.164855] CPU: 1 PID: 1336 Comm: nvidia-persiste Kdump: loaded Tainted: P           OE    --------- -  - 4.18.0-383.el8.x86_64 #1
[   27.166986] Hardware name: 
[   27.168635] RIP: 0010:_nv021634rm+0x19/0x50 [nvidia]
[   27.170044] Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 80 bf 18 0a 00 00 00 75 1f 80 bf 19 0a 00 00 00 74 2e 48 8b 87 70 1b 00 00 <48> 8b 80 18 06 00 00 c3 0f 1f 80 00 00 00 00 8b 87 dc 08 00 00 48
[   27.173379] RSP: 0018:ffffa2c601383838 EFLAGS: 00010202
[   27.174331] RAX: 0000000000000000 RBX: ffff935958221008 RCX: ffffffffc27f74a0
[   27.175618] RDX: ffff935958630008 RSI: 00000000005d5d9f RDI: ffff935958630008
[   27.176907] RBP: ffff935944c72be0 R08: ffffffffc27f74a0 R09: ffff935944c72c90
[   27.178199] R10: ffff935958630008 R11: 0000000000000001 R12: ffff935958630008
[   27.179484] R13: ffff9359642495d0 R14: 0000000000000000 R15: ffff93596424bc10
[   27.180769] FS:  00007f3247c22840(0000) GS:ffff935a87d00000(0000) knlGS:0000000000000000
[   27.182234] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   27.183282] CR2: 0000000000000618 CR3: 0000000107ed4000 CR4: 0000000000350ee0
[   27.184567] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[   27.185859] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[   27.187145] Call Trace:
[   27.187608]  ? _nv037876rm+0x2e/0x280 [nvidia]
[   27.188685]  ? _nv002418rm+0x9/0x20 [nvidia]
[   27.189784]  ? _nv003684rm+0x1b/0x70 [nvidia]
[   27.190887]  ? _nv013893rm+0x784/0x7f0 [nvidia]
[   27.192011]  ? _nv034507rm+0xac/0xe0 [nvidia]
[   27.193057]  ? _nv035933rm+0xb0/0x140 [nvidia]
[   27.194169]  ? _nv035932rm+0x30f/0x4f0 [nvidia]
[   27.195300]  ? _nv034418rm+0xbe/0x140 [nvidia]
[   27.196353]  ? _nv034444rm+0x17b/0x180 [nvidia]
[   27.197437]  ? _nv021394rm+0xb7/0x330 [nvidia]
[   27.198680]  ? _nv002290rm+0x9/0x30 [nvidia]
[   27.199769]  ? _nv003684rm+0x1b/0x70 [nvidia]
[   27.200869]  ? _nv009924rm+0x6c/0x1a0 [nvidia]
[   27.201930]  ? _nv022004rm+0x132/0x1d0 [nvidia]
[   27.202979]  ? _nv000696rm+0x1cf/0x2f0 [nvidia]
[   27.204101]  ? _nv000643rm+0x49c/0x20b0 [nvidia]
[   27.205232]  ? rm_init_adapter+0xc5/0xe0 [nvidia]
[   27.206365]  ? nv_start_device+0x323/0x790 [nvidia]
[   27.207455]  ? nv_open_device+0x7b/0x160 [nvidia]
[   27.208509]  ? nvidia_open+0x210/0x570 [nvidia]
[   27.209540]  ? kobj_lookup+0xf1/0x160
[   27.210218]  ? nvidia_frontend_open+0x53/0x90 [nvidia]
[   27.211356]  ? chrdev_open+0xcb/0x1e0
[   27.212033]  ? cdev_default_release+0x20/0x20
[   27.212832]  ? do_dentry_open+0x132/0x340
[   27.213576]  ? path_openat+0x53e/0x14f0
[   27.214286]  ? filename_lookup.part.61+0xe0/0x170
[   27.215144]  ? do_filp_open+0x93/0x100
[   27.215836]  ? getname_flags+0x4a/0x1e0
[   27.216542]  ? __check_object_size+0xa8/0x16b
[   27.217351]  ? do_sys_open+0x184/0x220
[   27.218039]  ? do_syscall_64+0x5b/0x1a0
[   27.218748]  ? entry_SYSCALL_64_after_hwframe+0x65/0xca
[   27.219700] Modules linked in: nvidia_drm(POE) nvidia_modeset(POE) nvidia_uvm(OE) nvidia(POE) drm_kms_helper intel_rapl_msr intel_rapl_common syscopyarea sysfillrect sysimgblt fb_sys_fops nfit drm libnvdimm iTCO_wdt iTCO_vendor_support coretemp rapl lpc_ich pcspkr vfat fat ext4 mbcache jbd2 crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel virtio_blk sunrpc dm_mirror dm_region_hash dm_log dm_mod be2iscsi bnx2i cnic uio cxgb4i cxgb4 tls libcxgbi libcxgb qla4xxx iscsi_boot_sysfs iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ipmi_devintf ipmi_msghandler
[   27.228670] CR2: 0000000000000618

I can’t provide a bug report file because nvidia-bug-report.sh triggers the same kernel oops. The problem may well be specific to the system it’s on, but I can’t narrow down whether it’s related to ACPI, PCIe, or something else, or what is missing.
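For anyone reading the trace above, here is a minimal Python sketch that pulls the most telling fields out of an oops: the faulting function (RIP), the faulting address (CR2), and RAX. The helper name `summarize_oops` and its regexes are illustrative only and are not part of any NVIDIA or kernel tool. In this oops, the bytes at the fault marker in the Code line (`<48> 8b 80 18 06 00 00`) decode to a load from `[rax + 0x618]`, and RAX is 0, which matches CR2 = 0x618. In other words, the driver read a field at offset 0x618 through a NULL pointer during `rm_init_adapter`.

```python
import re

def summarize_oops(text):
    """Extract RIP symbol, CR2 and RAX from oops text (hypothetical helper)."""
    rip = re.search(r"RIP: [0-9a-f]+:(\S+)", text)
    cr2 = re.search(r"CR2: ([0-9a-f]{16})", text)
    rax = re.search(r"RAX: ([0-9a-f]{16})", text)
    return {
        "rip": rip.group(1) if rip else None,
        "cr2": int(cr2.group(1), 16) if cr2 else None,
        "rax": int(rax.group(1), 16) if rax else None,
    }

# Three lines copied from the oops in this thread.
sample = """\
[   27.168635] RIP: 0010:_nv021634rm+0x19/0x50 [nvidia]
[   27.174331] RAX: 0000000000000000 RBX: ffff935958221008 RCX: ffffffffc27f74a0
[   27.183282] CR2: 0000000000000618 CR3: 0000000107ed4000 CR4: 0000000000350ee0
"""

info = summarize_oops(sample)
# A small CR2 together with a zero base register suggests a NULL struct pointer.
print(info["rip"], hex(info["cr2"]), hex(info["rax"]))
```

This only tells you where the driver faulted, not why the pointer was never set up, so it does not replace a full dmesg.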

Please provide a dmesg output after boot.

@user161879
Please share a bug report, and let us know how frequently you are hitting this issue.
It would also be great to have reliable repro steps, which would help me reproduce the issue locally and debug it further.

Dmesg log is here: nvidia_dmesg.log (33.9 KB)

@amrits, I cannot provide a bug report: running the ‘nvidia-bug-report.sh’ tool causes the above kernel oops, and no output makes it to a log file.

You may not be able to reproduce locally at all, as I’m increasingly convinced this is an issue with the machine I’m trying to use the GPU with rather than the GPU/GPU driver itself. I have confirmed that the GPU works as expected in another computer. I just need information on what is missing/broken to take back to our vendor. The kernel oops is not very helpful in that respect.

Thank you for helping me with this, @generix / @amrits

Some oddities:

[ 0.255192] pci 0000:01:00.0: 0.000 Gb/s available PCIe bandwidth, limited by Unknown x0 link at 0000:00:01.0 (capable of 252.048 Gb/s with 16.0 GT/s PCIe x16 link)

[ 9.371942] pci 0000:00:01.0: can’t derive routing for PCI INT A
[ 9.373317] nvidia 0000:01:00.0: PCI INT A: not connected

Brand new cpu:

[ 0.146333] smpboot: CPU0: Intel(R) Xeon(R) D-1718T CPU @ 2.60GHz (family: 0x6, model: 0x6c, stepping: 0x1)

I suspect nothing is broken; the stock 4.18 kernel is just too old to get along with the new CPU and its PCIe subsystem.
Please check with a recent kernel version.
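The PCIe messages quoted above can be picked out of a saved dmesg log with a short script. This is a rough sketch: the pattern names and the `scan_dmesg` helper are made up for illustration, and the regexes are written only against the three message shapes seen in this thread, so other kernels may word them differently.

```python
import re

# Hypothetical patterns, modelled on the messages quoted in this thread.
PATTERNS = {
    "link_bandwidth": re.compile(r"pci (\S+): .*limited by (.+?) link at (\S+)"),
    # "can.t" accepts both a plain and a typographic apostrophe.
    "intx_routing": re.compile(r"pci (\S+): can.t derive routing for PCI INT ([A-D])"),
    "intx_not_connected": re.compile(r"(\S+): PCI INT ([A-D]): not connected"),
}

def scan_dmesg(lines):
    """Return (pattern_name, captured_groups) for every matching line."""
    hits = []
    for line in lines:
        for name, pattern in PATTERNS.items():
            match = pattern.search(line)
            if match:
                hits.append((name, match.groups()))
    return hits

sample = [
    "[ 0.255192] pci 0000:01:00.0: 0.000 Gb/s available PCIe bandwidth, "
    "limited by Unknown x0 link at 0000:00:01.0 "
    "(capable of 252.048 Gb/s with 16.0 GT/s PCIe x16 link)",
    "[ 9.371942] pci 0000:00:01.0: can't derive routing for PCI INT A",
    "[ 9.373317] nvidia 0000:01:00.0: PCI INT A: not connected",
]

hits = scan_dmesg(sample)
for name, groups in hits:
    print(name, groups)
```

To run it against a real log, something like `scan_dmesg(open("nvidia_dmesg.log"))` should work, assuming the log is plain text with one message per line.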

@generix , I have confirmed this works on the same CPU in a different machine with the same OS installation. The CUDA samples also ran, so I really don’t think it’s a kernel/driver issue as much as a system-level issue like PCIe/ACPI as you pointed out.

I can go to our vendor to fix things like PCI INTx routing, but I’d like to know that’s the issue before going to them.

Really appreciate you taking a look at this so quickly though!

The INT A issue doesn’t look good but might not be fatal. The main issue I guess is

Unknown x0 link

meaning the pcie slot isn’t really working. I don’t know though about the topology of the pcie bus, whether it’s a root bus or a chipset-driven one. Please post the output of
sudo lspci -t
and
sudo lspci -k
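Alongside the lspci output, the negotiated link state can also be read from sysfs. The sketch below assumes the kernel exposes the `current_link_speed`, `current_link_width`, `max_link_speed` and `max_link_width` attributes under `/sys/bus/pci/devices/<address>/`; the helper name `read_link_state` is made up for illustration, and the two addresses are the root port and GPU from the dmesg lines earlier in this thread.

```python
from pathlib import Path

LINK_ATTRS = (
    "current_link_speed",
    "current_link_width",
    "max_link_speed",
    "max_link_width",
)

def read_link_state(bdf, sysfs_root="/sys/bus/pci/devices"):
    """Read PCIe link attributes for one device (hypothetical helper).

    Attributes that are missing or unreadable are reported as None.
    """
    device_dir = Path(sysfs_root) / bdf
    state = {}
    for attr in LINK_ATTRS:
        try:
            state[attr] = (device_dir / attr).read_text().strip()
        except OSError:
            state[attr] = None
    return state

if __name__ == "__main__":
    # Root port and GPU addresses taken from the dmesg output in this thread.
    for bdf in ("0000:00:01.0", "0000:01:00.0"):
        print(bdf, read_link_state(bdf))
```

A current width or speed that is zero, unknown, or far below the maximum would point the same way as the "Unknown x0 link" message.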

Got a new firmware update from the vendor, and it’s resolved now. It turned out that PCIe enumeration was being done incorrectly, which broke things. Thanks for your help, @generix!

Dunno how to close/lock a post here, feel free to close/lock/delete as needed.