Nvidia driver (545.23.08) core dump on CentOS 8 Stream ppc64le

I'm looking for suggestions on troubleshooting an NVIDIA driver core dump on a recently installed POWER9 server running CentOS Stream 8 with a Tesla V100 GPU.

System configuration:

OS information

cat /etc/redhat-release

CentOS Stream release 8

uname -a

Linux hostname 4.18.0-536.el8.ppc64le #1 SMP Thu Jan 18 15:15:31 UTC 2024 ppc64le ppc64le ppc64le GNU/Linux

lspci

0000:00:00.0 PCI bridge: IBM POWER9 Host Bridge (PHB4)
0001:00:00.0 PCI bridge: IBM POWER9 Host Bridge (PHB4)
0002:00:00.0 PCI bridge: IBM POWER9 Host Bridge (PHB4)
0002:01:00.0 Serial Attached SCSI controller: Adaptec Series 8 12G SAS/PCIe 3 (rev 01)
0003:00:00.0 PCI bridge: IBM POWER9 Host Bridge (PHB4)
0003:01:00.0 USB controller: Texas Instruments TUSB73x0 SuperSpeed USB 3.0 xHCI Host Controller (rev 02)
0004:00:00.0 PCI bridge: IBM POWER9 Host Bridge (PHB4)
0004:01:00.0 Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme BCM5719 Gigabit Ethernet PCIe (rev 01)
0004:01:00.1 Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme BCM5719 Gigabit Ethernet PCIe (rev 01)
0005:00:00.0 PCI bridge: IBM POWER9 Host Bridge (PHB4)
0005:01:00.0 PCI bridge: ASPEED Technology, Inc. AST1150 PCI-to-PCI Bridge (rev 04)
0005:02:00.0 VGA compatible controller: ASPEED Technology, Inc. ASPEED Graphics Family (rev 41)
0030:00:00.0 PCI bridge: IBM POWER9 Host Bridge (PHB4)
0031:00:00.0 PCI bridge: IBM POWER9 Host Bridge (PHB4)
0033:00:00.0 PCI bridge: IBM POWER9 Host Bridge (PHB4)
0033:01:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 PCIe 32GB] (rev a1)

NVIDIA driver details:
[root@kilenc ~]# modinfo nvidia
filename: /lib/modules/4.18.0-536.el8.ppc64le/extra/nvidia.ko.xz
alias: char-major-195-*
version: 545.23.08
supported: external
license: NVIDIA
firmware: nvidia/545.23.08/gsp_tu10x.bin
firmware: nvidia/545.23.08/gsp_ga10x.bin
rhelversion: 8.10
srcversion: F35C34FEF96394FD21C4C5C
alias: pci:v000010DEdsvsdbc06sc80i00
alias: pci:v000010DEdsvsdbc03sc02i00
alias: pci:v000010DEdsvsdbc03sc00i00
depends: drm
name: nvidia
vermagic: 4.18.0-536.el8.ppc64le SMP mod_unload modversions mprofile-kernel relocatable
sig_id: PKCS#7
signer: DKMS module signing key
sig_key: 4F:D4:7F:C9:C2:DC:BC:18:88:EB:20:AA:B3:9E:80:62:23:D5:C0:2A
sig_hashalgo: sha256
signature: A8:F0:B0:66:B4:D2:14:86:AB:6E:51:7F:12:26:D6:CA:B6:55:6A:E3:
C0:D9:4C:52:A5:3F:03:B0:03:FB:40:AA:75:BE:84:2B:B9:86:E5:E8:
59:3A:D1:63:F6:1D:B2:ED:7C:DF:F4:CC:6D:F8:D4:77:57:3B:30:73:
BF:01:2F:8F:64:6B:02:31:7C:C3:77:BB:E0:B1:31:3E:CF:EE:C2:DF:
72:96:1A:89:E3:9C:84:CC:91:11:F1:87:73:3E:81:76:D2:9D:BE:B4:
39:EB:01:6A:32:62:0B:CC:39:15:2E:BA:56:5A:4C:54:D7:43:00:72:
B5:17:18:8B:10:08:8C:A2:78:42:F9:F2:E1:8E:C4:BF:61:EB:DF:E2:
4A:4F:4D:7B:FD:E7:23:11:BA:CA:4A:1D:54:EE:B9:E5:A0:4C:32:45:
D7:4E:85:45:E2:D8:18:04:68:45:22:AC:9A:BB:BB:D2:BA:A9:B1:E9:
26:64:92:E4:05:EE:71:C3:18:F4:53:94:98:94:18:10:B4:68:43:F2:
73:E7:06:DD:21:E2:70:1E:AB:2A:0B:DC:5F:36:F1:53:F0:0C:E8:3B:
35:7E:91:5C:85:20:0D:AE:E0:3C:F8:0D:4E:A0:45:9E:0C:D4:DF:19:
43:8E:B7:11:83:B0:F7:DC:37:D1:90:97:78:9E:2F:E9
parm: NvSwitchRegDwords:NvSwitch regkey (charp)
parm: NvSwitchBlacklist:NvSwitchBlacklist=uuid[,uuid...] (charp)
parm: NVreg_ResmanDebugLevel:int
parm: NVreg_RmLogonRC:int
parm: NVreg_ModifyDeviceFiles:int
parm: NVreg_DeviceFileUID:int
parm: NVreg_DeviceFileGID:int
parm: NVreg_DeviceFileMode:int
parm: NVreg_InitializeSystemMemoryAllocations:int
parm: NVreg_UsePageAttributeTable:int
parm: NVreg_EnablePCIeGen3:int
parm: NVreg_EnableMSI:int
parm: NVreg_TCEBypassMode:int
parm: NVreg_EnableStreamMemOPs:int
parm: NVreg_RestrictProfilingToAdminUsers:int
parm: NVreg_PreserveVideoMemoryAllocations:int
parm: NVreg_EnableS0ixPowerManagement:int
parm: NVreg_S0ixPowerManagementVideoMemoryThreshold:int
parm: NVreg_DynamicPowerManagement:int
parm: NVreg_DynamicPowerManagementVideoMemoryThreshold:int
parm: NVreg_EnableGpuFirmware:int
parm: NVreg_EnableGpuFirmwareLogs:int
parm: NVreg_OpenRmEnableUnsupportedGpus:int
parm: NVreg_EnableUserNUMAManagement:int
parm: NVreg_MemoryPoolSize:int
parm: NVreg_KMallocHeapMaxSize:int
parm: NVreg_VMallocHeapMaxSize:int
parm: NVreg_IgnoreMMIOCheck:int
parm: NVreg_NvLinkDisable:int
parm: NVreg_EnablePCIERelaxedOrderingMode:int
parm: NVreg_RegisterPCIDriver:int
parm: NVreg_EnableResizableBar:int
parm: NVreg_EnableDbgBreakpoint:int
parm: NVreg_RegistryDwords:charp
parm: NVreg_RegistryDwordsPerDevice:charp
parm: NVreg_RmMsg:charp
parm: NVreg_GpuBlacklist:charp
parm: NVreg_TemporaryFilePath:charp
parm: NVreg_ExcludedGpus:charp
parm: NVreg_DmaRemapPeerMmio:int
parm: NVreg_RmNvlinkBandwidth:charp
parm: rm_firmware_active:charp

Core dump of the NVIDIA driver on startup:
[ 28.351059] ------------[ cut here ]------------
[ 28.351093] unexpected DMA address compression (0x800200081e00000, 0x800080081e00000)
[ 28.351141] WARNING: CPU: 81 PID: 5213 at /var/lib/dkms/nvidia/545.23.08/build/nvidia/nv-dma.c:405 nv_dma_nvlink_addr_compress.isra.2+0xac/0x170 [nvidia]
[ 28.351543] Modules linked in: i2c_dev nvidia_drm(POE) nvidia_modeset(POE) nvidia_uvm(OE) nvidia(POE) zfs(POE) spl(OE) xts vmx_crypto ofpart ipmi_powernv ipmi_devintf ses powernv_flash ipmi_msghandler enclosure ibmpowernv mtd at24 scsi_transport_sas opal_prd uio_pdrv_genirq uio xfs libcrc32c sd_mod t10_pi sg ast drm_shmem_helper i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt drm tg3 aacraid drm_panel_orientation_quirks dm_mirror dm_region_hash dm_log dm_mod fuse
[ 28.351804] CPU: 81 PID: 5213 Comm: sbatchd Kdump: loaded Tainted: P OE -------- - - 4.18.0-536.el8.ppc64le #1
[ 28.351845] NIP: c00800000da0b3a4 LR: c00800000da0b3a0 CTR: 0000000000000000
[ 28.351893] REGS: c000000065a92910 TRAP: 0700 Tainted: P OE -------- - - (4.18.0-536.el8.ppc64le)
[ 28.351940] MSR: 900000000282b033 <SF,HV,VEC,VSX,EE,FP,ME,IR,DR,RI,LE> CR: 48004222 XER: 00000000
[ 28.352002] CFAR: c0000000001716a4 IRQMASK: 0
GPR00: c00800000da0b3a0 c000000065a92ba0 c00800000eb98b00 0000000000000049
GPR04: ffffffffffffffea c000000001cf5488 00000000ffff7fff 0000000000000027
GPR08: 0000000000000023 0000000000000001 0000000000000027 80000000ffff8000
GPR12: 0000000000004000 c0002003ff69d780 0000000000000001 00007fff850d1d48
GPR16: 0000000000000000 00007fff8445f3c0 00007fff8445f500 c000000065724008
GPR20: 0000000000000000 0000000000000000 c000200016d01000 c0080000111c2120
GPR24: c000200013be33b0 c000000065a92dc0 0000000000000001 0000000000000000
GPR28: 0000000000000001 0000000000000001 c000200013be9cc8 c000200016d01808
[ 28.352420] NIP [c00800000da0b3a4] nv_dma_nvlink_addr_compress.isra.2+0xac/0x170 [nvidia]
[ 28.352686] LR [c00800000da0b3a0] nv_dma_nvlink_addr_compress.isra.2+0xa8/0x170 [nvidia]
[ 28.353009] Call Trace:
[ 28.353025] [c000000065a92ba0] [c00800000da0b3a0] nv_dma_nvlink_addr_compress.isra.2+0xa8/0x170 [nvidia] (unreliable)
[ 28.353397] [c000000065a92c00] [c00800000da0c868] nv_dma_map_pages+0x2d0/0x370 [nvidia]
[ 28.353755] [c000000065a92cb0] [c00800000da0cdc0] nv_dma_map_alloc+0x1f8/0x390 [nvidia]
[ 28.354140] [c000000065a92d60] [c00800000e952430] _nv038949rm+0x2c0/0x640 [nvidia]
[ 28.354478] [c000000065a92e30] [c00800000da9e6e0] _nv031073rm+0x2f0/0x510 [nvidia]
[ 28.354787] [c000000065a92ee0] [c00800000e09e38c] _nv034566rm+0xdc/0x430 [nvidia]
[ 28.355320] [c000000065a92f90] [c00800000e09f298] _nv012386rm+0x738/0x950 [nvidia]
[ 28.355840] [c000000065a93140] [c00800000e09f7c0] _nv034522rm+0x310/0xb10 [nvidia]
[ 28.356351] [c000000065a931f0] [c00800000e589d30] _nv014828rm+0x200/0x8d0 [nvidia]
[ 28.356835] [c000000065a93350] [c00800000e5847d4] _nv014819rm+0xb4/0xe0 [nvidia]
[ 28.357324] [c000000065a93390] [c00800000e8637c4] _nv006395rm+0x34/0x60 [nvidia]
[ 28.357727] [c000000065a933c0] [c00800000e5ac3c8] _nv021861rm+0xb8/0x150 [nvidia]
[ 28.358210] [c000000065a93440] [c00800000e0b5a04] _nv025510rm+0x3b4/0xa40 [nvidia]
[ 28.358761] [c000000065a93520] [c00800000e0ed2b8] _nv025511rm+0x118/0x210 [nvidia]
[ 28.359336] [c000000065a935b0] [c00800000da80188] _nv025755rm+0xb8/0x120 [nvidia]
[ 28.359663] [c000000065a935f0] [c00800000e960454] _nv000657rm+0xff4/0x2170 [nvidia]
[ 28.359996] [c000000065a93780] [c00800000e955d64] rm_init_adapter+0x114/0x130 [nvidia]
[ 28.360333] [c000000065a93870] [c00800000da01244] nv_start_device+0x47c/0x8d0 [nvidia]
[ 28.360655] [c000000065a93920] [c00800000da0174c] nv_open_device+0xb4/0x1f0 [nvidia]
[ 28.360978] [c000000065a939a0] [c00800000da02160] nvidia_open+0x2f8/0x510 [nvidia]
[ 28.361341] [c000000065a93a50] [c00000000059d000] chrdev_open+0x180/0x3c0
[ 28.361356] [c000000065a93ac0] [c00000000058767c] do_dentry_open+0x27c/0x530
[ 28.361383] [c000000065a93b10] [c0000000005ad0b8] do_last+0x1c8/0xb60
[ 28.361394] [c000000065a93be0] [c0000000005b14b4] path_openat+0x124/0x410
[ 28.361416] [c000000065a93c70] [c0000000005b3970] do_filp_open+0x90/0x170
[ 28.361428] [c000000065a93da0] [c00000000058b258] sys_openat+0x288/0x3a0
[ 28.361474] [c000000065a93e20] [c00000000000b408] system_call+0x5c/0x70
[ 28.361513] Instruction dump:
[ 28.361546] 892a0000 2f890000 4c9e0020 7c0802a6 f8010010 f821ffa1 39200001 3c620000
[ 28.361572] e86385e8 992a0000 490962e5 e8410018 <0fe00000> 38210060 e8010010 7c0803a6
[ 28.361590] ---[ end trace 9060d677214aa191 ]---
[ 28.362450] NVRM: GPU 0033:01:00.0: RmInitAdapter failed! (0x23:0x65:1426)
[ 28.362575] NVRM: GPU 0033:01:00.0: rm_init_adapter failed, device minor number 0
[ 28.466310] NVRM: 0033:01:00.0: DMA address not in addressable range of device (0x800200085310000-0x80020008531ffff, 0x800000000000000-0x80000ffffffffff)
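
To make the numbers in the log easier to read, here is a minimal Python sketch that only does arithmetic on values copied from the messages above. It assumes the last NVRM line lists the requested DMA range first and the device-addressable range second; that is my reading of the message text, not something I have confirmed in NVIDIA's documentation. The helper names (in_range, set_bits) are made up for illustration, and the sketch says nothing about what the driver actually does with these bits.

```python
# Values copied verbatim from the "DMA address not in addressable range" line.
# Assumption: first pair = requested DMA range, second pair = device range.
DMA_START = 0x800200085310000
DMA_END = 0x80020008531ffff
DEV_MIN = 0x800000000000000
DEV_MAX = 0x80000ffffffffff

# The two addresses printed by the "unexpected DMA address compression" warning.
WARN_A = 0x800200081e00000
WARN_B = 0x800080081e00000


def in_range(start, end, lo, hi):
    """Return True if [start, end] lies entirely inside [lo, hi]."""
    return lo <= start and end <= hi


def set_bits(value):
    """Return the positions of the bits set in value, lowest first."""
    return [i for i in range(value.bit_length()) if (value >> i) & 1]


if __name__ == "__main__":
    # Size of the mapping that was rejected.
    print("mapping size:", hex(DMA_END - DMA_START + 1))
    # Does the requested range fit in the range the device can address?
    print("fits device range:", in_range(DMA_START, DMA_END, DEV_MIN, DEV_MAX))
    # Offset bits of the start address that lie above the device maximum.
    print("bits above device max:", set_bits(DMA_START & ~DEV_MAX))
    # Bit positions where the two addresses in the warning differ.
    print("differing bits in warning:", set_bits(WARN_A ^ WARN_B))
```

Run as a script, this reports a 0x10000-byte (64 KiB) mapping that does not fit the device range because bit 45 of the address is set, and it shows that the two addresses in the warning differ only in bits 43 and 45.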

Note that this system has a V100 PCIe card and does not support NVLink.

I also followed the POWER9-specific section of the CUDA installation guide for Linux.