Kernel panic while booting CentOS 8.3 with a Mellanox ConnectX-5 MCX516A-CDAT adapter

With the card installed, booting CentOS 8.3 ends in a kernel panic with the inbox mlx5 drivers. The same setup worked fine with CentOS 8.2.

Below are the logs captured during boot:

[ 106.843633] enic 0000:62:00.0: vNIC csum tx/rx yes/yes tso/lro yes/yes rss yes intr mode any type min timer 125 usec loopback tag 0x0000

[ 106.872044] enic 0000:62:00.0: vNIC resources avail: wq 1 rq 1 cq 2 intr 4

[ 106.888102] enic 0000:62:00.0: vNIC resources used: wq 1 rq 1 cq 2 intr 4 intr mode MSI-X

[Apr 15 15:44:08.788] [ 107.819601] mlx5_core 0000:d8:00.0: enabling device (0140 -> 0142)

[ 107.834260] mlx5_core 0000:d8:00.0: firmware version: 16.28.4000

[ 107.848257] mlx5_core 0000:d8:00.0: 126.016 Gb/s available PCIe bandwidth, limited by 8 GT/s x16 link at 0000:d7:00.0 (capable of 252.048 Gb/s with 16 GT/s x16 link)

[Apr 15 15:44:09.119] [ 108.150522] mlx5_core 0000:d8:00.0: Rate limit: 127 rates are supported, range: 0Mbps to 97656Mbps

[ 108.171066] mlx5_core 0000:d8:00.0: E-Switch: Total vports 10, per vport: max uc(1024) max mc(16384)

[ 108.206017] mlx5_core 0000:d8:00.0: Port module event: module 0, Cable unplugged

[ 108.224583] mlx5_core 0000:d8:00.0: mlx5_pcie_event:296:(pid 9): PCIe slot advertised sufficient power (27W).

[ 108.230655] BUG: unable to handle kernel NULL pointer dereference at 0000000000000400

[ 108.236391] mlx5_core 0000:d8:00.1: enabling device (0140 -> 0142)

[ 108.236626] mlx5_core 0000:d8:00.1: firmware version: 16.28.4000

[ 108.236675] mlx5_core 0000:d8:00.1: 126.016 Gb/s available PCIe bandwidth, limited by 8 GT/s x16 link at 0000:d7:00.0 (capable of 252.048 Gb/s with 16 GT/s x16 link)

[ 108.327923] PGD 0 P4D 0

[ 108.334269] Oops: 0000 [#1] SMP PTI

[ 108.342729] CPU: 9 PID: 1879 Comm: kworker/u32:2 Tainted: G ---------r-t - 4.18.0 #1

[ 108.363668] Hardware name: Cisco Systems Inc UCSC-C220-M5SX/UCSC-C220-M5SX, BIOS C220M5.4.1.3e.0.1210201720 12/10/2020

[ 108.388152] Workqueue: mlx5_hv_vhca mlx5_hv_vhca_invalidate_work [mlx5_core]

[ 108.404537] RIP: 0010:hv_read_config_block+0xc4/0x150

[ 108.416514] Code: 24 40 83 e2 1f c7 44 24 48 09 00 49 42 09 d0 ba 10 00 00 00 44 89 74 24 4c 89 44 24 50 48 8b 43 38 48 c7 44 24 38 c0 f3 ad b0 <48> 8b b8 00 04 00 00 44 89 64 24 54 e8 3b 6a cb 01 85 c0 74 1f 48

[ 108.459739] RSP: 0018:ffffa60283e5bd80 EFLAGS: 00010246

[ 108.472168] RAX: 0000000000000000 RBX: ffff8b02bffce038 RCX: ffffa60283e5bdb8

[ 108.488829] RDX: 0000000000000010 RSI: ffffa60283e5bdc8 RDI: ffffa60283e5bd88

[ 108.499392] mlx5_core 0000:d8:00.1: Rate limit: 127 rates are supported, range: 0Mbps to 97656Mbps

[ 108.505478] RBP: ffffa60283e5be24 R08: 0000000000000006 R09: 0000000000000001

[ 108.505479] R10: 8080808080808080 R11: 0000000000000010 R12: 0000000000000080

[ 108.505479] R13: ffff8afedf86a000 R14: 0000000000000000 R15: ffff8b02de88a600

[ 108.505481] FS: 0000000000000000(0000) GS:ffff8b02efa40000(0000) knlGS:0000000000000000

[ 108.505482] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033

[ 108.505483] CR2: 0000000000000400 CR3: 0000000507c28003 CR4: 00000000007606e0

[ 108.505484] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000

[ 108.505484] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400

[ 108.505485] PKRU: 55555554

[ 108.505486] Call Trace:

[ 108.505496] ? __switch_to_asm+0x41/0x70

[ 108.526217] mlx5_core 0000:d8:00.1: E-Switch: Total vports 10, per vport: max uc(1024) max mc(16384)

[ 108.542881] ? _hv_pcifront_read_config+0x140/0x140

[ 108.542941] mlx5_hv_config_common+0x53/0xf0 [mlx5_core]

[ 108.542980] mlx5_hv_vhca_control_agent_invalidate+0x44/0x130 [mlx5_core]

[ 108.563567] mlx5_core 0000:d8:00.1: Port module event: module 1, Cable unplugged

[ 108.576370] mlx5_hv_vhca_invalidate_work+0x53/0x80 [mlx5_core]

[ 108.595398] mlx5_core 0000:d8:00.1: mlx5_pcie_event:296:(pid 158): PCIe slot advertised sufficient power (27W).

[ 108.608795] process_one_work+0x1a7/0x360

[ 108.608797] worker_thread+0x30/0x390

[ 108.608799] ? create_worker+0x1a0/0x1a0

[ 108.608803] kthread+0x112/0x130

[ 108.608806] ? kthread_flush_work_fn+0x10/0x10

[ 108.845357] ret_from_fork+0x35/0x40

[ 108.854147] Modules linked in: mlx5_core(+) enic

[ 108.865233] Features: xt_u32 act_ct act_mpls

[ 108.875529] CR2: 0000000000000400

[ 108.883707] ---[ end trace 73ba7b4f8ad0c9f9 ]---

[ 108.900672] RIP: 0010:hv_read_config_block+0xc4/0x150

[ 108.912681] Code: 24 40 83 e2 1f c7 44 24 48 09 00 49 42 09 d0 ba 10 00 00 00 44 89 74 24 4c 89 44 24 50 48 8b 43 38 48 c7 44 24 38 c0 f3 ad b0 <48> 8b b8 00 04 00 00 44 89 64 24 54 e8 3b 6a cb 01 85 c0 74 1f 48

[ 108.955888] RSP: 0018:ffffa60283e5bd80 EFLAGS: 00010246

[ 108.968309] RAX: 0000000000000000 RBX: ffff8b02bffce038 RCX: ffffa60283e5bdb8

[ 108.984950] RDX: 0000000000000010 RSI: ffffa60283e5bdc8 RDI: ffffa60283e5bd88

[ 109.001566] RBP: ffffa60283e5be24 R08: 0000000000000006 R09: 0000000000000001

[Apr 15 15:44:10.064] [ 109.018167] R10: 8080808080808080 R11: 0000000000000010 R12: 0000000000000080

[ 109.034760] R13: ffff8afedf86a000 R14: 0000000000000000 R15: ffff8b02de88a600

[ 109.051335] FS: 0000000000000000(0000) GS:ffff8b02efa40000(0000) knlGS:0000000000000000

[ 109.070007] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033

[ 109.083492] CR2: 0000000000000400 CR3: 0000000507c28003 CR4: 00000000007606e0

[ 109.100050] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000

[ 109.116591] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400

[ 109.133130] PKRU: 55555554

[ 109.139871] Kernel panic - not syncing: Fatal exception

[Apr 15 15:44:11.257] [ 110.288463] Shutting down cpus with NMI

[ 110.368474] Kernel Offset: 0x2f200000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)

[ 110.398993] ---[ end Kernel panic - not syncing: Fatal exception ]---
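
Because the oops lines are interleaved with unrelated mlx5_core probe messages, it can help to pull out just the headline fields. The following is a minimal sketch in Python, not an official tool: the helper name summarize_oops and the choice of fields are assumptions made for illustration, and the sample lines are copied from the log above.

```python
import re

# Hypothetical helper (not part of any kernel or vendor tooling):
# collects the first occurrence of a few headline fields from an oops log.
PATTERNS = {
    "bug": r"BUG: (.+)",
    "workqueue": r"Workqueue: (.+)",
    "rip": r"RIP: \w+:(\S+)",
    "rax": r"RAX: ([0-9a-f]+)",
    "cr2": r"CR2: ([0-9a-f]+)",
}

def summarize_oops(log_text):
    summary = {}
    for line in log_text.splitlines():
        # Drop carriage returns and literal "^M" debris from console captures.
        line = line.rstrip("\r").replace("^M", "").strip()
        for key, pattern in PATTERNS.items():
            if key in summary:
                continue  # keep only the first occurrence of each field
            match = re.search(pattern, line)
            if match:
                summary[key] = match.group(1).strip()
    return summary

# Sample lines taken from the boot log in this thread.
sample = """\
[ 108.230655] BUG: unable to handle kernel NULL pointer dereference at 0000000000000400
[ 108.388152] Workqueue: mlx5_hv_vhca mlx5_hv_vhca_invalidate_work [mlx5_core]
[ 108.404537] RIP: 0010:hv_read_config_block+0xc4/0x150
[ 108.472168] RAX: 0000000000000000 RBX: ffff8b02bffce038 RCX: ffffa60283e5bdb8
[ 108.505483] CR2: 0000000000000400 CR3: 0000000507c28003 CR4: 00000000007606e0
"""

summary = summarize_oops(sample)
for key, value in summary.items():
    print(f"{key}: {value}")

# Offset of the faulting address (CR2) from the value held in RAX.
fault_offset = int(summary["cr2"], 16) - int(summary["rax"], 16)
print(f"fault offset from RAX: {fault_offset:#x}")
```

In this log RAX is zero and CR2 is 0x400, which is consistent with a read at offset 0x400 from a NULL pointer inside hv_read_config_block, reached from the mlx5_hv_vhca workqueue. That reading is an interpretation of the trace, not a confirmed root cause.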

Hello Madhusudhanan,

Thank you for posting your inquiry on the NVIDIA Networking Community.

Based on the information provided, we recommend upgrading the adapter firmware to the latest version, 16.30.1004. As you are using CentOS 8.3, which contains an upstream kernel, you may have a very recent kernel that has a compatibility issue with the adapter firmware.
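
To check whether a given host is still below the recommended firmware level, the version reported by mlx5_core at boot can be compared against 16.30.1004. The following is a minimal sketch in Python; the helper names parse_fw_version and needs_upgrade are hypothetical, and the only assumption about the input is the "firmware version: X.Y.Z" message format seen in the log above.

```python
import re

# Recommended version quoted in this reply; helper names are hypothetical.
RECOMMENDED_FW = "16.30.1004"

def parse_fw_version(dmesg_line):
    """Return the firmware version in an mlx5_core log line as a tuple of ints, or None."""
    match = re.search(r"firmware version: (\d+)\.(\d+)\.(\d+)", dmesg_line)
    if match is None:
        return None
    return tuple(int(part) for part in match.groups())

def needs_upgrade(dmesg_line, recommended=RECOMMENDED_FW):
    """True if the firmware in the log line is older than the recommended version."""
    current = parse_fw_version(dmesg_line)
    if current is None:
        raise ValueError("no firmware version found in line")
    target = tuple(int(part) for part in recommended.split("."))
    # Tuples compare element by element, so 16.9.x sorts before 16.10.x,
    # which a plain string comparison would get wrong.
    return current < target

# Line taken from the boot log in this thread.
line = "[ 107.834260] mlx5_core 0000:d8:00.0: firmware version: 16.28.4000"
print(parse_fw_version(line), needs_upgrade(line))
```

The same line can be obtained on a running system from the kernel log, for example by filtering dmesg output for mlx5_core.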

As you are running the inbox driver, support needs to be obtained through the OS vendor. The following link explains this support model: Upstream Releases/Inbox Drivers

Thank you and regards,

~NVIDIA Networking Technical Support