PSOD VMware ESXi 7.0.3 build-19482537 | Mellanox Technologies MT27710 Family [ConnectX-4 Lx] 25GB

Hi Team,

I have an issue where a VMware ESXi server encountered a PSOD. When I checked the logs and the backtrace, they point to resets on vmnic5, vmnic6, and vmnic7. These network interface cards (NICs) are Mellanox ConnectX-4 devices.

VMware ESXi version: VMware ESXi 7.0.3 build-19482537 | VMware ESXi 7.0 Update 3d
NIC: Mellanox Technologies MT27710 Family [ConnectX-4 Lx] 25GB
Firmware version: 14.32.2004
Driver version: 4.21.71.101

*** vmkernel snippet ***

2024-03-18T09:23:52.722Z cpu60:2098313)0x453920a1bd40:[0x420017104eaa]MCSLockWait@vmkernel#nover+0x153 stack: 0x0

2024-03-18T09:23:52.722Z cpu66:2098309)ALERT: NMI: 710: NMI IPI: RIPOFF(base):RBP:CS [0xecd0b(0x420017000000):0x4000:0xf48] (Src 0x1, CPU66)

2024-03-18T09:23:52.722Z cpu46:2098312)0x45392099bd60:[0x42001710538a]MCSLockIRQWork@vmkernel#nover+0x4f stack: 0xedff26700000000

2024-03-18T09:23:52.722Z cpu37:2098310)0x45392089bd60:[0x42001710538a]MCSLockIRQWork@vmkernel#nover+0x4f stack: 0xedff26700000000

2024-03-18T09:23:52.722Z cpu60:2098313)0x453920a1bd60:[0x42001710538a]MCSLockIRQWork@vmkernel#nover+0x4f stack: 0xedff26700000000

2024-03-18T09:23:52.722Z cpu60:2098313)0x453920a1bee0:[0x4200171ef580]Pkt_AllocWithFlags@vmkernel#nover+0xd stack: 0x19f

2024-03-18T09:23:52.722Z cpu37:2098310)0x45392089bf00:[0x4200172a3ada]vmk_PktAlloc@vmkernel#nover+0x1f stack: 0x17e00

2024-03-18T09:23:52.722Z cpu45:2097239)WARNING: Uplink: 21014: Queue 0 of device vmnic6 stuck, resetting the device

2024-03-18T09:23:57.722Z cpu69:2097239)WARNING: Uplink: 21014: Queue 0 of device vmnic6 stuck, resetting the device

2024-03-18T09:23:57.722Z cpu69:2097239)WARNING: Uplink: 21014: Queue 0 of device vmnic7 stuck, resetting the device

2024-03-18T09:23:57.722Z cpu40:2097624)StorageDevice: 7059: End path evaluation for device eui.726471b0213175dc9747d2b600000023

2024-03-18T09:23:57.985Z cpu68:2097412)<NMLX_INF> nmlx5_core: vmnic5: nmlx5_en_UplinkLinkStateSetOS - (nmlx5_core_en_uplink.c:4764) Changing link status from DOWN Half Duplex 0 to UP Full Duplex 25000

2024-03-18T09:23:57.985Z cpu68:2097412)<NMLX_INF> nmlx5_core: vmnic5: nmlx5_en_UpdatePhyData - (nmlx5_core_en_main.c:995) called

2024-03-18T09:23:58.001Z cpu68:2097412)<NMLX_INF> nmlx5_core: vmnic6: nmlx5_en_UplinkReset - (nmlx5_core_en_uplink.c:4053) Watchdog process reset

2024-03-18T09:23:58.001Z cpu68:2097412)<NMLX_INF> nmlx5_core: vmnic6: nmlx5_en_UplinkQuiesceIOLocked - (nmlx5_core_en_main.c:2061) called

2024-03-18T09:23:58.018Z cpu71:2097412)<NMLX_INF> nmlx5_core: vmnic6: nmlx5_en_UplinkLinkStateSetOS - (nmlx5_core_en_uplink.c:4764) Changing link status from UP Full Duplex 25000 to DOWN Half Duplex 0

2024-03-18T09:24:07.723Z cpu62:2097239)WARNING: Uplink: 21014: Queue 0 of device vmnic7 stuck, resetting the device

2024-03-18T09:24:33.722Z cpu57:6632012)WARNING: Heartbeat: 827: PCPU 60 didn't have a heartbeat for 49 seconds, timeout is 14, 3 IPIs sent; may be locked up.

2024-03-18T09:24:33.722Z cpu62:2814494)WARNING: Heartbeat: 827: PCPU 66 didn't have a heartbeat for 49 seconds, timeout is 14, 3 IPIs sent; may be locked up.

2024-03-18T09:24:33.722Z cpu34:8100378)WARNING: Heartbeat: 827: PCPU 37 didn't have a heartbeat for 49 seconds, timeout is 14, 3 IPIs sent; may be locked up.

2024-03-18T09:24:33.722Z cpu62:2814494)WARNING: Heartbeat: 849: PCPU 66 saved backtrace. Possible software error

2024-03-18T09:24:33.722Z cpu57:6632012)WARNING: Heartbeat: 849: PCPU 60 saved backtrace. Possible software error

*** Backtrace ***

2024-03-18T09:24:33.740Z cpu66:2098309)Panic: 589: Panic from another CPU (cpu 66, world 2098309): ip=0x4200170fbb91 randomOff=0x17000000:

NMI IPI: Panic requested by another PCPU. RIPOFF(base):RBP:CS [0x10484d(0x420017000000):0x420050800680:0xf48] (Src 0x1, CPU66)

2024-03-18T09:24:33.740Z cpu66:2098309)Panic: 767: Saved backtrace: pcpu 66 Heartbeat NMI

2024-03-18T09:24:33.740Z cpu66:2098309)pcpu 66 Heartbeat NMI: 0x45392081bd10:[0x42001710484c]MCSLockSpin@vmkernel#nover+0x41 stack: 0x4301b8303660, 0x420050800680, 0x43, 0x420017104eab, 0x0

2024-03-18T09:24:33.740Z cpu66:2098309)pcpu 66 Heartbeat NMI: 0x45392081bd40:[0x420017104eaa]MCSLockWait@vmkernel#nover+0x153 stack: 0x0, 0x42001710538b, 0x4301b8303640, 0x4301b8303660, 0xedff26b00000000

2024-03-18T09:24:33.740Z cpu66:2098309)pcpu 66 Heartbeat NMI: 0x45392081bee0:[0x4200171ef580]Pkt_AllocWithFlags@vmkernel#nover+0xd stack: 0x2b4, 0x4200172a3adb, 0x4311000a9740, 0x42001810b056, 0x17e00

2024-03-18T09:24:33.740Z cpu66:2098309)pcpu 66 Heartbeat NMI: 0x45392081bf00:[0x4200172a3ada]vmk_PktAlloc@vmkernel#nover+0x1f stack: 0x17e00, 0x0, 0x4311000a9740, 0x17e00, 0x4311000a9cd8

2024-03-18T09:24:33.740Z cpu66:2098309)pcpu 66 Heartbeat NMI: 0x45392081bf10:[0x42001810b055]nmlx5_en_PostRxWqes@(nmlx5_core)#+0xc2 stack: 0x4311000a9740, 0x17e00, 0x4311000a9cd8, 0x42001810ad4e, 0x1

2024-03-18T09:24:33.740Z cpu66:2098309)pcpu 66 Heartbeat NMI: 0x45392081bf40:[0x42001810ad4d]nmlx5_en_NetPollCB@(nmlx5_core)#+0x5a stack: 0x1, 0x0, 0x0, 0x4200172a6230, 0x0

2024-03-18T09:24:33.740Z cpu66:2098309)pcpu 66 Heartbeat NMI: 0x45392081bf70:[0x4200172a622f]NetPollWorldCallback@vmkernel#nover+0x190 stack: 0x37, 0x42001810acf4, 0x4200172a6222, 0x0, 0x0

2024-03-18T09:24:33.740Z cpu66:2098309)pcpu 66 Heartbeat NMI: 0x45392081bfe0:[0x4200173b290d]CpuSched_StartWorld@vmkernel#nover+0x86 stack: 0x0, 0x4200170c4b90, 0x0, 0x0, 0x0

2024-03-18T09:24:33.740Z cpu66:2098309)pcpu 66 Heartbeat NMI: 0x45392081c000:[0x4200170c4b8f]Debug_IsInitialized@vmkernel#nover+0xc stack: 0x0, 0x0, 0x0, 0x0, 0x0

2024-03-18T09:24:33.758Z cpu66:2098309)Panic: 727: Halting PCPU 66.
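
The pattern in the excerpts above (stuck-queue warnings per vmnic, then missed heartbeats on specific PCPUs) can be tallied from a saved vmkernel log with a short script. This is only a minimal sketch, not a VMware or NVIDIA tool: the function name `summarize_vmkernel` and both regular expressions are assumptions derived solely from the line formats quoted in this post, and other ESXi builds may word these messages differently.

```python
import re
from collections import Counter

# Both patterns are assumptions based only on the log lines quoted in this post.
STUCK_RE = re.compile(r"Queue (\d+) of device (vmnic\d+) stuck, resetting the device")
# "didn.t" tolerates either a straight or a typographic apostrophe.
HEARTBEAT_RE = re.compile(r"PCPU (\d+) didn.t have a heartbeat for (\d+) seconds")


def summarize_vmkernel(lines):
    """Return (resets per NIC, longest missed heartbeat in seconds per PCPU)."""
    resets = Counter()
    missed = {}
    for line in lines:
        stuck = STUCK_RE.search(line)
        if stuck:
            resets[stuck.group(2)] += 1
            continue
        beat = HEARTBEAT_RE.search(line)
        if beat:
            pcpu, seconds = int(beat.group(1)), int(beat.group(2))
            missed[pcpu] = max(seconds, missed.get(pcpu, 0))
    return resets, missed


if __name__ == "__main__":
    # Three lines copied from the snippet above, used as sample input.
    sample = [
        "2024-03-18T09:23:52.722Z cpu45:2097239)WARNING: Uplink: 21014: "
        "Queue 0 of device vmnic6 stuck, resetting the device",
        "2024-03-18T09:23:57.722Z cpu69:2097239)WARNING: Uplink: 21014: "
        "Queue 0 of device vmnic7 stuck, resetting the device",
        "2024-03-18T09:24:33.722Z cpu57:6632012)WARNING: Heartbeat: 827: "
        "PCPU 60 didn't have a heartbeat for 49 seconds, timeout is 14, "
        "3 IPIs sent; may be locked up.",
    ]
    resets, missed = summarize_vmkernel(sample)
    print("Resets per NIC:", dict(resets))
    print("Longest missed heartbeat per PCPU (s):", missed)
```

To run it against a full log, pass an open file instead of the sample list, for example `summarize_vmkernel(open("vmkernel.log"))`, where the file name is a placeholder for wherever the log was saved.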

I searched the known issues and bug fixes for the latest driver version, 4.22.73.1004, but could not find anything that matches my symptoms. Has anyone seen this before, and what would the solution be?

Any tips?

Thanks,
Hamza

Dear Hamza,

Thank you for writing to us!
This issue is a bit too complex to debug in a community thread.
I would advise you to open a case with the Enterprise Support team so they can debug it further and determine why you are experiencing this issue.

Thanks,
Ilan.