Hello,
One of our ESXi servers failed with the following errors in a vSAN RDMA-enabled environment:
vmkernel: cpu103:2098374)<NMLX_ERR> nmlx5_core: 0000:a1:00.1: Health: Miss counters detected
vmkernel: cpu89:10107206)<NMLX_ERR> nmlx5_core: vmnic5: nmlx5_en_EcnProtocolQuery - (nmlx5_core_en_ecn.c:247) nmlx5_QueryPortCongStatus failed: 195887328 protocol R_ROCE_RP
vmkernel: cpu89:10107206)<NMLX_ERR> nmlx5_core: vmnic5: nmlx5_en_EcnProtocolQuery - (nmlx5_core_en_ecn.c:270) done, status: Failure
vmkernel: cpu89:10107206)<NMLX_ERR> nmlx5_core: core: nmlx5_GetECNCap - (nmlx5_core_main.c:281) Fail to query NMLX5_CONG_PROTOCOL_R_ROCE_RP (Failure)
vmkernel: cpu86:10107235)<NMLX_ERR> nmlx5_core: vmnic5: nmlx5_en_EcnProtocolQuery - (nmlx5_core_en_ecn.c:247) nmlx5_QueryPortCongStatus failed: 195887328 protocol R_ROCE_RP
vmkernel: cpu86:10107235)<NMLX_ERR> nmlx5_core: vmnic5: nmlx5_en_EcnProtocolQuery - (nmlx5_core_en_ecn.c:270) done, status: Failure
vmkernel: cpu86:10107235)<NMLX_ERR> nmlx5_core: core: nmlx5_GetECNCap - (nmlx5_core_main.c:281) Fail to query NMLX5_CONG_PROTOCOL
Hypervisor: VMware ESXi, 8.0.2, 23305546
Adapter: Mellanox Technologies MT2894 Family [ConnectX-6 Lx]
Name: vmnic5
Location: PCI 0000:a1:00.1
Driver: nmlx5_core
esxcli software vib list | grep nmlx5
nmlx5-cc 4.23.0.66-2vmw.802.0.0.22380479 VMW VMwareCertified 2023-11-13 host
nmlx5-core 4.23.0.66-2vmw.802.0.0.22380479 VMW VMwareCertified 2023-11-13 host
nmlx5-rdma 4.23.0.66-2vmw.802.0.0.22380479 VMW VMwareCertified 2023-11-13 host
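In case it helps anyone compare against their own hosts, here is a rough Python sketch for tallying the <NMLX_ERR> lines per device. It only assumes the line layout seen in the log excerpt above. The function name count_nmlx_errors and the regular expression are my own and not part of any VMware or NVIDIA tool, and the sample lines are shortened copies of the ones in this post.

```python
import re
from collections import Counter

# Assumed layout, based only on the log lines quoted in this post:
#   vmkernel: cpuNN:NNNN)<NMLX_ERR> nmlx5_core: <device>: <message>
# <device> is e.g. "vmnic5", "core", or a PCI address like "0000:a1:00.1".
NMLX_ERR = re.compile(r"<NMLX_ERR>\s+nmlx5_core:\s+(?P<dev>\S+):\s+(?P<msg>.*)")


def count_nmlx_errors(lines):
    """Return a Counter mapping device name to number of NMLX_ERR lines."""
    counts = Counter()
    for line in lines:
        match = NMLX_ERR.search(line)
        if match:
            counts[match.group("dev")] += 1
    return counts


if __name__ == "__main__":
    # Shortened sample lines taken from the errors above.
    sample = [
        "vmkernel: cpu103:2098374)<NMLX_ERR> nmlx5_core: 0000:a1:00.1: Health: Miss counters detected",
        "vmkernel: cpu89:10107206)<NMLX_ERR> nmlx5_core: vmnic5: nmlx5_en_EcnProtocolQuery - (nmlx5_core_en_ecn.c:270) done, status: Failure",
        "vmkernel: cpu86:10107235)<NMLX_ERR> nmlx5_core: vmnic5: nmlx5_en_EcnProtocolQuery - (nmlx5_core_en_ecn.c:270) done, status: Failure",
        "some unrelated vmkernel line",
    ]
    for device, count in count_nmlx_errors(sample).items():
        print(device, count)
```

To run it against a real log, pass an open file object instead of the sample list, for example the host's vmkernel log (the exact path depends on the setup, so I have left it out).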
Any ideas on how to prevent this issue in the future?
Thanks