We currently have an issue in our VMware environment which is affecting multiple (10+) systems with “MCX4121A-ACAT” network adapters.
It appears to be a driver- or firmware-related problem with the Mellanox adapters: one of the physical NICs randomly disappears at boot time.
The hosts are running the following driver and firmware combination, which is certified according to VMware's HCL:
driver: 4.19.71.101 (some hosts are still running 4.19.71.1 but we see the issue regardless of driver version)
firmware: 14.29.1016
When we reboot an ESXi host with this configuration, the host often comes up with only 3 vmnics. The fourth vmnic is completely missing. We often need to reboot the host a few times for the missing vmnic to reappear.
esxcli network nic list (vmnic5 missing):
vmnic2 0000:3b:00.0 nmlx5_core Up Up 10000 Full 0c:42:a1:4a:2a:04 9000 Mellanox Technologies ConnectX-4 Lx EN NIC; 25GbE; dual-port SFP28; (MCX4121A-ACA)
vmnic3 0000:3b:00.1 nmlx5_core Up Up 10000 Full 0c:42:a1:4a:2a:05 1600 Mellanox Technologies ConnectX-4 Lx EN NIC; 25GbE; dual-port SFP28; (MCX4121A-ACA)
vmnic4 0000:af:00.0 nmlx5_core Up Up 10000 Full 0c:42:a1:4a:29:dc 9000 Mellanox Technologies ConnectX-4 Lx EN NIC; 25GbE; dual-port SFP28; (MCX4121A-ACA)
lspci (vmnic5 is still listed here):
0000:3b:00.0 Network controller Ethernet controller: Mellanox Technologies ConnectX-4 Lx EN NIC; 25GbE; dual-port SFP28; (MCX4121A-ACA) [vmnic2]
0000:3b:00.1 Network controller Ethernet controller: Mellanox Technologies ConnectX-4 Lx EN NIC; 25GbE; dual-port SFP28; (MCX4121A-ACA) [vmnic3]
0000:af:00.0 Network controller Ethernet controller: Mellanox Technologies ConnectX-4 Lx EN NIC; 25GbE; dual-port SFP28; (MCX4121A-ACA) [vmnic4]
0000:af:00.1 Network controller Ethernet controller: Mellanox Technologies ConnectX-4 Lx EN NIC; 25GbE; dual-port SFP28; (MCX4121A-ACA) [vmnic5]
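To make the comparison above repeatable across our 10+ hosts, here is a minimal sketch that diffs the two outputs by PCI address. It assumes the text of `lspci` and `esxcli network nic list` has already been captured into strings; the function names and the "Mellanox" filter are my own choices for illustration, not part of any VMware or Mellanox tool.

```python
import re

# PCI address in the form ESXi prints, e.g. 0000:af:00.1
PCI_RE = re.compile(r"\b[0-9a-fA-F]{4}:[0-9a-fA-F]{2}:[0-9a-fA-F]{2}\.[0-7]\b")


def pci_addresses(text, must_contain=""):
    """Return the set of PCI addresses found in text, one per line at most.

    Lines that do not contain `must_contain` are skipped (empty = keep all).
    """
    found = set()
    for line in text.splitlines():
        if must_contain and must_contain not in line:
            continue
        match = PCI_RE.search(line)
        if match:
            found.add(match.group(0).lower())
    return found


def missing_nics(lspci_text, nic_list_text, must_contain="Mellanox"):
    """PCI addresses visible on the bus (lspci) but absent from the NIC list."""
    on_bus = pci_addresses(lspci_text, must_contain)
    claimed = pci_addresses(nic_list_text)
    return sorted(on_bus - claimed)


if __name__ == "__main__":
    # Abbreviated copies of the outputs shown in this post.
    lspci_text = """\
0000:af:00.0 Network controller Ethernet controller: Mellanox Technologies ConnectX-4 Lx EN NIC [vmnic4]
0000:af:00.1 Network controller Ethernet controller: Mellanox Technologies ConnectX-4 Lx EN NIC [vmnic5]
"""
    nic_list_text = """\
vmnic4 0000:af:00.0 nmlx5_core Up Up 10000 Full 0c:42:a1:4a:29:dc 9000 Mellanox Technologies ConnectX-4 Lx EN NIC
"""
    print(missing_nics(lspci_text, nic_list_text))
```

An empty result means every Mellanox function on the bus was claimed as a vmnic; any address returned is a port that is enumerated on the PCI bus but was not brought up as a NIC.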
Mellanox tools (unable to open the device for the missing vmnic):
mlxfwmanager -d af:00.1 --query:
Status: Failed to open device
/opt/mellanox/bin/mst status -vv
PCI devices:
DEVICE_TYPE MST PCI RDMA NET NUMA
ConnectX4LX(rev:0) mt4117_pciconf2 3b:00.0
ConnectX4LX(rev:0) mt4117_pciconf1.1 3b:00.1
ConnectX4LX(rev:0) mt4117_pciconf3 af:00.0
We’ve contacted VMware support, and they said we need to contact the vendor because this is a firmware/driver issue (VMware Knowledge Base).
Troubleshooting we have already carried out:
- Updated drivers and firmware using different combinations → did not help
- Replaced the Mellanox NICs with Intel NICs → fixed the issue
- Booted a different OS distribution (Ubuntu) → fixed the issue
Does anyone here have any ideas on how we can troubleshoot this further? I feel we have exhausted all avenues available to us and now need some assistance identifying the root cause.