MCX4121A-ACAT PNIC disappearing - Affecting multiple systems running ESXi 7.0U1

We currently have an issue in our VMware environment which is affecting multiple (10+) systems with “MCX4121A-ACAT” network adapters.

It appears to be a driver/firmware-related problem with the Mellanox adapters: one of the physical NICs randomly disappears at boot time.

The hosts are running the following driver and firmware combinations, which are certified according to VMware’s HCL:

driver: 4.19.71.101 (some hosts are still running 4.19.71.1 but we see the issue regardless of driver version)

firmware: 14.29.1016

When we reboot an ESXi host with this configuration, the host often comes up with only 3 vmnics. The fourth vmnic is completely missing. We often need to reboot the host a few times before the missing vmnic reappears.

esxcli network nic list (vmnic5 missing):

vmnic2 0000:3b:00.0 nmlx5_core Up Up 10000 Full 0c:42:a1:4a:2a:04 9000 Mellanox Technologies ConnectX-4 Lx EN NIC; 25GbE; dual-port SFP28; (MCX4121A-ACA)

vmnic3 0000:3b:00.1 nmlx5_core Up Up 10000 Full 0c:42:a1:4a:2a:05 1600 Mellanox Technologies ConnectX-4 Lx EN NIC; 25GbE; dual-port SFP28; (MCX4121A-ACA)

vmnic4 0000:af:00.0 nmlx5_core Up Up 10000 Full 0c:42:a1:4a:29:dc 9000 Mellanox Technologies ConnectX-4 Lx EN NIC; 25GbE; dual-port SFP28; (MCX4121A-ACA)

lspci (vmnic5 displayed when running lspci):

0000:3b:00.0 Network controller Ethernet controller: Mellanox Technologies ConnectX-4 Lx EN NIC; 25GbE; dual-port SFP28; (MCX4121A-ACA) [vmnic2]

0000:3b:00.1 Network controller Ethernet controller: Mellanox Technologies ConnectX-4 Lx EN NIC; 25GbE; dual-port SFP28; (MCX4121A-ACA) [vmnic3]

0000:af:00.0 Network controller Ethernet controller: Mellanox Technologies ConnectX-4 Lx EN NIC; 25GbE; dual-port SFP28; (MCX4121A-ACA) [vmnic4]

0000:af:00.1 Network controller Ethernet controller: Mellanox Technologies ConnectX-4 Lx EN NIC; 25GbE; dual-port SFP28; (MCX4121A-ACA) [vmnic5]
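The comparison between the two outputs above can be scripted so that a host with a missing vmnic is flagged right after boot. The following is only a minimal sketch: it assumes the text of `lspci` and `esxcli network nic list` has already been captured into strings, the function names are hypothetical, and the sample data is abbreviated from the outputs above.

```python
import re

# Matches the "[vmnicN]" alias that ESXi's lspci appends to a NIC line.
LSPCI_ALIAS = re.compile(r"^(\S+)\s.*\[(vmnic\d+)\]\s*$")


def vmnics_from_lspci(text):
    """Map vmnic alias -> PCI address, taken from `lspci` output."""
    found = {}
    for line in text.splitlines():
        match = LSPCI_ALIAS.match(line.strip())
        if match:
            found[match.group(2)] = match.group(1)
    return found


def vmnics_from_nic_list(text):
    """Return the vmnic names listed by `esxcli network nic list`."""
    names = set()
    for line in text.splitlines():
        fields = line.split()
        if fields and re.fullmatch(r"vmnic\d+", fields[0]):
            names.add(fields[0])
    return names


def missing_vmnics(lspci_text, nic_list_text):
    """vmnics that have a PCI alias but were not registered as uplinks."""
    registered = vmnics_from_nic_list(nic_list_text)
    return {name: addr
            for name, addr in vmnics_from_lspci(lspci_text).items()
            if name not in registered}


# Sample data, abbreviated from the outputs in this post.
LSPCI_SAMPLE = """\
0000:3b:00.0 Network controller Ethernet controller: Mellanox Technologies ConnectX-4 Lx EN NIC [vmnic2]
0000:3b:00.1 Network controller Ethernet controller: Mellanox Technologies ConnectX-4 Lx EN NIC [vmnic3]
0000:af:00.0 Network controller Ethernet controller: Mellanox Technologies ConnectX-4 Lx EN NIC [vmnic4]
0000:af:00.1 Network controller Ethernet controller: Mellanox Technologies ConnectX-4 Lx EN NIC [vmnic5]
"""

NIC_LIST_SAMPLE = """\
vmnic2 0000:3b:00.0 nmlx5_core Up Up 10000 Full 0c:42:a1:4a:2a:04 9000 Mellanox Technologies ConnectX-4 Lx EN NIC
vmnic3 0000:3b:00.1 nmlx5_core Up Up 10000 Full 0c:42:a1:4a:2a:05 1600 Mellanox Technologies ConnectX-4 Lx EN NIC
vmnic4 0000:af:00.0 nmlx5_core Up Up 10000 Full 0c:42:a1:4a:29:dc 9000 Mellanox Technologies ConnectX-4 Lx EN NIC
"""

print(missing_vmnics(LSPCI_SAMPLE, NIC_LIST_SAMPLE))
# prints {'vmnic5': '0000:af:00.1'}
```

On a real host the two strings would come from running the commands themselves; how that output is collected is left out here on purpose.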

Mellanox tools (unable to open the device for the missing vmnic):

mlxfwmanager -d af:00.1 --query:

Status: Failed to open device

/opt/mellanox/bin/mst status -vv

PCI devices:


DEVICE_TYPE MST PCI RDMA NET NUMA

ConnectX4LX(rev:0) mt4117_pciconf2 3b:00.0

ConnectX4LX(rev:0) mt4117_pciconf1.1 3b:00.1

ConnectX4LX(rev:0) mt4117_pciconf3 af:00.0

We contacted VMware support, who told us to contact the vendor because this is a firmware/driver issue (https://kb.vmware.com/s/article/2150890).

Troubleshooting we have already carried out:

  • updated drivers and firmware using different combinations → did not help

  • replaced the Mellanox NICs with Intel NICs → fixed the issue

  • booted a different OS (Ubuntu) → fixed the issue

Does anyone here have any ideas on how we can troubleshoot this issue? I feel we have exhausted all avenues available to us, and we now need some assistance to identify the root cause.

Hello James,

Thank you for posting your inquiry on the NVIDIA Networking Community.

Based on the information provided, we recommend opening an NVIDIA Networking Support ticket (valid support contract required) so our engineers can assist you with this issue, as it may require extensive debugging. In most cases, the cause is a combination of the platform being used, the firmware, and the ESXi build.

You can open a support ticket by sending an email to networking-support@nvidia.com

Thank you and regards,

~NVIDIA Networking Technical Support

Hi,

I’ve tried opening a support case (00946733) via https://support.mellanox.com/, but I received an email saying additional information is required, and nobody will be assigned to the ticket until this info is provided.

The problem is, authorization is required to access the link in the email, and there is no way of logging in. So it is impossible for me to provide this information and my case is stuck in limbo.

I sent an email to support@mellanox.com a couple of weeks ago, to no avail.

I even tried chatting with support and got sent a password reset link, but this doesn’t exactly help if there is no way of entering that password in order to log in.

I will gladly work with you through the official support channel, but I need someone from NVIDIA Support to take ownership of my case (or at the very least help me provide the additional info that is required).

Getting back to the issue at hand, we’ve observed that if we disable the nmlx5_rdma driver, all our vmnics remain present. If we enable the nmlx5_rdma driver (default), then we see the random disappearance of vmnics at boot time.
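For anyone wanting to try the same workaround, here is a minimal sketch of how the module could be disabled. It assumes the standard `esxcli system module set` syntax; this post does not record the exact command that was used, and the helper names are hypothetical. By default the script only prints the command. The change should only take effect after the host is rebooted.

```python
import subprocess

MODULE = "nmlx5_rdma"  # the RDMA module named in this post


def module_set_command(module, enabled):
    """Build the esxcli argv that enables or disables a VMkernel module."""
    state = "true" if enabled else "false"
    return ["esxcli", "system", "module", "set",
            "--module=" + module, "--enabled=" + state]


def apply(cmd, dry_run=True):
    """Print the command; execute it only when dry_run is False."""
    print(" ".join(cmd))
    if not dry_run:
        # Must be run on the ESXi host itself; raises if esxcli fails.
        subprocess.run(cmd, check=True)


# Dry run: shows the command that would disable the module.
apply(module_set_command(MODULE, enabled=False))
# prints: esxcli system module set --module=nmlx5_rdma --enabled=false
```

Calling `module_set_command(MODULE, enabled=True)` builds the command that restores the default.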

Regards,

James.