We have HPE DL380 Gen10+ servers used as VMware ESXi hosts. This week I started updating one cluster. The first five hosts were done without firmware updates and all went well. Then I included the latest HPE firmware packages, which contain firmware 26.34.1002 for the MCX adapter in the PCI slot; the OCP adapter was already on the latest HPE version 26.34.1002. The driver is the latest from the HPE web page; I also tried the latest from VMware.
Firmware component        Host Version   Image Version
Nvidia Network Adapter    26.33.1048     26.34.1002
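(To cross-check the versions from within ESXi, I look at the running firmware and driver with the standard esxcli commands; adjust the vmnic name for your setup:)
# esxcli network nic get -n vmnic0        <- "Driver Info" shows driver name, driver version and firmware version
# esxcli software vib list | grep nmlx    <- installed nmlx5 driver VIBs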
Two dual-port adapters are used; only one port of each is connected to a Cisco ACI switch.
Mellanox MCX631102AS-ADAT Ethernet 10/25Gb 2-port SFP28 (PCI)
Mellanox MCX631432AS-ADAI Ethernet 10/25Gb 2-port SFP28 OCP3
OCP 3.0 Slot 10 Mellanox ConnectX-6 LX OCP3.0 A5 26.34.1002 Enabled
PCI-E Slot 1 Mellanox Network Adapter - B8:3F:D2:2D:A7:0A 26.34.1002 Enabled
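The vmnic0 and vmnic2 referenced further down are the two connected ports (one per adapter). In case the numbering differs on your hosts, the mapping of vmnics to the two adapters can be checked like this:
# esxcli network nic list                 <- vmnic name, PCI address, driver, MAC, description
# lspci | grep -i mellanox                <- PCI addresses of both ConnectX-6 Lx adapters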
After the update, link flapping started on both updated hosts. I contacted VMware and HPE as well as our network team. There is no clear answer so far; everyone points to the firmware update and/or the adapters. The problem is that HPE as a vendor is not very helpful with anything other than completely failed hardware.
What I tried:
- downgraded both adapters to the old fw 26.33.1048, no change
- installed the fw with only one adapter active and the other disabled in the BIOS (see HPE advisory: "HPE Network Adapters - Firmware Flashing For Certain HPE NVIDIA (Mellanox) Network Adapters May Fail When Platforms Are Configured With More Than One Network Adapter")
- cold rebooted the servers, no change
- reset the adapter config with mlxconfig reset
- toggled the link state with mlxlink, no change (the exact commands for these last two steps are shown below)
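(Roughly what I ran for the last two points; the device names are the ones from the mlxlink output below, and the reset / -a options are taken from the MFT documentation, so treat this as an example rather than a recipe:)
# /opt/mellanox/bin/mlxconfig -d mt4127_pciconf0 reset    <- reset NV configuration to factory defaults, takes effect after a reboot
# /opt/mellanox/bin/mlxlink -d mt4127_pciconf0 -a TG      <- toggle the physical port state (down/up)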
I’m out of ideas here.
Sometimes I see a Status Opcode 14 in mlxlink, but not always. At first I thought only the port of the PCI adapter had the issue, but after a few hours the flapping suddenly moved to the port of the OCP adapter. Both ports were never affected at the same time!
One thing that is still a mystery to me is the FEC mode. There is no way to configure it directly in ESXi, only with the mlxlink tool. The output below shows it is set to Firecode FEC; the network team told me that on the switch side it is set to "inherit", which is a kind of auto mode. I read a while ago that it should be RS-FEC, depending on the SFP. But whatever I try to set with mlxlink, I don't see any difference in the mlxlink output (see the attempt after the two outputs below). This may be totally unrelated, but FEC mode is something that I feel nobody in normal operations really takes care of (and there is no obvious way to do it in ESXi).
# /opt/mellanox/bin/mlxlink -d mt4127_pciconf0 --show_fec
Operational Info
----------------
State : Active
Physical state : ETH_AN_FSM_ENABLE
Speed : 25G
Width : 1x
FEC : Firecode FEC
Loopback Mode : No Loopback
Auto Negotiation : ON
Supported Info
--------------
Enabled Link Speed (Ext.) : 0x00000052 (25G,10G,1G)
Supported Cable Speed (Ext.) : 0x00000052 (25G,10G,1G)
Troubleshooting Info
--------------------
Status Opcode : 0
Group Opcode : N/A
Recommendation : No issue was observed
Tool Information
----------------
Firmware Version : 26.34.1002
amBER Version : 2.08
MFT Version : mft 4.22.1.11
FEC Capability Info
-------------------
FEC Capability 25G : 0x7 (No-FEC, Firecode_FEC, RS-FEC (528,514))
FEC Capability 10G : 0x1 (No-FEC)
# /opt/mellanox/bin/mlxlink -d mt4127_pciconf1 --show_fec
Operational Info
----------------
State : Physical LinkUp
Physical state : ETH_AN_FSM_ENABLE
Speed : N/A
Width : N/A
FEC : N/A
Loopback Mode : No Loopback
Auto Negotiation : ON
Supported Info
--------------
Enabled Link Speed (Ext.) : 0x00000052 (25G,10G,1G)
Supported Cable Speed (Ext.) : 0x00000052 (25G,10G,1G)
Troubleshooting Info
--------------------
Status Opcode : 14
Group Opcode : PHY FW
Recommendation : Remote faults detected
Tool Information
----------------
Firmware Version : 26.34.1002
amBER Version : 2.08
MFT Version : mft 4.22.1.11
FEC Capability Info
-------------------
FEC Capability 25G : 0x7 (No-FEC, Firecode_FEC, RS-FEC (528,514))
FEC Capability 10G : 0x1 (No-FEC)
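For completeness, this is roughly how I tried to force RS-FEC with mlxlink (the --fec and --fec_speed options are taken from the MFT documentation, so I may well be using them wrong, which is part of why I'm asking):
# /opt/mellanox/bin/mlxlink -d mt4127_pciconf0 --fec RS --fec_speed 25G
# /opt/mellanox/bin/mlxlink -d mt4127_pciconf0 -a TG       <- toggle the port so the setting gets renegotiated
# /opt/mellanox/bin/mlxlink -d mt4127_pciconf0 --show_fec  <- still reports Firecode FEC afterwards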
# esxcli network nic stats get -n vmnic0
NIC statistics for vmnic0
Packets received: 74462
Packets sent: 82059
Bytes received: 6566741
Bytes sent: 48312792
Receive packets dropped: 0
Transmit packets dropped: 0
Multicast packets received: 16572
Broadcast packets received: 48724
Multicast packets sent: 239
Broadcast packets sent: 2235
Total receive errors: 0
Receive length errors: 0
Receive over errors: 0
Receive CRC errors: 0
Receive frame errors: 0
Receive FIFO errors: 0
Receive missed errors: 0
Total transmit errors: 0
Transmit aborted errors: 0
Transmit carrier errors: 0
Transmit FIFO errors: 0
Transmit heartbeat errors: 0
Transmit window errors: 0
# esxcli network nic stats get -n vmnic2
NIC statistics for vmnic2
Packets received: 0
Packets sent: 1335
Bytes received: 0
Bytes sent: 160167
Receive packets dropped: 0
Transmit packets dropped: 0
Multicast packets received: 0
Broadcast packets received: 0
Multicast packets sent: 63
Broadcast packets sent: 655
Total receive errors: 0
Receive length errors: 0
Receive over errors: 0
Receive CRC errors: 0
Receive frame errors: 0
Receive FIFO errors: 0
Receive missed errors: 0
Total transmit errors: 35
Transmit aborted errors: 35
Transmit carrier errors: 0
Transmit FIFO errors: 0
Transmit heartbeat errors: 0
Transmit window errors: 0
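To see whether those transmit aborted errors keep climbing while the port flaps, I watch the counters with a simple loop in the ESXi shell (nothing special, just esxcli and busybox):
# while true; do date; esxcli network nic stats get -n vmnic2 | grep -iE 'errors|dropped'; sleep 30; done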
...
... First vmnic0
...
2023-02-10T18:57:19.715Z: [netCorrelator] 83502331us: [vob.net.vmnic.linkstate.down] vmnic vmnic0 linkstate down
...
... Then vmnic2
...
2023-02-10T19:44:04.385Z: [netCorrelator] 35172694us: [vob.net.vmnic.linkstate.down] vmnic vmnic2 linkstate down
2023-02-10T19:45:02.264Z: [netCorrelator] 91778300us: [vob.net.vmnic.linkstate.down] vmnic vmnic2 linkstate down
2023-02-10T19:45:58.970Z: [netCorrelator] 148483739us: [vob.net.vmnic.linkstate.down] vmnic vmnic2 linkstate down
2023-02-10T19:46:14.772Z: [netCorrelator] 164285571us: [vob.net.vmnic.linkstate.down] vmnic vmnic2 linkstate down
2023-02-10T19:46:46.376Z: [netCorrelator] 195888427us: [vob.net.vmnic.linkstate.down] vmnic vmnic2 linkstate down
2023-02-10T19:47:02.178Z: [netCorrelator] 211690401us: [vob.net.vmnic.linkstate.down] vmnic vmnic2 linkstate down
2023-02-10T19:47:33.831Z: [netCorrelator] 243343442us: [vob.net.vmnic.linkstate.down] vmnic vmnic2 linkstate down
...
...
2023-02-11T09:19:49.774Z: [netCorrelator] 2082595168us: [vob.net.vmnic.linkstate.down] vmnic vmnic2 linkstate down
2023-02-11T09:20:05.626Z: [netCorrelator] 2098446905us: [vob.net.vmnic.linkstate.down] vmnic vmnic2 linkstate down
2023-02-11T09:20:37.279Z: [netCorrelator] 2130099896us: [vob.net.vmnic.linkstate.down] vmnic vmnic2 linkstate down
2023-02-11T09:20:53.082Z: [netCorrelator] 2145901802us: [vob.net.vmnic.linkstate.down] vmnic vmnic2 linkstate down
2023-02-11T09:21:24.685Z: [netCorrelator] 2177504665us: [vob.net.vmnic.linkstate.down] vmnic vmnic2 linkstate down
2023-02-11T09:21:40.487Z: [netCorrelator] 2193306622us: [vob.net.vmnic.linkstate.down] vmnic vmnic2 linkstate down
2023-02-11T09:22:37.143Z: [netCorrelator] 2249961967us: [vob.net.vmnic.linkstate.down] vmnic vmnic2 linkstate down
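The link-state lines above are from /var/log/vobd.log; a quick way to count the flaps per vmnic (the awk field position assumes the exact log format shown here):
# grep 'vob.net.vmnic.linkstate.down' /var/log/vobd.log | awk '{print $6}' | sort | uniq -c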