ConnectX-5 Ex no longer negotiates 100GbE after CentOS upgrade

I updated several Dell C6420 clients, each with a Mellanox Technologies MT28800 Family [ConnectX-5 Ex] card, from CentOS 7.4.1708 to CentOS 7.9.2009. Many of them no longer negotiate 100GbE, and the link doesn't come up.

Some still do, with the same cable, connected to the same switch. More confusingly, booting back into the older kernel, or even booting the older CentOS 7 installer image, also does not bring up the link.

lspci -v | grep Connect

5e:00.0 Ethernet controller: Mellanox Technologies MT28800 Family [ConnectX-5 Ex]
5e:00.1 Ethernet controller: Mellanox Technologies MT28800 Family [ConnectX-5 Ex]

On a happy system, with CentOS 7.8 / 3.10.0-1127.el7.x86_64:

mlxlink -d /dev/mst/mt4121_pciconf0

Operational Info

State : Active
Physical state : LinkUp
Speed : 100GbE
Width : 4x
FEC : Standard RS-FEC - RS(528,514)
Loopback Mode : No Loopback
Auto Negotiation : ON

Supported Info

Enabled Link Speed : 0xf8f1f0d3 (100G,50G,40G,25G,10G,1G)
Supported Cable Speed : 0x2024a101 (100G,56G,50G,40G,25G,10G,1G)

Troubleshooting Info

Status Opcode : 0
Group Opcode : N/A
Recommendation : No issue was observed

Tool Information

Firmware Version : 16.32.2004
MFT Version : mft 4.22.1-11


Querying Cables …

Cable #1:

Cable name : mt4121_pciconf0_cable_0

No FW data to show
-------- Cable EEPROM --------
Identifier : QSFP28 (11h)
Technology : 850 nm VCSEL (00h)
Compliance : 100GBASE-SR4 or 25GBASE-SR
Wavelength : 850 nm
OUI : 0xac4afe
Vendor : DELL EMC
Serial number : CN04HG0017E4063
Part number : 14NV5
Revision : A1
Temperature [c] : 46 [-10…80]
Digital Diagnostic Monitoring : YES
Length [m] : 50 m

On another, identical system that was upgraded to CentOS 7.8:

Supported Info

Enabled Link Speed : 0x0801f0d3 (40G,25G,10G,1G)
Supported Cable Speed : 0x2024a101 (100G,56G,50G,40G,25G,10G,1G)

State: Polling
Troubleshooting info:
Status Opcode: 2
Group Opcode: PHY FW
Recommendation: Negotiation failure …
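
For anyone comparing the two "Enabled Link Speed" masks above, here is a minimal Python sketch that XORs them to show exactly which bits the failing host has lost. The function name `differing_bits` is made up for illustration, and the thread does not say which bit corresponds to which speed, so this only reports bit positions, not speed names.

```python
from typing import List


def differing_bits(mask_a: int, mask_b: int) -> List[int]:
    """Return the bit positions (0 = least significant) where the masks differ."""
    diff = mask_a ^ mask_b
    return [bit for bit in range(diff.bit_length()) if (diff >> bit) & 1]


# Values copied from the mlxlink output of the working and failing hosts.
good = 0xF8F1F0D3
bad = 0x0801F0D3

print(hex(good ^ bad))            # 0xf0f00000
print(differing_bits(good, bad))  # [20, 21, 22, 23, 28, 29, 30, 31]
print(hex(bad & ~good))           # 0x0: the failing mask enables nothing extra
```

The last line shows that the failing mask is a strict subset of the working one: eight bits were cleared and none were added, which fits 100G and 50G dropping out of the decoded list.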

Same mst cable info on both systems,
and both are connected to a Z9264F-ON OS Version:

mlxconfig reset did not resolve the issue …

So far 7 systems have failed after the upgrade, and I have many more left to upgrade, so any tips would be very much appreciated!

Have you tried toggling the link on the failed servers?

I've toggled the link, moved cables, swapped PHYs, run mlxconfig reset and mlxfwreset, power-cycled everything, and booted back into older OS versions …

I went through the full list of packages that would be included in the CentOS 7.8 update and applied the most relevant ones individually. I found that NetworkManager-1.18.4-3.el7.x86_64 broke 100GbE.
Simply removing the package restores functionality, but I still have no idea why it would change the negotiated / available speeds on these cards.
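
The one-package-at-a-time approach described above could be scripted roughly as follows. This is only a sketch under stated assumptions, not the procedure actually used in this thread: the helper names `first_breaking_package`, `yum_update`, and `operstate_is_up` are hypothetical, it assumes a yum-based system run as root, and it treats the kernel's `operstate` file in sysfs as the link check. A reboot or link toggle may be needed between steps, which this sketch does not handle.

```python
import subprocess
from pathlib import Path
from typing import Callable, Iterable, Optional


def first_breaking_package(
    packages: Iterable[str],
    apply_update: Callable[[str], None],
    link_is_up: Callable[[], bool],
) -> Optional[str]:
    """Apply updates one at a time and return the first package after
    which the link check fails, or None if the link survives them all."""
    for pkg in packages:
        apply_update(pkg)
        if not link_is_up():
            return pkg
    return None


def yum_update(pkg: str) -> None:
    # Assumption: yum is available (e.g. CentOS 7) and we have root privileges.
    subprocess.run(["yum", "-y", "update", pkg], check=True)


def operstate_is_up(iface: str, sysfs_root: str = "/sys/class/net") -> bool:
    # Linux exposes the interface state as text in <sysfs_root>/<iface>/operstate.
    try:
        state = Path(sysfs_root, iface, "operstate").read_text().strip()
    except OSError:
        return False
    return state == "up"
```

A call would look like `first_breaking_package(candidates, yum_update, lambda: operstate_is_up(iface))`, where `candidates` is your own list of package names and `iface` is the name of the ConnectX-5 port on your host.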

Great, thank you for the update and for sharing the solution for the issue.
Best regards,
Nvidia support