CentOS 7.8 2003 vs. mlx5_core vs. Port module event[error]: module 0, Cable error, Power budget exceeded and amber flashing LED

We have two new Dell servers: an R740 with a ConnectX-5 MT28800 dual-port adapter and an R640 with a ConnectX-4 MT27700 dual-port adapter. Each adapter uses one Dell Q28-100G-LR4 optic.

No matter what I do, I am unable to get link using a Corning OS2 cable.

Both adapters are set to Ethernet and all Dell firmware has been updated on the servers.

We are using neither the Mellanox nor the Dell drivers, only the inbox drivers in CentOS.

Every time I plug in the QSFP, this message appears in dmesg:

[581258.513322] mlx5_core 0000:3b:00.0: Port module event[error]: module 0, Cable error, Power budget exceeded
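When the same event shows up many times across both servers, it can help to pull the fields out of each line and group them by PCI device and module. The following is a minimal sketch in Python; the regular expression is modeled only on the single dmesg line quoted above, and the function name parse_port_module_event is hypothetical. Other kernel versions may word the message differently.

```python
import re
from typing import Optional

# Pattern modeled only on the one dmesg line quoted in this post;
# it is an assumption that other events follow the same layout.
EVENT_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+mlx5_core\s+(?P<pci>[0-9a-fA-F:.]+):\s+"
    r"Port module event(?:\[(?P<level>\w+)\])?:\s+module\s+(?P<module>\d+),\s+"
    r"(?P<detail>.+)$"
)


def parse_port_module_event(line: str) -> Optional[dict]:
    """Return the fields of a port module event line, or None if no match."""
    match = EVENT_RE.search(line.strip())
    if match is None:
        return None
    event = match.groupdict()
    event["ts"] = float(event["ts"])        # seconds since boot
    event["module"] = int(event["module"])  # module index on the adapter
    return event


sample = (
    "[581258.513322] mlx5_core 0000:3b:00.0: Port module event[error]: "
    "module 0, Cable error, Power budget exceeded"
)
event = parse_port_module_event(sample)
print(event)
```

Feeding the output of dmesg through this line by line would give a list of events that can be compared between the R740 and the R640.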

On the ConnectX-5 card the following parameters are set by default with the inbox drivers:

mstconfig -d 3b:00.0 q | grep "POWER"

DISABLE_SLOT_POWER_LIMITER True(1)

ADVANCED_POWER_SETTINGS True(1)

On the ConnectX-4 cards the settings are not present, and therefore not set.

lspci shows there is plenty of power available on the PCIe slot (75W). The QSFP requires 3.5W max, so it should have a lot of headroom.

We are running the following firmware on the adapters:

FW Version: 16.25.4062

FW Release Date: 5.6.2019

Part Number: 09FTMY_071C1T_Ax

Description: Mellanox ConnectX-5 Ex Dual Port 100 GbE QSFP Network Adapter

Product Version: 16.25.4062

Rom Info: type=UEFI version=14.18.19 cpu=AMD64

type=PXE version=3.5.701 cpu=AMD64

FW Version: 12.25.1020

FW Release Date: 30.4.2019

Part Number: 0068F2_0NNJ2M_Ax

Description: Mellanox ConnectX-4 Dual Port EDR PCIE Adapter LP

Product Version: 12.25.1020

Rom Info: type=PXE version=3.5.701 cpu=AMD64

The servers are currently connected back to back, and there is still no link. What could be causing the connection issue?

Since the error appears as soon as the QSFP modules are plugged into the NIC, could this be a PCIe power issue?

Can someone help with this?

As a follow-up: I have followed the Getting Started guides for Linux, and everything seems to be correct, but I still see the issue.

If I replace the Dell optics with Cisco / Finisar or SKYLANE optics, I am able to get link on the R740 servers (I have two of them), but the power budget error is still seen. The R640 with the ConnectX-4 does not work no matter what I change.

I have also tried different OS2 cables from Tyco (CommScope) and Corning. Both types work with all other 100G connections in our headend, but I simply can't get them working in these servers.

I expect this to be a firmware or driver issue. I did try installing CentOS with the supported drivers from both Dell and Mellanox, but the problem is the same, so I expect the issue to be something else. But what?

I managed to get my ConnectX-5 working, but I am still struggling with the ConnectX-4 cards. The issue with the ConnectX-5 was related to the Dell QSFP: it was transmitting on a different wavelength than our Cisco switch. Replacing the module with a Cisco module solved this issue.

Dell:

Encoding : 0x03 (NRZ)

Laser wavelength : 1310.000nm

Laser wavelength tolerance : 2.245nm

Cisco:

Encoding : 0x05 (64B/66B)

Laser wavelength : 1302.350nm

Laser wavelength tolerance : 1.030nm

This was identified using:

ethtool -m p2p1
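The comparison above can be repeated for other modules by parsing the "key : value" lines that ethtool -m prints. Below is a minimal sketch in Python that uses only the values quoted in this post. The helper names (parse_fields, wavelength_nm, windows_overlap) are hypothetical, and the overlap check only compares what each module reports in its EEPROM; it is not a measurement of the actual optical signal.

```python
import re

# Values copied from the ethtool -m output quoted in this post.
DELL = """
Encoding : 0x03 (NRZ)
Laser wavelength : 1310.000nm
Laser wavelength tolerance : 2.245nm
"""

CISCO = """
Encoding : 0x05 (64B/66B)
Laser wavelength : 1302.350nm
Laser wavelength tolerance : 1.030nm
"""

FIELD_RE = re.compile(r"^\s*(?P<key>[^:]+?)\s*:\s*(?P<value>.+?)\s*$")


def parse_fields(text):
    """Turn 'key : value' lines into a dict, skipping lines without a colon."""
    fields = {}
    for line in text.splitlines():
        match = FIELD_RE.match(line)
        if match:
            fields[match.group("key")] = match.group("value")
    return fields


def wavelength_nm(value):
    """Convert a string such as '1310.000nm' to a float in nanometres."""
    return float(re.match(r"([\d.]+)\s*nm", value).group(1))


def windows_overlap(a, b):
    """Check whether the reported wavelength +/- tolerance ranges intersect."""
    a_mid = wavelength_nm(a["Laser wavelength"])
    a_tol = wavelength_nm(a["Laser wavelength tolerance"])
    b_mid = wavelength_nm(b["Laser wavelength"])
    b_tol = wavelength_nm(b["Laser wavelength tolerance"])
    return abs(a_mid - b_mid) <= a_tol + b_tol


dell = parse_fields(DELL)
cisco = parse_fields(CISCO)
delta = wavelength_nm(dell["Laser wavelength"]) - wavelength_nm(cisco["Laser wavelength"])
print(f"Reported wavelength difference: {delta:.3f} nm")
print("Reported windows overlap:", windows_overlap(dell, cisco))
print("Encoding differs:", dell["Encoding"] != cisco["Encoding"])
```

With the values above, the two modules report different encodings, and the reported wavelength ranges do not intersect.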

We also had trouble finding the correct firmware for the NICs. As they are Dell branded, I would expect to find it on Dell's site, but the driver package did not contain the model we had installed.

I found new firmware here that got us to the latest firmware for the cards:

Hello Kim,

Thank you for posting your inquiry to the Mellanox Community.

Unfortunately, we are unable to provide support for Dell OEM devices - They have their own firmware and board revisions that we do not have the ability to troubleshoot.

You will have to open a case via Dell support in order to obtain guidance on this matter.

Best regards,

Mellanox Technical Support