Dual port Connectx5 both ports go down

I have five Debian Buster hosts with dual-port ConnectX-5 NICs running 100GE. They connect to two separate switches with L3 redundancy (no LACP etc.). Each port/interface on the NIC has an IP address and can send traffic to one of the two switches. Routing etc. is working as expected.

The issue I see is that when the connection to one switch goes down, both interfaces on the NIC go down. This can be triggered by pulling a cable/optic or doing a shutdown from the switch side. Oddly, this doesn’t happen 100% of the time: maybe 20% of the time things work as expected and the other port stays up. Bouncing the other port from the switch or host side gets it back.

This is a bit of a problem as the system is relying on the dual NIC ports for network redundancy.

I am seeing this with both the OFED and inbox drivers. I have also tried going one version back on the driver.

The hosts also have 25GE ConnectX-4 dual-port NICs connected to the same switches, and those are not showing this issue.

Any suggestions on how to resolve this?

Thanks!

Tom Rockwell

Michigan State University

More info: it seems to be more frequent that when the second port goes down, the first also goes down. Here I bounce the link to the first port from the switch side twice and then the second port from the switch side. Note that when the second port is bounced, both go down (event at 512 seconds).

dmesg output:

[ 300.187414] mlx5_core 0000:01:00.0 ens3f0: Link down

[ 364.342671] mlx5_core 0000:01:00.0 ens3f0: Link up

[ 383.325493] mlx5_core 0000:01:00.0 ens3f0: Link down

[ 398.979253] mlx5_core 0000:01:00.0 ens3f0: Link up

[ 512.779941] mlx5_core 0000:01:00.0 ens3f0: Link down

[ 512.780183] mlx5_core 0000:01:00.1 ens3f1: Link down

[ 558.648130] mlx5_core 0000:01:00.1 ens3f1: Link up
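
To spot this pattern in a longer log, the dmesg lines above can be scanned for link-down events on different interfaces that land within a short window of each other. This is a minimal Python sketch, assuming the mlx5_core line format shown above. The function names and the 0.5 second window are illustrative choices, not anything defined by the driver.

```python
import re
from typing import List, NamedTuple, Tuple

# Sample taken from the dmesg output above.
SAMPLE_DMESG = """\
[ 300.187414] mlx5_core 0000:01:00.0 ens3f0: Link down
[ 364.342671] mlx5_core 0000:01:00.0 ens3f0: Link up
[ 383.325493] mlx5_core 0000:01:00.0 ens3f0: Link down
[ 398.979253] mlx5_core 0000:01:00.0 ens3f0: Link up
[ 512.779941] mlx5_core 0000:01:00.0 ens3f0: Link down
[ 512.780183] mlx5_core 0000:01:00.1 ens3f1: Link down
[ 558.648130] mlx5_core 0000:01:00.1 ens3f1: Link up
"""

# Assumed line format: "[ <seconds>] mlx5_core <pci addr> <ifname>: Link up|down"
LINK_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+mlx5_core\s+\S+\s+(?P<ifname>\S+):\s+"
    r"Link\s+(?P<state>up|down)"
)


class LinkEvent(NamedTuple):
    ts: float      # seconds since boot, as printed by dmesg
    ifname: str    # e.g. "ens3f0"
    state: str     # "up" or "down"


def parse_link_events(text: str) -> List[LinkEvent]:
    """Extract mlx5_core link up/down events from dmesg text."""
    events = []
    for line in text.splitlines():
        m = LINK_RE.search(line)
        if m:
            events.append(
                LinkEvent(float(m.group("ts")), m.group("ifname"), m.group("state"))
            )
    return events


def simultaneous_downs(
    events: List[LinkEvent], window: float = 0.5
) -> List[Tuple[LinkEvent, LinkEvent]]:
    """Pair up link-down events on different interfaces within `window` seconds."""
    downs = sorted((e for e in events if e.state == "down"), key=lambda e: e.ts)
    pairs = []
    for i, first in enumerate(downs):
        for second in downs[i + 1:]:
            if second.ts - first.ts > window:
                break  # sorted by time, so later events are even further away
            if second.ifname != first.ifname:
                pairs.append((first, second))
    return pairs


if __name__ == "__main__":
    for first, second in simultaneous_downs(parse_link_events(SAMPLE_DMESG)):
        print(f"{first.ifname} and {second.ifname} went down together near {first.ts:.3f}s")
```

On a live host the same functions could be fed the output of `dmesg` instead of the sample string.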

I do see that the link_down_reason is different for the two ports. This is the port that was shut down from the switch side:

cat /sys/class/net/ens3f1/debug/link_down_reason

monitor_opcode: 0x2

status_message: Negotiation failure

This is the port that unexpectedly went down:

cat /sys/class/net/ens3f0/debug/link_down_reason

monitor_opcode: 0xe

status_message: Remote faults detected
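
The two readings above can also be collected and compared programmatically. This is a minimal sketch, assuming the file holds `key: value` lines exactly as shown. The sysfs path is the one used above and may not exist on every driver build; the helper names are hypothetical.

```python
from pathlib import Path
from typing import Dict, List

# Samples copied from the two outputs above.
SWITCH_SIDE_DOWN = "monitor_opcode: 0x2\nstatus_message: Negotiation failure\n"
UNEXPECTED_DOWN = "monitor_opcode: 0xe\nstatus_message: Remote faults detected\n"


def parse_link_down_reason(text: str) -> Dict[str, str]:
    """Split 'key: value' lines into a dict, ignoring lines without a colon."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def read_link_down_reason(ifname: str, sysfs_root: str = "/sys/class/net") -> Dict[str, str]:
    """Read the per-interface file used above (path assumed from the post)."""
    path = Path(sysfs_root) / ifname / "debug" / "link_down_reason"
    return parse_link_down_reason(path.read_text())


def opcode(fields: Dict[str, str]) -> int:
    """Return monitor_opcode as an integer (accepts the 0x prefix)."""
    return int(fields["monitor_opcode"], 0)


def differing_fields(a: Dict[str, str], b: Dict[str, str]) -> List[str]:
    """List the keys whose values differ between two readings."""
    return sorted(k for k in set(a) | set(b) if a.get(k) != b.get(k))


if __name__ == "__main__":
    f1 = parse_link_down_reason(SWITCH_SIDE_DOWN)
    f0 = parse_link_down_reason(UNEXPECTED_DOWN)
    for key in differing_fields(f1, f0):
        print(f"{key}: ens3f1={f1.get(key)!r} ens3f0={f0.get(key)!r}")
```

On the host itself, `read_link_down_reason("ens3f0")` and `read_link_down_reason("ens3f1")` would replace the sample strings.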

Driver and fw versions:

ethtool -i ens3f0

driver: mlx5_core

version: 4.5-1.0.1

firmware-version: 16.24.1000 (MT_0000000008)

expansion-rom-version:

bus-info: 0000:01:00.0

supports-statistics: yes

supports-test: yes

supports-eeprom-access: no

supports-register-dump: no

supports-priv-flags: yes

More testing… the NIC is a VPI version, MCX556A-ECAT. I added an EN version, MCX516A-CCAT, to the system and it behaves the same.

If I use one port from each NIC, there is no problem.

Hi Tom,

Since you have a support contract with us, I will be working with you via the support ticket and will send an email from the case. Working via the support ticket will be a much more efficient way to assist you.

Thanks,

Namrata.

Hi,

An update for people viewing this post. Mellanox support has reached out and has opened a case.

I have realized that the issue doesn’t happen with DAC cables. We have third party optics (not validated by Mellanox) and it may be that there is some interaction between the optics and the NIC.

Tom Rockwell

We obtained some Mellanox optics and the problem goes away when they are used.

I’ll update if the third party optic supplier is able to resolve the issue.

Tom Rockwell

I had a similar problem on both Windows 10 and CentOS 7.6. The problem is that the card overheats; monitor the temperature with mget_temp -d mt4119_pciconf0. On my system I saw 93 degrees, and on reaching 103 the card switched off. It often happens with optics because they add 4.5W to the total power; DAC cables are passive. I solved the issue by adding a PCI slot fan in front of the card and QSFP cages.

https://www.titan-cd.com/en/product/12V-DC-Adjustable-Dual-X-Houlder-with-Two-Fans-for-PCI-Slot-System-Cooler-DIY-Mounting-Ventilation-Cooling-Fan/TTC-SC07TZ_RB.html
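
The temperature check described above can be wrapped in a small script for repeated polling. This is a rough sketch, assuming `mget_temp` prints a single integer reading and that the reading is in degrees Celsius. The 93 and 103 values are only the readings reported in this post, used here as illustrative thresholds rather than vendor limits, and the function names are hypothetical.

```python
import subprocess

# Readings reported in this thread, used as illustrative thresholds only.
OBSERVED_HOT = 93        # temperature seen under load
OBSERVED_SHUTOFF = 103   # temperature at which the card switched off


def parse_temp(output: str) -> int:
    """Parse mget_temp output, assumed to be a bare integer on one line."""
    return int(output.strip().split()[0])


def read_temp(device: str = "mt4119_pciconf0") -> int:
    """Run 'mget_temp -d <device>' (command taken from the post) and parse it."""
    result = subprocess.run(
        ["mget_temp", "-d", device],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_temp(result.stdout)


def classify(temp: int, hot: int = OBSERVED_HOT, shutoff: int = OBSERVED_SHUTOFF) -> str:
    """Label a reading relative to the illustrative thresholds."""
    if temp >= shutoff:
        return "critical"
    if temp >= hot:
        return "warning"
    return "ok"


if __name__ == "__main__":
    temp = read_temp()
    print(f"NIC temperature: {temp} ({classify(temp)})")
```

Running this from cron or a loop while bouncing a port would show whether the link drops line up with high readings.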

Let me know if it solves the issue for you.

Best regards, Stelian