Unstable work with 2 cards 2 x MT27700 or 2xMT28800 with 100G Uplinks with random Link up/down events

hostboss · November 17, 2019, 7:44pm

Hello,

We have random Link Up / Down events on 2 x MT27700 cards connected to 2 x 100G uplinks, which are configured as 2 separate VLANs, each with its own gateway. Both are connected to a Juniper switch. Here is a sample of our netplan config on Ubuntu 18.04: https://pastebin.com/2P2ELS0N

dmesg

[79546.319000] mlx5_core 0000:04:00.1 eth0: Link down

[79551.219139] mlx5_core 0000:04:00.1 eth0: Link up

[109933.693829] mlx5_core 0000:04:00.1 eth0: Link down

[109938.643748] mlx5_core 0000:04:00.1 eth0: Link up
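
To quantify flaps like the ones in the dmesg excerpt above, the log lines can be paired up and the outage length computed. The following is only a minimal sketch: the function names (`parse_link_events`, `summarize_flaps`) and the regular expression are illustrative assumptions, and the pattern is written against the four sample lines shown here, not against every message format the mlx5_core driver may emit.

```python
import re

# Assumed pattern, modelled on the sample lines in this post, e.g.
# [79546.319000] mlx5_core 0000:04:00.1 eth0: Link down
LINK_RE = re.compile(
    r"\[\s*(\d+\.\d+)\]\s+mlx5_core\s+(\S+)\s+(\S+):\s+Link (up|down)"
)

SAMPLE = """\
[79546.319000] mlx5_core 0000:04:00.1 eth0: Link down
[79551.219139] mlx5_core 0000:04:00.1 eth0: Link up
[109933.693829] mlx5_core 0000:04:00.1 eth0: Link down
[109938.643748] mlx5_core 0000:04:00.1 eth0: Link up
"""


def parse_link_events(text):
    """Return (timestamp_seconds, interface, state) for each link message."""
    events = []
    for line in text.splitlines():
        match = LINK_RE.search(line)
        if match:
            events.append((float(match.group(1)), match.group(3), match.group(4)))
    return events


def summarize_flaps(events):
    """Pair each 'down' with the next 'up' on the same interface.

    Returns a list of (interface, down_timestamp, outage_seconds).
    """
    pending = {}  # interface -> timestamp of the unmatched 'down'
    flaps = []
    for timestamp, iface, state in events:
        if state == "down":
            pending[iface] = timestamp
        elif iface in pending:
            down_ts = pending.pop(iface)
            flaps.append((iface, down_ts, timestamp - down_ts))
    return flaps


if __name__ == "__main__":
    for iface, down_ts, outage in summarize_flaps(parse_link_events(SAMPLE)):
        print(f"{iface}: down at {down_ts:.3f}s, outage {outage:.3f}s")
```

On the sample above, both events are on eth0 (0000:04:00.1), each outage lasts roughly 4.9 seconds, and the two flaps are about 30,387 seconds (roughly 8.4 hours) apart.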

Each card is connected to its own NUMA node, and we use interface binding to route outgoing traffic through eth0 or eth1, but these link up/down events happen even when the server is IDLE.

Here is the list of fixes we tried that DIDN’T HELP:

We changed the NIC cards to 2 x MT28800
We tried Ubuntu 18.04, 19.04, 19.10, CentOS 8 with the latest MLNX_OFED_LINUX-4.7-1.0.0.1 x86_64
We tried the built-in kernel drivers 4.15, 5.0.x, 5.1.x, 5.2.x, 5.3.x
We tried to replace all optics
We tried to change ports on switch
We tried to replace ALL SERVER hardware (MB, CPU, RAM, Power Supply)
We tried to disable gro, lro, tso.

Server specs

CPU: Intel(R) Xeon(R) CPU E5-2690 v2 @ 3.00GHz

RAM: 256GB

Manufacturer: Supermicro

Product Name: X9DRi-LN4+/X9DR3-LN4+

Dual MT27700 2x100G (We also tried MT28800 2x100G, but it didn’t fix it)

Could you please tell us what else we could try to fix it?

P.S.: We also created Mellanox Case # 00703035, but we are not currently subscribed to your premium support service.

MvB · November 18, 2019, 8:43pm

Hello Pavel,

Many thanks for posting your inquiry on the Mellanox Community.

Based on the information provided, it looks like this issue will need additional debugging and troubleshooting which we normally do through our official support process. We see that an official support case is opened so we want to continue to provide support through that ticket.

In the meantime, please make sure the following is in place:

Latest f/w installed on the adapters
Cables used are based on the latest release notes (RN), which list the validated and tested cables:
- ConnectX-5 → http://www.mellanox.com/pdf/firmware/ConnectX5-FW-16_26_1040-release_notes.pdf
- ConnectX-4 → http://www.mellanox.com/pdf/firmware/ConnectX4-FW-12_26_1040-release_notes.pdf
Latest BIOS installed on the server
Watch for temperature messages before the link goes down
Simplify your configuration → instead of using netplan (which still contains a lot of bugs and issues), use /etc/network/interfaces (legacy ifupdown)
Clean install of the OS and watch for link flapping in idle state with no network traffic running
Through the support case, please provide network topology and switch vendor (Juniper), model, s/w running on the switch
Check switch logs and configuration for any abnormalities

Make sure this information is uploaded to the official support case so we can assist you further through that case.

Many thanks,

~Mellanox Technical Support

hostboss · November 19, 2019, 2:19am

Hello,

It seems our case was closed, as we don’t have an active support contract with you. We just rent a server with your hardware and ran into this issue; we are not resellers or buyers of your hardware.

So I will post all requested information here

Latest stable FW installed on both cards.
Fiber used: LC UPC to LC UPC Duplex OM3 Multimode LSZH 2.0mm Fiber Optic Patch Cable
Optics used:
- Xcvr 48 REV 01 740-061405 F190903015 QSFP28-100G-AOC
- Xcvr 50 REV 01 740-061405 F190903018 QSFP28-100G-AOC
Latest BIOS installed
No messages regarding temps in dmesg or anywhere
We tried ifupdown before, adding routes manually for the second NIC, like:
- ip route add 1.1.1.0/29 dev eth0 src 1.1.1.2 table 102
- ip route add default via 1.1.1.1 dev eth0 table 102
- ip rule add from 1.1.1.2/29 table 102
- ip rule add to 1.1.1.2/29 table 102
We reinstalled many times: Ubuntu 18.04, 19.04, 19.10, CentOS 8. We also tried upstream kernels 5.1.x, 5.2.x, 5.3.x, with your drivers or the built-in ones.
We don’t have any anomalies in the logs, only these messages. The datacenter also reported that they don’t have any issues on their side.
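
The four manual `ip route` / `ip rule` commands listed above follow a regular pattern, so they can be generated per interface instead of typed by hand. This is only a sketch: the helper name `policy_routing_commands` is made up for illustration, and it simply reproduces the command forms quoted in this post without judging whether they are the best policy-routing setup.

```python
import ipaddress


def policy_routing_commands(iface, address_with_prefix, gateway, table):
    """Build the four commands from the post for one interface.

    address_with_prefix is the interface address with its prefix length,
    for example "1.1.1.2/29".
    """
    ifaddr = ipaddress.ip_interface(address_with_prefix)
    network = ifaddr.network          # e.g. 1.1.1.0/29
    source = ifaddr.ip                # e.g. 1.1.1.2
    selector = ifaddr.with_prefixlen  # e.g. 1.1.1.2/29, as written in the post
    return [
        f"ip route add {network} dev {iface} src {source} table {table}",
        f"ip route add default via {gateway} dev {iface} table {table}",
        f"ip rule add from {selector} table {table}",
        f"ip rule add to {selector} table {table}",
    ]


if __name__ == "__main__":
    # Values taken from the post (placeholder addresses as given there).
    for command in policy_routing_commands("eth0", "1.1.1.2/29", "1.1.1.1", 102):
        print(command)
    # Hypothetical values for a second interface; the addresses and the
    # table number 103 are assumptions, not taken from the post.
    for command in policy_routing_commands("eth1", "2.2.2.2/29", "2.2.2.1", 103):
        print(command)
```

The script only prints the commands; running them would still require root privileges and routing table numbers that are valid on the target system.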

Network topology

Our Juniper ex 4650 ---------------- Provider Cisco Router
        |
   Your Server

Switch

RE-EX4650-48Y-8C

Software Version 18.4R1-S3.1

We can provide access to our server if you wish, or any further information you request.

Many Thanks
