Random link up/down events with 2 x MT27700 or 2 x MT28800 cards on 100G uplinks

Hello,

We see random link up/down events on 2 x MT27700 cards connected to 2 x 100G uplinks, which are configured as two separate VLANs, each with its own gateway. Both are connected to a Juniper switch. Here is a sample of our netplan config on Ubuntu 18.04: https://pastebin.com/2P2ELS0N

dmesg

[79546.319000] mlx5_core 0000:04:00.1 eth0: Link down

[79551.219139] mlx5_core 0000:04:00.1 eth0: Link up

[109933.693829] mlx5_core 0000:04:00.1 eth0: Link down

[109938.643748] mlx5_core 0000:04:00.1 eth0: Link up
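
To put numbers on the flaps, the dmesg lines above can be parsed for link transitions and paired into outages. This is a minimal sketch in Python, not something from the original post: the function names `parse_link_events` and `flap_durations` are hypothetical, and the regular expression assumes the exact `mlx5_core <pci> <iface>: Link up|down` format shown in the excerpt.

```python
import re

# Matches lines such as:
# [79546.319000] mlx5_core 0000:04:00.1 eth0: Link down
LINK_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+mlx5_core\s+(?P<pci>\S+)\s+"
    r"(?P<iface>\S+):\s+Link\s+(?P<state>up|down)"
)


def parse_link_events(lines):
    """Return (timestamp, pci, iface, state) for every link event found."""
    events = []
    for line in lines:
        m = LINK_RE.search(line)
        if m:
            events.append((float(m["ts"]), m["pci"], m["iface"], m["state"]))
    return events


def flap_durations(events):
    """Pair each 'down' with the next 'up' on the same port.

    Returns a list of (iface, down_timestamp, seconds_down).
    """
    pending = {}  # (pci, iface) -> timestamp of the unmatched 'down'
    flaps = []
    for ts, pci, iface, state in events:
        key = (pci, iface)
        if state == "down":
            pending[key] = ts
        elif key in pending:
            down_ts = pending.pop(key)
            flaps.append((iface, down_ts, round(ts - down_ts, 6)))
    return flaps


# The four lines quoted in the post.
SAMPLE = """\
[79546.319000] mlx5_core 0000:04:00.1 eth0: Link down
[79551.219139] mlx5_core 0000:04:00.1 eth0: Link up
[109933.693829] mlx5_core 0000:04:00.1 eth0: Link down
[109938.643748] mlx5_core 0000:04:00.1 eth0: Link up
"""

for iface, down_ts, secs in flap_durations(parse_link_events(SAMPLE.splitlines())):
    print(f"{iface}: down at {down_ts:.6f}s since boot for {secs:.3f}s")
```

On the four lines quoted above this reports two outages of roughly 4.9 seconds each, about 8.4 hours apart. Feeding it the full `dmesg` output would show whether the outage length and spacing stay that regular.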

Each card is attached to its own NUMA node, and we use interface binding to route outgoing traffic through eth0 or eth1, but these link up/down events happen even when the server is idle.

Here is the list of fixes we tried that DIDN’T HELP:

  1. We changed the NIC cards to 2 x MT28800

  2. We tried Ubuntu 18.04, 19.04, 19.10, and CentOS 8 with the latest MLNX_OFED_LINUX-4.7-1.0.0.1 x86_64

  3. We tried the built-in kernel drivers on 4.15, 5.0.x, 5.1.x, 5.2.x, 5.3.x

  4. We tried to replace all optics

  5. We tried to change ports on switch

  6. We tried to replace ALL SERVER hardware (MB, CPU, RAM, Power Supply)

  7. We tried to disable GRO, LRO, and TSO.

Server specs

CPU: Intel(R) Xeon(R) CPU E5-2690 v2 @ 3.00GHz

RAM: 256GB

Manufacturer: Supermicro

Product Name: X9DRi-LN4+/X9DR3-LN4+

Dual MT27700 2x100G (We also tried MT28800 2x100G, but it didn’t fix it)

Could you please tell us what else we could try to fix this?

P.S.: We also opened Mellanox Case # 00703035, but we are not currently subscribed to your premium support service.

Hello Pavel,

Many thanks for posting your inquiry on the Mellanox Community.

Based on the information provided, it looks like this issue will need additional debugging and troubleshooting which we normally do through our official support process. We see that an official support case is opened so we want to continue to provide support through that ticket.

In the meantime, please make sure the following is in place:

  • Latest f/w installed on the adapters
  • Cables used are among those in the latest RN, which lists the validated and tested cables
  • Latest BIOS installed on the server
  • Watch for temperature messages before the link goes down
  • Simplify your configuration → instead of using netplan (which still contains a lot of bugs and issues), use /etc/network/interfaces (legacy ifupdown)
  • Clean install of the OS and watch for link flapping in idle state with no network traffic running
  • Through the support case, please provide network topology and switch vendor (Juniper), model, s/w running on the switch
  • Check switch logs and configuration for any abnormalities
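
The idle-state check can be left running unattended. A minimal sketch, assuming a Linux host where sysfs exposes the cumulative `/sys/class/net/<iface>/carrier_changes` counter; the helper names `read_carrier_changes` and `watch_flaps` are hypothetical, and the `sysfs_root` and `sleep` parameters exist only so the loop can be exercised without real hardware.

```python
import time
from pathlib import Path


def read_carrier_changes(iface, sysfs_root="/sys/class/net"):
    """Return the kernel's cumulative count of carrier transitions for iface."""
    path = Path(sysfs_root) / iface / "carrier_changes"
    return int(path.read_text().strip())


def watch_flaps(ifaces, interval=5.0, rounds=None,
                sysfs_root="/sys/class/net", sleep=time.sleep):
    """Poll the counters and record every change.

    rounds=None polls until interrupted; a number stops after that many polls.
    Returns a list of (wall_clock_time, iface, transitions_since_last_poll).
    """
    last = {i: read_carrier_changes(i, sysfs_root) for i in ifaces}
    seen = []
    done = 0
    while rounds is None or done < rounds:
        sleep(interval)
        for iface in ifaces:
            current = read_carrier_changes(iface, sysfs_root)
            if current != last[iface]:
                stamp = time.strftime("%Y-%m-%d %H:%M:%S")
                delta = current - last[iface]
                seen.append((stamp, iface, delta))
                print(f"{stamp} {iface}: {delta} carrier transition(s)")
                last[iface] = current
        done += 1
    return seen
```

Calling `watch_flaps(["eth0", "eth1"])` with no `rounds` limit polls until interrupted. Comparing the logged wall-clock times with the switch logs may help show whether both ends see the link drop at the same moment.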

Make sure this information is uploaded to the official support case so we can assist you further through that case.

Many thanks,

~Mellanox Technical Support

Hello,

It seems our case was closed because we don’t have an active support contract with you. We are only renting a server that has your hardware and ran into this issue; we are not resellers or buyers of your hardware.

So I will post all the requested information here:

  • Latest stable FW installed on both cards.
  • Fiber used: LC UPC to LC UPC Duplex OM3 Multimode LSZH 2.0mm Fiber Optic Patch Cable
  • Optics used:
    • Xcvr 48 REV 01 740-061405 F190903015 QSFP28-100G-AOC
    • Xcvr 50 REV 01 740-061405 F190903018 QSFP28-100G-AOC
  • Latest BIOS installed
  • No messages regarding temperature in dmesg or anywhere else
  • We tried ifupdown before, adding the routes for the second NIC manually, like this:
    • ip route add 1.1.1.0/29 dev eth0 src 1.1.1.2 table 102
    • ip route add default via 1.1.1.1 dev eth0 table 102
    • ip rule add from 1.1.1.2/29 table 102
    • ip rule add to 1.1.1.2/29 table 102
  • We reinstalled many times with Ubuntu 18.04, 19.04, 19.10, and CentOS 8; we also tried upstream kernels 5.1.x, 5.2.x, 5.3.x, with your drivers or the built-in ones.
  • We don’t see any anomalies in the logs, only these link messages. The datacenter also reported that they don’t see any issues on their side.

Network topology

Our Juniper EX4650 ---------------- Provider Cisco Router
        |
        |
   Your Server

Switch

RE-EX4650-48Y-8C

Software Version 18.4R1-S3.1

We can provide access to our server if you wish, or any further information you request.

Many Thanks