We are seeing random link up/down events on 2 x MT27700 cards connected to 2 x 100G uplinks, which are configured as 2 separate VLANs with their own gateways; both are connected to a Juniper switch. Here is a sample of our netplan config on Ubuntu 18.04: https://pastebin.com/2P2ELS0N
[79546.319000] mlx5_core 0000:04:00.1 eth0: Link down
[79551.219139] mlx5_core 0000:04:00.1 eth0: Link up
[109933.693829] mlx5_core 0000:04:00.1 eth0: Link down
[109938.643748] mlx5_core 0000:04:00.1 eth0: Link up
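For anyone comparing notes, the kernel messages above can be turned into per-flap outage durations with a few lines of Python. This is only a rough sketch: the regular expression assumes the exact `mlx5_core <pci> <iface>: Link up/down` layout shown above, and the function name `flap_durations` is made up for illustration.

```python
import re

# Matches kernel log lines such as:
# [79546.319000] mlx5_core 0000:04:00.1 eth0: Link down
LINK_RE = re.compile(
    r"\[\s*(\d+\.\d+)\]\s+mlx5_core\s+\S+\s+(\S+):\s+Link (up|down)"
)

def flap_durations(lines):
    """Return (iface, down_timestamp, seconds_down) for each down/up pair."""
    down_at = {}  # iface -> timestamp of the last unmatched "Link down"
    flaps = []
    for line in lines:
        m = LINK_RE.search(line)
        if not m:
            continue  # not a link state message
        ts = float(m.group(1))
        iface = m.group(2)
        state = m.group(3)
        if state == "down":
            down_at[iface] = ts
        elif iface in down_at:
            start = down_at.pop(iface)
            flaps.append((iface, start, round(ts - start, 6)))
    return flaps

# The four lines quoted in this post.
SAMPLE = """\
[79546.319000] mlx5_core 0000:04:00.1 eth0: Link down
[79551.219139] mlx5_core 0000:04:00.1 eth0: Link up
[109933.693829] mlx5_core 0000:04:00.1 eth0: Link down
[109938.643748] mlx5_core 0000:04:00.1 eth0: Link up
"""

for iface, start, secs in flap_durations(SAMPLE.splitlines()):
    print(f"{iface}: down at {start:.6f}s for {secs:.3f}s")
```

On the sample, both outages come out at just under 5 seconds, roughly 8.4 hours apart.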
Each card is attached to its own NUMA node, and we use interface binding to route outgoing traffic through eth0 or eth1, but these link up/down events happen even when the server is idle.
Here is the list of fixes we tried that DIDN’T HELP:
We replaced the NICs with 2 x MT28800
We tried Ubuntu 18.04, 19.04, 19.10, and CentOS 8 with the latest MLNX_OFED_LINUX-4.7-220.127.116.11 x86_64
We tried the in-kernel drivers with kernels 4.15, 5.0.x, 5.1.x, 5.2.x, 5.3.x
We tried replacing all optics
We tried changing ports on the switch
We tried replacing ALL SERVER hardware (MB, CPU, RAM, power supply)
We tried disabling GRO, LRO, and TSO.
CPU: Intel(R) Xeon(R) CPU E5-2690 v2 @ 3.00GHz
Product Name: X9DRi-LN4+/X9DR3-LN4+
Dual MT27700 2x100G (We also tried MT28800 2x100G, but it didn’t fix it)
Could you please tell us what else we could try to fix this?
P.S.: We also opened Mellanox case # 00703035, but we do not currently have a subscription to your premium support service.
Many thanks for posting your inquiry on the Mellanox Community.
Based on the information provided, it looks like this issue will need additional debugging and troubleshooting which we normally do through our official support process. We see that an official support case is opened so we want to continue to provide support through that ticket.
In the meantime, please make sure the following is in place:
- Latest f/w installed on the adapters
- Cables used are among those listed as validated and tested in the latest release notes (RN)
- Latest BIOS installed on the server
- Watch for temperature messages before the link goes down
- Simplify your configuration → instead of using netplan (which still contains a lot of bugs and issues), use /etc/network/interfaces (legacy ifupdown)
- Clean install of the OS and watch for link flapping in idle state with no network traffic running
- Through the support case, please provide network topology and switch vendor (Juniper), model, s/w running on the switch
- Check switch logs and configuration for any abnormalities
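As a rough way to act on the "watch for messages before the link goes down" item, the following Python sketch collects whatever the kernel logged in the seconds leading up to each `Link down` line. It is an illustration only: the function name `context_before_down` and the 30-second window are arbitrary choices, and the first line of the sample is a made-up placeholder, not a real log message.

```python
import re

# Splits a dmesg-style line into (timestamp, message).
TS_RE = re.compile(r"^\[\s*(\d+\.\d+)\]\s+(.*)$")

def context_before_down(lines, window=30.0):
    """For each 'Link down' line, return the lines logged in the
    preceding `window` seconds as ((ts, msg), [(ts, msg), ...])."""
    parsed = []
    for line in lines:
        m = TS_RE.match(line.strip())
        if m:
            parsed.append((float(m.group(1)), m.group(2)))
    result = []
    for i, (ts, msg) in enumerate(parsed):
        if msg.endswith("Link down"):
            before = [(t, s) for t, s in parsed[:i] if ts - t <= window]
            result.append(((ts, msg), before))
    return result

# First line is a hypothetical placeholder; the rest are from the post above.
SAMPLE = """\
[79530.000000] placeholder: hypothetical earlier message
[79546.319000] mlx5_core 0000:04:00.1 eth0: Link down
[79551.219139] mlx5_core 0000:04:00.1 eth0: Link up
[109933.693829] mlx5_core 0000:04:00.1 eth0: Link down
"""

for (ts, msg), before in context_before_down(SAMPLE.splitlines()):
    print(f"{ts:.6f} {msg}")
    for t, s in before:
        print(f"    {ts - t:6.1f}s earlier: {s}")
```

Feeding it the saved output of `dmesg` (one line per list element) shows at a glance whether anything, such as a thermal or firmware message, consistently precedes the flap.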
Make sure this information is uploaded to the official support case so we can assist you further through that case.
~Mellanox Technical Support
It seems our case was closed because we don’t have an active support contract with you. We only rent a server with your hardware and ran into this issue; we are not resellers or buyers of your hardware.
So I will post all the requested information here:
- Latest stable FW installed on both cards.
- Fiber used: LC UPC to LC UPC Duplex OM3 Multimode LSZH 2.0mm Fiber Optic Patch Cable
- Optics used:
- Xcvr 48 REV 01 740-061405 F190903015 QSFP28-100G-AOC
- Xcvr 50 REV 01 740-061405 F190903018 QSFP28-100G-AOC
- Latest BIOS installed
- No temperature-related messages in dmesg or anywhere else
- We previously tried ifupdown, adding the routes for the second NIC manually, like this:
- ip route add 18.104.22.168/29 dev eth0 src 22.214.171.124 table 102
- ip route add default via 126.96.36.199 dev eth0 table 102
- ip rule add from 188.8.131.52/29 table 102
- ip rule add to 184.108.40.206/29 table 102
- We reinstalled many times: Ubuntu 18.04, 19.04, 19.10, and CentOS 8. We also tried upstream kernels 5.1.x, 5.2.x, 5.3.x, with your drivers or the built-in ones.
- We don’t see any anomalies in the logs, only these messages. The datacenter also reported that they don’t see any issues on their side.
Topology: our Juniper EX4650 ---------------- provider Cisco router
Switch software version: 18.4R1-S3.1
We can provide access to our server if you wish, or provide any further info you request.