Teaming (or bonding) ports on ConnectX-4 with MLAG.

Hello, Mellanox Community.

We have two Mellanox SN2100 switches running Cumulus Linux. On those switches we configured Multi-Chassis Link Aggregation (MLAG).

The dual-connected devices (servers or switches) must use LACP (IEEE 802.3ad).

Every machine in our network has two Mellanox ConnectX-4 ports and is connected to both switches.

The network is redundant: it keeps working even if we turn off one switch.

For example, when we configured teaming (2×25 Gbps) on Windows Server 2012 R2, we measured (with iperf) up to 50 Gbps outgoing throughput, but on CentOS 7.3 we do not get double the outgoing speed.

On CentOS we get double the inbound speed, but the outbound speed is limited to that of a single link.

We have tried all of the bond and team types, balancing algorithms, etc.

What are we doing wrong?

Hello, Eddie. We did it!

We tuned our network card (on PCIe3.0 x8), and it now delivers up to 54.4 Gbps. So those calculations are a good estimate, but not exact:

[SUM] 0.0-10.0 sec 63.3 GBytes 54.4 Gbits/sec

P.S.: We used teaming, not bonding.

In any case, thanks!

Thanks for helping!

Hi Nikita,

The calculation is:

PCI_LANES(8)*PCI_SPEED(8)*PCI_ENCODING(64/66)*PCI_HEADERS(128/152)*PCI_FLOW_CONT(0.95) = ~49.6 G
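For reference, that estimate can be checked numerically. The one-liner below is just a sketch that reproduces the formula exactly as given above (8 lanes at 8 GT/s, scaled by the encoding, header, and flow-control overhead factors):

```shell
# Effective PCIe x8 bandwidth estimate, using the factors from the formula above
awk 'BEGIN { printf "%.1f Gbps\n", 8 * 8 * (64/66) * (128/152) * 0.95 }'
# prints "49.6 Gbps"
```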

Hi,

Can you please run the commands below on the CentOS server:

  1. cat /proc/net/bonding/bond0

  2. lspci -d 15b3: -vvv

Hi Nikita,

I assume you are using multiple threads with iperf, right?

I think the issue is that the Linux bonding driver's default TX hash policy is layer2 (MAC based), so iperf traffic doesn't spread across the slaves, since L4 information (the different TCP ports) is not considered in the hash.

How to verify the hash policy:

cat /proc/net/bonding/bond0

Ethernet Channel Bonding Driver: v3.7.1 (April 27, 2011)

Bonding Mode: IEEE 802.3ad Dynamic link aggregation

Transmit Hash Policy: layer2 (0) <-------------------------------------

MII Status: down

MII Polling Interval (ms): 100

Up Delay (ms): 0

Down Delay (ms): 0

802.3ad info

LACP rate: fast

Min links: 0

Aggregator selection policy (ad_select): stable

bond bond0 has no active aggregator

To change that, add the xmit_hash_policy parameter in the bond ifcfg file:

BONDING_OPTS="mode=802.3ad xmit_hash_policy=layer3+4"

After the change:

[root@l-csi-demo-03 network-scripts]# cat /proc/net/bonding/bond0

Ethernet Channel Bonding Driver: v3.7.1 (April 27, 2011)

Bonding Mode: IEEE 802.3ad Dynamic link aggregation

Transmit Hash Policy: layer3+4 (1)
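As a side note, if editing the ifcfg file is inconvenient, the bonding driver also exposes this setting through sysfs. The sketch below assumes the bond is named bond0; the change is not persistent across reboots, and on older kernels it may require taking the bond down first:

```shell
# Non-persistent runtime change of the transmit hash policy
# (assumes an existing bond named bond0; lost on reboot)
echo layer3+4 > /sys/class/net/bond0/bonding/xmit_hash_policy

# Verify the active policy
cat /sys/class/net/bond0/bonding/xmit_hash_policy
```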

Hi Nikita,

Are you all set with the answer?

Hello, eddie.notz!

Yes, we run iperf with a high thread count (for example, -P 20).

We tried the xmit_hash_policy parameters, but nothing changed. The Linux machine has 2×50 Gbps links, but can transfer only up to 50 Gbps.

Here are our configuration files:

/etc/modprobe.d/bonding.conf

alias bond0 bonding

options bond0 miimon=80 mode=4 xmit_hash_policy=layer3+4 lacp_rate=1

/etc/sysconfig/network-scripts/ifcfg-bond0

DEVICE=bond0

IPADDR=**********

NETMASK=*************

GATEWAY=************

ONBOOT=yes

BOOTPROTO=none

USERCTL=no

MTU=9216

BONDING_OPTS="mode=802.3ad miimon=80 lacp_rate=1 xmit_hash_policy=layer3+4"

/proc/net/bonding/bond0

Ethernet Channel Bonding Driver: v3.7.1 (April 27, 2011)

Bonding Mode: IEEE 802.3ad Dynamic link aggregation

Transmit Hash Policy: layer3+4 (1)

MII Status: up

MII Polling Interval (ms): 80

Up Delay (ms): 0

Down Delay (ms): 0

802.3ad info

LACP rate: fast

Min links: 0

Aggregator selection policy (ad_select): stable

System priority: 65535

System MAC address: ***********

Active Aggregator Info:

Aggregator ID: 15

Number of ports: 2

Actor Key: 1

Partner Key: 21

Partner Mac Address: 44:38:39:ff:01:01

Slave Interface: ens4f0

MII Status: up

Speed: 50000 Mbps

Duplex: full

Link Failure Count: 0

Permanent HW addr: ************

Slave queue ID: 0

Aggregator ID: 15

Actor Churn State: none

Partner Churn State: none

Actor Churned Count: 0

Partner Churned Count: 0

details actor lacp pdu:

system priority: 65535

system mac address: ***************

port key: 1

port priority: 255

port number: 1

port state: 63

details partner lacp pdu:

system priority: 65535

system mac address: 44:38:39:ff:01:01

oper key: 21

port priority: 255

port number: 1

port state: 63

Slave Interface: ens4f1

MII Status: up

Speed: 50000 Mbps

Duplex: full

Link Failure Count: 0

Permanent HW addr: *************

Slave queue ID: 0

Aggregator ID: 15

Actor Churn State: none

Partner Churn State: none

Actor Churned Count: 0

Partner Churned Count: 0

details actor lacp pdu:

system priority: 65535

system mac address: *************

port key: 1

port priority: 255

port number: 2

port state: 63

details partner lacp pdu:

system priority: 65535

system mac address: 44:38:39:ff:01:01

oper key: 21

port priority: 255

port number: 1

port state: 63

Thank you for your response!

Hi Nikita,

This NIC is:

MCX414A-GCAT - ConnectX-4 EN network interface card, 50GbE dual-port QSFP28, PCIe3.0 x8, tall bracket, ROHS R6

PCIe3.0 x8 means the aggregate bandwidth of both ports will be around ~50G, since it is limited by the PCIe capabilities.

MCX416A-BCAT will be able to reach more than 50GbE, since it supports PCIe Gen3 x16:

ConnectX-4 EN network interface card, 40GbE dual-port QSFP, PCIe3.0 x16, tall bracket, ROHS R6

Also make sure that the PCIe slot itself supports x16.

Hi Eddie.

Yes, we have the MCX414A-GCAT ConnectX-4 EN network interface card, 50GbE dual-port QSFP28, PCIe3.0 x8.

PCIe3.0 x8 is limited to 63.04 Gbit/s: https://superuser.com/questions/586819/does-pcie-3-0-x8-provide-enough-bandwidth-for-a-dual-qsfp-40gbit-nic
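That 63 Gbit/s figure is just the raw PCIe 3.0 x8 link rate (8 lanes × 8 GT/s with 128b/130b encoding), before any protocol overhead; a quick check:

```shell
# Raw PCIe 3.0 x8 link rate: 8 lanes * 8 GT/s * 128/130 encoding
awk 'BEGIN { printf "%.1f Gbps\n", 8 * 8 * (128/130) }'
# prints "63.0 Gbps"
```

TLP headers and flow control then bring the usable rate further down toward the ~50G region.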

If we had measured a transfer rate even slightly above 50 Gbps, I would have been satisfied. However, we do not get 51 or 52 either.

What other ideas are there?

And of course many thanks for your help!

Original message from @Nikita Nikora