Hashing of egress RoCEv2 connections over link aggregates

vsimola · April 24, 2024, 7:57am

Hello,

Does someone know whether the egress port selection of RoCEv2 connections on a dual port CX-7 follow the Linux/Bonding ‘Transmit hash Policy’ knob, or is this configured via some other means?

Cards are running 28.36.1700 fw on Alma Linux 9.2 with 5.14.0-284.30.1.el9_2.x86_64 kernel mlx5_core/5.8-3.0.7

Bit background/context:

I have some dual port CX-7 cards configured in a bonding arrangement using LACP and they also correctly form mlx5_bond_0 ‘ibdevice’, I can seemingly successfully run RDMA over the said connections without any obvious issues.

However, I’ve noticed that the egress path selection is far from ideal in our bit peculiar scenario and I suspect this has to do with the elements used in the construction of the hash to determine which one of the two ports to use. At the moment this is set to L2+L3 which doesn’t make much sense to me. Before attempting to modify this I wanted to ask whether the RoCEv2 connections should follow what is defined at the OS ‘Transmit Hash Policy’ bonding option, or if this is controlled via some other means, mlxconfig etc.

Another question: Does someone know of a document that covers what ROCE_ADAPTIVE_ROUTING_EN knob in the mlxconfig does and whether it is related to the egress path selection?

Any insights on the matter are greatly appreciated,

–
Vesa

shim1 · April 26, 2024, 6:02am

hi vsimola

you can use queue affinity method if hash can not work well:
Change bond TX mode(should before bond create):
mlxconfig -d 0000:b5:00.0(replace with your card pcie BDF) -y s LAG_RESOURCE_ALLOCATION=0
mlxconfig -d 0000:b5:00.1 -y s LAG_RESOURCE_ALLOCATION=0
reboot the server
echo queue_affinity > /sys/class/net/enp1s0f0np0(replace with your interface)/compat/devlink/lag_port_select_mode
echo queue_affinity > /sys/class/net/enp1s0f1np1/compat/devlink/lag_port_select_mode

Thank you
Meng, Shi

vsimola · May 2, 2024, 9:55am

Hello,

Thank you for the suggestion. I tried the procedure you mentioned but didn’t notice any difference in the behavior. Can you maybe verify whether the operating system level choice of hash input data has any impact on the roce traffic? E.g. Transmit hash policy in /proc/net/bonding/bondX, or is the input data controlled somehow differently? I guess my question is, what data is considered for the hash? I’ve tried all of the three obvious choices (L2, L2+L3 and L3+L4) at the OS level but so far it seems they do not impact the behavior.

The traffic flow I am testing with is “one-to-two” where I have one instance of ib_send_bw listening on both receivers and sender is establishing two identical connections in the below fashion:

ib_send_bw -q $Q -s $MSIZE --report_gbits -D $DURATION -d $DEVICE -p $C_PORT $RECEIVER1 &
ib_send_bw -q $Q -s $MSIZE --report_gbits -D $DURATION -d $DEVICE -p $C_PORT $RECEIVER2

Maybe bit silly question related to the same matter: If i establish several queue pairs, would they all share the same udp source port? Surely, if i start several instances of ib_send_bw each with N queue pairs they’d use different source ports, hence providing entropy for the hash assuming that the source port counts towards the input?

Thanks for your help,

–
Vesa

vsimola · May 3, 2024, 8:52am

Hello,

Seems that the queue affinity thing worked after all.

The key here was to restart the bond interface after changing to queue_affinity:

echo queue_affinity > /sys/class/net/$INTERFACE(s)/compat/devlink/lag_port_select_mode
nmcli connection down "Bond $NAME"
nmcli connection up "Bond $NAME"

This appeared to solve the issue for now. Thanks alot for your help.

–
Vesa

system · May 17, 2024, 8:52am

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.

Topic		Replies	Views
Patch needed to activate ROCEV2 for Connect 3X 10G card Ethernet Adapter Cards	12	363	February 8, 2017
Bonding/teaming over multiple adapters. Adapters and Cables	3	425	October 15, 2021
How to configure host chaining for ConnectX-5 VPI InfiniBand/VPI Adapter Cards	16	2455	October 29, 2018
ConnectX-4 RoCE speed less than expected Ethernet Adapter Cards	7	1450	July 14, 2023
Where's the procedure of packing network protocol header in RoCE v2?	13	891	December 12, 2017
ConnectX5 ASAP2 OVS VXLAN offload + bond not working properly Ethernet Adapter Cards asap2 , problem , adapters-and-cables , port , echo	5	1352	June 30, 2021
mlx5_0 port is Down ! InfiniBand/VPI Adapter Cards	9	3415	April 12, 2024
Trouble on ConnectX5 & RoCE v2 on Linux programming InfiniBand/VPI Adapter Cards	8	1898	October 4, 2023
Low throughput on Connectx-3 on both Linux and Windows InfiniBand/VPI Adapter Cards	3	1796	September 5, 2019
Teaming (or bonding) ports on Connect-X 4 with MLAG. InfiniBand/VPI Switch Systems	10	1854	December 6, 2018

Hashing of egress RoCEv2 connections over link aggregates

Related topics