Hashing of egress RoCEv2 connections over link aggregates

Hello,

Does someone know whether the egress port selection for RoCEv2 connections on a dual-port CX-7 follows the Linux bonding ‘Transmit Hash Policy’ knob, or is this configured via some other means?

The cards are running firmware 28.36.1700 on AlmaLinux 9.2 with kernel 5.14.0-284.30.1.el9_2.x86_64 and mlx5_core 5.8-3.0.7.

A bit of background/context:

I have some dual-port CX-7 cards configured in a bonding arrangement using LACP. They correctly form the mlx5_bond_0 ‘ibdevice’, and I can seemingly run RDMA over these connections without any obvious issues.

However, I’ve noticed that the egress path selection is far from ideal in our somewhat peculiar scenario, and I suspect this has to do with the fields used to construct the hash that determines which of the two ports is used. At the moment the policy is set to L2+L3, which doesn’t make much sense to me. Before attempting to modify it, I wanted to ask whether RoCEv2 connections should follow what is defined by the OS-level ‘Transmit Hash Policy’ bonding option, or whether this is controlled by some other means (mlxconfig, etc.).
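
For reference, the currently active policy at the OS level can be read like this (bond0 is just a placeholder for our actual bond interface):

# Bonding driver's view of the transmit hash policy
grep "Transmit Hash Policy" /proc/net/bonding/bond0
# prints e.g. "Transmit Hash Policy: layer2+3 (2)"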

Another question: does someone know of a document that covers what the ROCE_ADAPTIVE_ROUTING_EN knob in mlxconfig does and whether it is related to egress path selection?
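
For what it’s worth, the current value can at least be queried with mlxconfig (the BDF below is a placeholder for one of our cards):

# read back the current firmware setting
mlxconfig -d 0000:b5:00.0 query ROCE_ADAPTIVE_ROUTING_EN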

Any insights on the matter are greatly appreciated,


Vesa

Hi vsimola,

You can use the queue affinity method if hashing does not work well.

Change the bond TX port selection mode (this should be done before the bond is created):

# replace 0000:b5:00.0 / 0000:b5:00.1 with your card's PCIe BDFs
mlxconfig -d 0000:b5:00.0 -y s LAG_RESOURCE_ALLOCATION=0
mlxconfig -d 0000:b5:00.1 -y s LAG_RESOURCE_ALLOCATION=0

Reboot the server, then switch the port selection mode:

# replace the interface names with your own
echo queue_affinity > /sys/class/net/enp1s0f0np0/compat/devlink/lag_port_select_mode
echo queue_affinity > /sys/class/net/enp1s0f1np1/compat/devlink/lag_port_select_mode
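
If needed, you can verify the firmware setting after the reboot (same placeholder BDF as above):

mlxconfig -d 0000:b5:00.0 query LAG_RESOURCE_ALLOCATION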

Thank you
Meng, Shi

Hello,

Thank you for the suggestion. I tried the procedure you mentioned but didn’t notice any difference in behavior. Can you perhaps verify whether the operating-system-level choice of hash input data has any impact on the RoCE traffic? For example, the Transmit Hash Policy shown in /proc/net/bonding/bondX, or is the input data controlled in some other way? I guess my question is: what data is considered for the hash? I’ve tried all three of the obvious choices (L2, L2+L3 and L3+L4) at the OS level, but so far they do not seem to affect the behavior.
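
For completeness, switching between the policies looks roughly like this in our setup ("Bond $NAME" stands in for the actual bond connection profile):

# $NAME is a placeholder for the real bond connection name
nmcli connection modify "Bond $NAME" +bond.options "xmit_hash_policy=layer3+4"
nmcli connection down "Bond $NAME"
nmcli connection up "Bond $NAME"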

The traffic flow I am testing with is “one-to-two”: one instance of ib_send_bw is listening on each of the two receivers, and the sender establishes two identical connections in the following fashion:

ib_send_bw -q $Q -s $MSIZE --report_gbits -D $DURATION -d $DEVICE -p $C_PORT $RECEIVER1 &
ib_send_bw -q $Q -s $MSIZE --report_gbits -D $DURATION -d $DEVICE -p $C_PORT $RECEIVER2
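
To see which physical port each flow actually ends up on, the per-port hardware counters of the two slaves can be watched on the sender while the test runs (the interface names are placeholders for our bond slaves; the *_phy counters are read from the NIC, so they should also include the RDMA traffic):

# hypothetical slave interface names; tx_bytes_phy is the physical-port byte counter
watch -n 1 'ethtool -S enp1s0f0np0 | grep -w tx_bytes_phy; ethtool -S enp1s0f1np1 | grep -w tx_bytes_phy'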

Maybe a bit of a silly question related to the same matter: if I establish several queue pairs, would they all share the same UDP source port? Surely, if I start several instances of ib_send_bw, each with N queue pairs, they would use different source ports and hence provide entropy for the hash, assuming the source port counts towards the input?

Thanks for your help,


Vesa

Hello,

It seems that the queue affinity approach worked after all.

The key here was to restart the bond interface after changing to queue_affinity:

# repeat for each slave interface
echo queue_affinity > /sys/class/net/$INTERFACE/compat/devlink/lag_port_select_mode
nmcli connection down "Bond $NAME"
nmcli connection up "Bond $NAME"
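
After the restart, the active mode can be read back from the same sysfs path (again, $INTERFACE is a placeholder); as far as I understand, it should now report queue_affinity:

cat /sys/class/net/$INTERFACE/compat/devlink/lag_port_select_mode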

This appeared to solve the issue for now. Thanks a lot for your help.


Vesa
