Unexpected source MAC seen when sending RoCE traffic with ib_send_bw -R (rdma_cm)

Hi,

I have a bunch of hosts talking RoCEv2 across LACP link aggregates that terminate on two different switches (EVPN/ESI), and I’ve noticed that the switches keep blocking ports because they see duplicate MAC addresses, i.e. MAC address table fluctuation. I can reliably reproduce this when using ib_send_bw with the -R (rdma_cm) flag; the behavior doesn’t seem to surface if I leave -R out. The problem also doesn’t show up if I disable the ports connecting to one of the two switches, and it does not show up if I heavily load the setup with iperf3.

What I’ve done so far as an attempt to understand what is going on:

On the hosts involved, I’ve run tcpdump using mlx5_bond_0 as the device and filtered out the MACs of the cards forming the bond. I can see egress frames that I believe shouldn’t be visible on the hosts, as if the host were indeed sending traffic with the wrong source MAC. From the host NIC’s perspective, this tcpdump output could also occur if the switch started to behave like a hub. However, that doesn’t seem to be the case, since I don’t see these frames when I run ib_send_bw without -R or use other tools such as iperf3 to load the links. Hub-like behavior also wouldn’t explain the duplicate MAC detection triggering, as the switches at least appear to be learning MACs.
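For anyone who wants to repeat the capture described above, here is a minimal sketch of building the capture filter. It assumes the capture accepts ordinary pcap filter syntax (`ether src`); the helper names and the MAC addresses are placeholders, not values from my setup.

```python
import re

MAC_RE = re.compile(r"^([0-9a-f]{2}:){5}[0-9a-f]{2}$")


def normalize_mac(mac):
    """Lower-case a MAC address and accept '-' as a separator."""
    mac = mac.strip().lower().replace("-", ":")
    if not MAC_RE.match(mac):
        raise ValueError(f"not a MAC address: {mac!r}")
    return mac


def unexpected_src_filter(bond_member_macs):
    """Build a pcap filter that keeps only frames whose source MAC
    is NOT one of the bond members' MACs."""
    macs = sorted({normalize_mac(m) for m in bond_member_macs})
    if not macs:
        raise ValueError("need at least one MAC address")
    clauses = " or ".join(f"ether src {m}" for m in macs)
    return f"not ({clauses})"


if __name__ == "__main__":
    # Placeholder MACs; substitute the addresses of the two bonded ports.
    expr = unexpected_src_filter(["02:00:00:00:00:01", "02:00:00:00:00:02"])
    # -e prints the link-level header so the source MAC is visible, -n skips name lookups.
    print(f"tcpdump -e -n -i mlx5_bond_0 '{expr}'")
```

Any frame that still shows up with this filter has a source MAC that does not belong to the bond, which is the symptom described above.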

I would appreciate any thoughts on what might be going on here. Is there a known reason why the source MAC would differ from what is on the NIC when using -R with ib_send_bw over RoCE LAG, and why the source MAC behavior might depend on whether -R is used? What troubleshooting steps should I consider taking next?

Cards are MCX755106AS-HEA_Ax running FW 28.39.3004 on AlmaLinux 9.6 with kernel 5.14.0-570.44.1.el9_6.x86_64; ib_send_bw version is 6.23.

Any insight on the matter is greatly appreciated,

Vesa

It seems that changing the /sys/class/net/$interface_name/compat/devlink/lag_port_select_mode from queue_affinity to hash stops this from happening.
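A small sketch of checking and switching the mode through the sysfs path above. It assumes the file reads back the bare mode name and accepts the two values mentioned in this thread; the function names are made up for illustration, and writing will need root (and possibly the bond to be in a particular state, which I have not verified).

```python
from pathlib import Path

# The two mode names discussed in this thread.
VALID_MODES = ("queue_affinity", "hash")


def mode_path(interface, sysfs_root="/sys/class/net"):
    """Return the path of lag_port_select_mode for one interface."""
    return Path(sysfs_root) / interface / "compat" / "devlink" / "lag_port_select_mode"


def get_mode(interface, sysfs_root="/sys/class/net"):
    """Read the current mode, stripped of surrounding whitespace."""
    return mode_path(interface, sysfs_root).read_text().strip()


def set_mode(interface, mode, sysfs_root="/sys/class/net"):
    """Write a new mode after checking it against the known names."""
    if mode not in VALID_MODES:
        raise ValueError(f"unknown mode {mode!r}, expected one of {VALID_MODES}")
    mode_path(interface, sysfs_root).write_text(mode + "\n")
```

Calling `get_mode("<interface>")` before and after `set_mode("<interface>", "hash")` on each bonded port makes it easy to confirm which mode a host was in when a capture was taken. The `sysfs_root` parameter exists only so the helpers can be tried against a scratch directory.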

Does anyone know what exactly the differences between these two LAG modes are and how they impact rdma_cm behavior? The available documentation is a bit vague in this regard:

hash mode (this is quite clear, it looks like a normal n-tuple hash of headers): In this mode, packets are distributed to ports according to the hash on packet headers.

queue affinity mode: In this mode, packets are distributed according to the QPs.

The documentation doesn’t really explain what the impact of queue affinity is or why it might have this kind of effect when using the -R flag of ib_send_bw.
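To make the question concrete, here is a toy model of the two one-line descriptions quoted above. It is purely illustrative and is not the mlx5 implementation: the CRC32 hash, the choice of header fields, and the round-robin assignment of QPs to ports are all assumptions.

```python
import zlib
from itertools import count

NUM_PORTS = 2  # two physical ports in the bond


def hash_mode_port(src_ip, dst_ip, src_port, dst_port, num_ports=NUM_PORTS):
    """Hash mode (toy): the egress port is a function of packet headers only."""
    key = f"{src_ip}|{dst_ip}|{src_port}|{dst_port}".encode()
    return zlib.crc32(key) % num_ports


class QueueAffinityModel:
    """Queue affinity mode (toy): each QP is pinned to a port when it is
    created, here by round robin (an assumption), regardless of headers."""

    def __init__(self, num_ports=NUM_PORTS):
        self.num_ports = num_ports
        self._next = count()
        self.affinity = {}

    def create_qp(self, qp_num):
        port = next(self._next) % self.num_ports
        self.affinity[qp_num] = port
        return port

    def port_for(self, qp_num):
        return self.affinity[qp_num]
```

In this picture, hash mode always sends a given flow out of the same port, while queue affinity ties the egress port to the QP rather than to anything in the packet headers. Whether that decoupling has anything to do with the source MAC seen with -R is exactly what I cannot tell from the documentation.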

Any insights are appreciated.

Vesa