After a reboot, the single-port bidirectional test rate sometimes drops by about 15%, or an error is reported

Hardware: MCX516A-CDAT (100G, dual port) in a Supermicro 4124-GS-TNR. I am using ib_write_bw for single-port bidirectional testing.
On the first test the rate is 196Gb/s, which is fine. After a reboot it drops straight to 165Gb/s. After a second reboot it may still be 165Gb/s, but after yet another reboot it goes back to 196Gb/s.
When the rate is 165Gb/s, ib_write_bw also often hangs right after the command is started. Sometimes both ports hang, and sometimes the second port can still run.
server [d1]:
ib_write_bw -d mlx5_0 -R --run_infinitely --report_gbits
client[d2]:
ib_write_bw -d mlx5_0 -R --run_infinitely -b --report_gbits 192.168.200.25

When the test rate is normal, it shows the rate in the screenshot


When it is not normal, the port hangs and the test cannot run, and then the following error is reported. All I did beforehand was reboot the machine and run mst start.

Sometimes the test does run normally after a reboot, but the rate drops to 165Gb/s and stays there.

I have attached the output of ibdiagnet --pc --pm_pause_time 600 -P all=1 --get_phy_info --get_cable_info and of sysinfo-snapshot.py:
d1.zip (2.2 KB)
d2.zip (2.2 KB)
sysinfo-snapshot-v3.7.0-d1-20230410-112904.tgz (4.9 MB)
sysinfo-snapshot-v3.7.0-d2-20230410-112901.tgz (5.4 MB)

Hello,

This looks like an ARP resolution table issue, caused by having 2 IPs from the same subnet on the same host:

enp161s0f0np0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 4200
        inet 192.168.200.27 netmask 255.255.255.0 broadcast 192.168.200.255
        inet6 fe80::1270:fdff:fe30:b060 prefixlen 64 scopeid 0x20
        ether 10:70:fd:30:b0:60 txqueuelen 1000 (Ethernet)
        RX packets 6340 bytes 831225 (831.2 KB)
        RX errors 0 dropped 5943 overruns 0 frame 0
        TX packets 18 bytes 1356 (1.3 KB)
        TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0

enp161s0f1np1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 4200
        inet 192.168.200.28 netmask 255.255.255.0 broadcast 192.168.200.255
        inet6 fe80::1270:fdff:fe30:b061 prefixlen 64 scopeid 0x20
        ether 10:70:fd:30:b0:61 txqueuelen 1000 (Ethernet)
        RX packets 6342 bytes 831404 (831.4 KB)
        RX errors 0 dropped 5944 overruns 0 frame 0
        TX packets 19 bytes 1416 (1.4 KB)
        TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0

RoCE is affected in the connection establishment phase.

The problem:

“When a Linux box is connected to a network segment with multiple network cards, a potential problem with the link layer address to IP address mapping can occur. The machine may respond to ARP requests from both Ethernet interfaces. On the machine creating the ARP request, these multiple answers can cause confusion, or worse yet, non-deterministic population of the ARP cache. Known as ARP flux, this can lead to the possibly puzzling effect that an IP migrates non-deterministically through multiple link layer addresses. It’s important to understand that ARP flux typically only affects hosts which have multiple physical connections to the same medium or broadcast domain.” (2.1. Address Resolution Protocol (ARP))

In general, when there are 2 interfaces on the same subnet there is no assurance as to which interface will be used to transmit traffic and the machine will accept traffic for either IP on either interface.

In some cases, applications that use a specific interface for communication expect the same interface to be used on the return path.

If you ping with -I dev to force a given interface, there is no guarantee that the reply packet (if there is one at all) will come back on the same interface, so pings done with -I dev may not work.

RoCE connection establishment is one such case.

Illustration of the problem statement: Avoiding ARP Flux in Multi-Interface Linux Hosts

Explanation of all the ARP tunables: Chapter 2. Working with sysctl and kernel tunables Red Hat Enterprise Linux 7 | Red Hat Customer Portal

Explanation:

The OS kernel handles ARP resolution, the routing tables, and route selection. In a standard routing table, without additional configuration, there will be several entries for the same IP subnet pointing to different interfaces.

Without additional configuration, the OS routes packets through a default interface that it chooses. Usually that interface is the uppermost one in the routing table.

To prevent wrong ARP resolution, you need a source-based routing configuration plus the correct sysctl settings.

Or, even better, use IPs from different subnets on each server (meaning the first port on one subnet and the second port on another subnet).

Without one of these, ARP might not be resolved at all on a specific route, or might be resolved on the wrong interface.

Even if the ARP table is populated correctly, the handshake that is done for RDMA may still go out on the wrong interface because standard IP packets are routed incorrectly. That can result in RDMA being issued on the wrong interface, and eventually in actual failures.
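
As a rough sketch of what source-based routing plus sysctl settings could look like for the two ports shown above: the helper names and the routing table numbers 101 and 102 are arbitrary choices for illustration, and the arp_ignore/arp_announce values are the ones commonly suggested for ARP flux, not settings confirmed for this system. The script only prints the commands so they can be reviewed before being run as root.

```shell
# Hypothetical helpers: they print commands, they do not apply them.

# Stricter ARP behaviour:
#   arp_ignore=1   reply only if the target IP is configured on the
#                  interface that received the request
#   arp_announce=2 use the best local source address for ARP requests
print_arp_sysctls() {
  echo "sysctl -w net.ipv4.conf.all.arp_ignore=1"
  echo "sysctl -w net.ipv4.conf.all.arp_announce=2"
}

# Per-port source-based routing.
# Arguments: interface, its IPv4 address, subnet (CIDR), table number.
print_port_policy() {
  echo "ip route add $3 dev $1 src $2 table $4"
  echo "ip rule add from $2 table $4"
}

print_arp_sysctls
print_port_policy enp161s0f0np0 192.168.200.27 192.168.200.0/24 101
print_port_policy enp161s0f1np1 192.168.200.28 192.168.200.0/24 102
```

Putting each port on its own subnet, as suggested above, avoids the need for any of this.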

Best Regards,
Viki

I understand. I will try reconfiguring as you suggested, but I still have a question: sometimes after a reboot the speed test shows 165Gb/s. Why is that?

Hi, make sure that all the CPUs are free and that the previous job has finished completely before running a new job.
Also, try running with more queues (-q 2), or use taskset to bind the sender to cores on the NUMA node close to the adapter.
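
A minimal sketch of the taskset suggestion, assuming the adapter exposes its local CPUs through the usual PCI sysfs attribute local_cpulist. The helper names are made up, and the second argument of local_cpus exists only so the function can be tried against a different directory tree.

```shell
# Hypothetical helpers for pinning a perftest run near the adapter.

# Print the CPU list local to an RDMA device, e.g. "0-15".
# $1 = device name (e.g. mlx5_0), $2 = sysfs root (defaults to /sys).
local_cpus() {
  cat "${2:-/sys}/class/infiniband/$1/device/local_cpulist"
}

# Print a command line prefixed with taskset for the given CPU list.
# $1 = CPU list, remaining arguments = the command to pin.
pinned_cmd() {
  cpus=$1
  shift
  echo "taskset -c $cpus $*"
}

# Usage on a real host (client side, with two queues):
#   pinned_cmd "$(local_cpus mlx5_0)" \
#     ib_write_bw -d mlx5_0 -R -q 2 -b --report_gbits 192.168.200.25
```

The printed line can then be run directly once the CPU list looks right.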

Best Regards,
Viki
