After every reboot, the single-port bidirectional test rate may drop by about 15%, or an error is reported

346283191 · April 10, 2023, 11:36am

MCX516A-CDAT, 100G dual port, in a Supermicro 4124-GS-TNR. I am using ib_write_bw for single-port bidirectional testing.
The first speed test ran at 196Gb/s with no problem. After a reboot it dropped straight to 165Gb/s. After another reboot it may still be 165Gb/s, but after rebooting yet again it went back to 196Gb/s.
When the measured rate is 165Gb/s, ib_write_bw also often hangs right after the command is executed. Sometimes both ports hang, sometimes the second port can still run.
server [d1]:
ib_write_bw -d mlx5_0 -R --run_infinitely --report_gbits
client [d2]:
ib_write_bw -d mlx5_0 -R --run_infinitely -b --report_gbits 192.168.200.25

When the test rate is normal, it shows the rate in the screenshot

When it is not normal, the port hangs and cannot be tested, and then the following error is reported. All I did was reboot the machine and then run mst start.

It is also possible for the test to run normally after a reboot, but then the rate drops to 165Gb/s and stays there without recovering.

I have attached the output of ibdiagnet --pc --pm_pause_time 600 -P all=1 --get_phy_info --get_cable_info and of sysinfo-snapshot.py:
d1.zip (2.2 KB)
d2.zip (2.2 KB)
sysinfo-snapshot-v3.7.0-d1-20230410-112904.tgz (4.9 MB)
sysinfo-snapshot-v3.7.0-d2-20230410-112901.tgz (5.4 MB)

vikiz · April 11, 2023, 9:59am

Hello,

This looks like an ARP resolution issue, caused by the fact that you have 2 IPs from the same subnet on the same host:

enp161s0f0np0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 4200
inet 192.168.200.27 netmask 255.255.255.0 broadcast 192.168.200.255
inet6 fe80::1270:fdff:fe30:b060 prefixlen 64 scopeid 0x20
ether 10:70:fd:30:b0:60 txqueuelen 1000 (Ethernet)
RX packets 6340 bytes 831225 (831.2 KB)
RX errors 0 dropped 5943 overruns 0 frame 0
TX packets 18 bytes 1356 (1.3 KB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0

enp161s0f1np1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 4200
inet 192.168.200.28 netmask 255.255.255.0 broadcast 192.168.200.255
inet6 fe80::1270:fdff:fe30:b061 prefixlen 64 scopeid 0x20
ether 10:70:fd:30:b0:61 txqueuelen 1000 (Ethernet)
RX packets 6342 bytes 831404 (831.4 KB)
RX errors 0 dropped 5944 overruns 0 frame 0
TX packets 19 bytes 1416 (1.4 KB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0

RoCE is affected in the connection establishment phase.

The problem:

“When a Linux box is connected to a network segment with multiple network cards, a potential problem with the link layer address to IP address mapping can occur. The machine may respond to ARP requests from both Ethernet interfaces. On the machine creating the ARP request, these multiple answers can cause confusion, or worse yet, non-deterministic population of the ARP cache. Known as ARP flux, this can lead to the possibly puzzling effect that an IP migrates non-deterministically through multiple link layer addresses. It’s important to understand that ARP flux typically only affects hosts which have multiple physical connections to the same medium or broadcast domain.” (2.1. Address Resolution Protocol (ARP))

In general, when there are 2 interfaces on the same subnet there is no assurance as to which interface will be used to transmit traffic and the machine will accept traffic for either IP on either interface.

Some applications that use a specific interface for communication expect the same interface to be used on the return path.

If you ping with -I dev, attempting to use a given interface, there is no guarantee the reply packet (if there even is one) will come back to the same interface, so pings done with -I dev may not work.

RoCE connection establishment is one such case.

Illustration of the problem statement: Avoiding ARP Flux in Multi-Interface Linux Hosts

Explanation of all the ARP tunables: Chapter 2. Working with sysctl and kernel tunables Red Hat Enterprise Linux 7 | Red Hat Customer Portal

Explanation:

The OS kernel handles ARP resolution, the routing tables, and the routing algorithms. In a standard routing table, without any additional settings, there will be several entries for the same IP subnet pointing to different interfaces.

Without additional settings, the OS routes packets through a default interface it chooses. Usually that is the interface listed first in the routing table.
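
To check whether a host is in this situation, the routing table can be scanned for a connected subnet that appears on more than one interface. The following is only a rough sketch: find_shared_subnets is a hypothetical helper name, it assumes the usual `ip -4 route show` line layout from iproute2, and the default route and eno1 in the sample input are made up for illustration.

```shell
#!/bin/sh
# Reads `ip -4 route show` output on stdin and prints every directly
# connected prefix that is reachable through more than one interface.
find_shared_subnets() {
    awk '
        # Connected routes look like:
        #   192.168.200.0/24 dev IFACE proto kernel scope link src ADDR
        $2 == "dev" && /scope link/ {
            key = $1 " " $3
            if (!(key in seen)) {
                seen[key] = 1
                count[$1]++
                if (devs[$1] == "")
                    devs[$1] = $3
                else
                    devs[$1] = devs[$1] " " $3
            }
        }
        END {
            for (p in count)
                if (count[p] > 1)
                    print p ": " devs[p]
        }
    '
}

# Sample input modelled on the interfaces shown above.
find_shared_subnets <<'EOF'
default via 192.168.1.1 dev eno1 proto static
192.168.200.0/24 dev enp161s0f0np0 proto kernel scope link src 192.168.200.27
192.168.200.0/24 dev enp161s0f1np1 proto kernel scope link src 192.168.200.28
EOF
```

On a live host the same function could be fed with `ip -4 route show | find_shared_subnets`. Any line it prints names a subnet where the kernel is free to pick either interface.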

To prevent incorrect ARP resolution, source-based routing needs to be configured together with the correct sysctl settings.

Or, even better, use IPs from different subnets on each server (meaning the first port on one subnet and the second port on another subnet).

Without those, ARP might not be resolved at all on a specific route, or might be resolved on the wrong interface.
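
As an illustration of what source-based routing plus sysctl settings could look like for the two ports shown above, here is a hedged dry-run sketch. It only prints commands so they can be reviewed before being run as root. The function names, the table numbers 101 and 102, and the choice of arp_ignore=1 and arp_announce=2 are assumptions for this example and not values taken from this thread; the right values depend on the setup.

```shell
#!/bin/sh
# Dry-run helper: prints the commands instead of executing them.

# Print ARP-related sysctl settings often used against ARP flux (assumed values).
print_arp_sysctls() {
    # arp_ignore=1: reply only if the target IP is configured on the
    # interface that received the ARP request.
    echo "sysctl -w net.ipv4.conf.all.arp_ignore=1"
    # arp_announce=2: always use the best local address for the target
    # as the source address in ARP requests.
    echo "sysctl -w net.ipv4.conf.all.arp_announce=2"
}

# Print source-based routing commands for one port.
# Usage: print_source_routing DEV IP SUBNET TABLE
print_source_routing() {
    dev=$1
    ip=$2
    subnet=$3
    table=$4
    # Per-port table: reach the subnet through this port only.
    echo "ip route add $subnet dev $dev src $ip table $table"
    # Traffic sourced from this port's IP consults that table.
    echo "ip rule add from $ip table $table"
}

print_arp_sysctls
# Table numbers 101/102 are arbitrary, hypothetical choices.
print_source_routing enp161s0f0np0 192.168.200.27 192.168.200.0/24 101
print_source_routing enp161s0f1np1 192.168.200.28 192.168.200.0/24 102
```

Settings applied this way do not persist across a reboot, so they would need to go into the distribution's network and sysctl configuration once verified.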

Even if the ARP table is populated correctly, the handshake that is done for RDMA may still happen on the wrong interface because standard IP packets are routed incorrectly. That can result in RDMA being issued on the wrong interface and thus eventually in actual failures.

Best Regards,
Viki

346283191 · April 11, 2023, 10:07am

I understand, and I will try reconfiguring according to what you said. But I still have one problem: sometimes after a reboot, the speed test shows 165Gb/s. Why is that?

vikiz · April 16, 2023, 10:16am

Hi, make sure that all the CPUs are free and that the previous job finished completely before running a new job.
Also, try to run with more queue pairs (-q 2), or use taskset to bind the sender to cores on the NUMA node close to the adapter.
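
A minimal sketch of the taskset suggestion, assuming the adapter exposes its NUMA node under the usual Linux sysfs paths. local_cpus_for_device is a hypothetical helper name, and the SYSFS_ROOT override exists only so the function can be exercised without real hardware.

```shell
#!/bin/sh
# Print the CPU list that is local to an RDMA device's NUMA node.
# Usage: local_cpus_for_device mlx5_0
local_cpus_for_device() {
    dev=$1
    root=${SYSFS_ROOT:-/sys}
    node=$(cat "$root/class/infiniband/$dev/device/numa_node") || return 1
    # A value of -1 means the kernel reports no NUMA affinity for the device.
    [ "$node" -ge 0 ] || return 1
    cat "$root/devices/system/node/node$node/cpulist"
}

# Example for the server side (d1), reusing the original command plus -q 2:
#   cpus=$(local_cpus_for_device mlx5_0) &&
#   taskset -c "$cpus" ib_write_bw -d mlx5_0 -R -q 2 --run_infinitely --report_gbits
```

The client side would be pinned the same way, with its own flags and the server IP appended.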

Best Regards,
Viki

system · April 30, 2023, 10:17am

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.
