MCX556A-EDAT: Direct Connection via Ethernet unable to reach more than 73 Gbit/s

Hello,

I’m trying to reach 100 Gbit/s over two directly connected MCX556A cards. I am using OFED 5.4.1 on CentOS 7.9 with the stock kernel (3.10.0-1160).

I have executed mlnx_tune and set additional parameters:

sysctl net.core.rmem_max=2147483647

sysctl net.core.wmem_max=2147483647

sysctl net.ipv4.tcp_rmem="4096 87380 2147483647"

sysctl net.ipv4.tcp_wmem="4096 65536 2147483647"

sysctl net.core.netdev_max_backlog=250000

# don't cache ssthresh from previous connections
sysctl net.ipv4.tcp_no_metrics_save=1

# explicitly set htcp as the congestion control: cubic is buggy in older 2.6 kernels
sysctl net.ipv4.tcp_congestion_control=htcp

# if you are using jumbo frames, also set this
sysctl net.ipv4.tcp_mtu_probing=1

# recommended for CentOS 7 / Debian 8 hosts
sysctl net.core.default_qdisc=fq
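
For completeness, a minimal sketch of how these settings can be made persistent across reboots on CentOS 7 (the file name is arbitrary):

# /etc/sysctl.d/90-network-tuning.conf
net.core.rmem_max = 2147483647
net.core.wmem_max = 2147483647
net.ipv4.tcp_rmem = 4096 87380 2147483647
net.ipv4.tcp_wmem = 4096 65536 2147483647
net.core.netdev_max_backlog = 250000
net.ipv4.tcp_no_metrics_save = 1
net.ipv4.tcp_congestion_control = htcp
net.ipv4.tcp_mtu_probing = 1
net.core.default_qdisc = fq

# reload all sysctl configuration without a reboot
sysctl --system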

The hosts are 2x AMD EPYC 7542 with 1 TB of memory; htop and top show 1-2% utilization during the tests. The CPU is configured for 4 NUMA nodes, and the adapter is bound to the corresponding one. The adapter is connected via PCIe Gen4 x16. RPS and XPS CPUs are pinned.
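
For reference, a sketch of the NUMA checks and pinning I mean (eth2 is an example interface name; the set_irq_affinity_bynode.sh helper ships with MLNX_OFED / mlnx-tools):

# NUMA node the adapter is attached to
cat /sys/class/net/eth2/device/numa_node

# cores belonging to each node
lscpu | grep NUMA

# steer the adapter's IRQs to that node (MLNX_OFED helper)
set_irq_affinity_bynode.sh <numa_node> eth2

# run the benchmark on cores and memory of the same node
numactl --cpunodebind=<numa_node> --membind=<numa_node> iperf3 -c <server_ip> -t 30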

The eth interfaces are set to mtu 9000.

I’m testing with iperf, iperf3 and raw_ethernet_bw. The maximum I was able to achieve was 73 Gbit/s. iperf and iperf3 are run as separate processes; I have tried from 2 to 8 processes, with the same result every time. iperf3 does report some retransmits, but only around 200-500 for a 30-second test.
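
A minimal sketch of what such a multi-process run looks like (ports, core numbers and <server_ip> are placeholders; iperf3 itself is single-threaded, so separate processes are needed rather than just -P):

# server side: one iperf3 instance per port
for p in 5201 5202 5203 5204; do iperf3 -s -p $p -D; done

# client side: one instance per port, each pinned to its own core
iperf3 -c <server_ip> -p 5201 -A 0 -t 30 &
iperf3 -c <server_ip> -p 5202 -A 1 -t 30 &
iperf3 -c <server_ip> -p 5203 -A 2 -t 30 &
iperf3 -c <server_ip> -p 5204 -A 3 -t 30 &
wait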

I did the same test with a switch in between (Dell S5232F-ON); there I had much higher retransmit counts, around 50k.

I have tested by reducing the link speed to 50G and 25G, and both times I can reach the maximum (46.3 Gbit/s and 23.2 Gbit/s respectively) - so at 100G I would expect 4x 23.2 Gbit/s, around 92.8 Gbit/s.
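
For anyone reproducing this: one way to force a lower link speed is via ethtool (eth2 is an example interface name, speeds are in Mbit/s; behaviour can depend on the cable/transceiver):

# force 50G / 25G, then back to 100G
ethtool -s eth2 speed 50000 autoneg off
ethtool -s eth2 speed 25000 autoneg off
ethtool -s eth2 speed 100000 autoneg off

# confirm the negotiated speed
ethtool eth2 | grep Speed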

Locally (lo interface) I can easily reach 190 Gbit/s send/receive.

I have followed the tuning guidelines:

https://community.mellanox.com/s/article/performance-tuning-for-mellanox-adapters

https://community.mellanox.com/s/article/how-to-tune-an-amd-server--eypc-cpu--for-maximum-performance

I will still test with a different cable, but mlxlink doesn’t report any issues.

What else can be checked? How can I find out WHAT is limiting the performance here?

How can I test a loopback configuration?

Hi Rosenstein,

Thank you for posting your question on our community.

As you mentioned these are AMD CPU-based hosts, can you please confirm that the two requirements below are met, as they help improve performance on AMD-based CPUs:

a. The GRUB command line in use has "iommu=pt". Please share the output of # cat /proc/cmdline
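
If it is missing, a sketch of how it can be added on CentOS 7 (the path assumes a BIOS/grub2 install; UEFI systems use /boot/efi/EFI/centos/grub.cfg instead):

# append iommu=pt to GRUB_CMDLINE_LINUX in /etc/default/grub, then:
grub2-mkconfig -o /boot/grub2/grub.cfg
reboot

# verify after reboot
grep -o iommu=pt /proc/cmdline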

b. Are all DIMMs populated?

In addition, as you are using OFED 5.4, I believe you have the latest firmware installed, unless you installed the driver using the "--without-fw-update" flag.
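
The installed firmware version can be cross-checked with, for example (the interface name is a placeholder; mlxfwmanager requires the MFT tools):

# version reported by the driver
ethtool -i <interface> | grep firmware

# or query the device directly
mlxfwmanager --query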

In case you have the above parameters in place and still see reduced performance, we will open a support ticket as I see your account holds a valid support contract.

Thanks,

Namrata.

Edit: it seems to have actually worked; I can now reach 92.4 Gbit/s via Ethernet, same as via RDMA.

@Namrata Motihar

Hi, very sorry I have not responded earlier, I actually did not see your post!

I have added iommu=pt, but it did not change anything - we are not using SR-IOV, just plain bare-metal hardware.

cmdline:

BOOT_IMAGE=/vmlinuz-5.10.37 root=/dev/mapper/cl-root ro crashkernel=896M rd.lvm.lv=cl/root net.ifnames=0 biosdevname=0 scsi_mod.use_blk_mq=1 dm_mod.use_blk_mq=y mitigations=off console=tty0 console=ttyS1,115200 iommu=pt

Please disregard the 5.10.37 kernel here; I have rebooted into the up-to-date kernel, and the cmdline is the same.

b) 8 DIMMs are populated per CPU:

description: DIMM DDR4 Synchronous Registered (Buffered) 3200 MHz (0.3 ns)

product: HMAA8GR7MJR4N-XN

vendor: Hynix Semiconductor (Hyundai Electronics)

physical id: 17

serial: 933237DF

slot: B8

size: 64GiB

width: 64 bits

clock: 3200MHz (0.3ns)
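
(For completeness, population across all slots can also be checked quickly with something like the line below; empty slots show up as "No Module Installed".)

dmidecode -t memory | grep -E 'Size|Locator'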

I can reach 99 Gbit/s via InfiniBand (ib0) and 91 Gbit/s via Ethernet (eth2) when using ib_read_bw / ib_write_bw.
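
For reference, the kind of invocation meant here (mlx5_0 and <server_ip> are placeholders; --report_gbits reports bandwidth in Gbit/s):

# server
ib_write_bw -d mlx5_0 -F --report_gbits

# client
ib_write_bw -d mlx5_0 -F --report_gbits <server_ip>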

Using iperf3 or iperf I max out at around 60-70 Gbit/s (MTU 9000).

We do have an active support contract, currently until next week.