We have a few servers with MCX623106AS-CDAT Ethernet 100Gb 2-port QSFP56 cards. Is there a published performance baseline for these cards? What should I expect to see when running a raw_ethernet_bw test between two of them?

This is what I see, testing on two new HPE DL385 Gen10 Plus v2 servers with latest-generation EPYC CPUs, and one Arista 7800 switch between them.

Ubuntu 20.04.1, kernel 5.4.0-81-generic

server: raw_ethernet_bw --server -d mlx5_0 -B 88:e9:a4:33:48:b1 -F --duration 20

client: raw_ethernet_bw --client -d mlx5_0 -B 88:e9:a4:20:20:d3 -E 88:e9:a4:33:48:b1 -F --duration 20
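(Side note for anyone reproducing this: the mlx5 device name and the port MACs passed to -B / -E can be looked up with the commands below; ibdev2netdev ships with MLNX_OFED.)

ibdev2netdev # maps mlx5_0 to its Ethernet interface
ip link show # shows each port's MAC address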

results:

Max msg size in RawEth is MTU 1518
Changing msg size to this MTU

                    Send BW Test
 Dual-port       : OFF          Device         : mlx5_0
 Number of qps   : 1            Transport type : IB
 Connection type : RawEth       Using SRQ      : OFF
 PCIe relax order: ON
 ibv_wr* API     : OFF
 TX depth        : 128
 CQ Moderation   : 1
 Mtu             : 1518[B]
 Link type       : Ethernet
 GID index       : 0
 Max inline data : 0[B]
 rdma_cm QPs     : OFF
 Data ex. method : Ethernet

raw ethernet header**************************************

| Dest MAC          | Src MAC           | Packet Type |
|------------------------------------------------------------|
| 88:E9:A4:33:48:B1 | 88:E9:A4:20:20:D3 | DEFAULT     |
|------------------------------------------------------------|

 #bytes     #iterations    BW peak[MB/sec]    BW average[MB/sec]   MsgRate[Mpps]
 1518       42973376       0.00               6221.15              4.297334
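Back-of-the-envelope (my own arithmetic): 4.297 Mpps x 1518 B x 8 is roughly 52 Gb/s of frame data. With 1538 bytes per frame on the wire (frame plus preamble and inter-frame gap), 100 Gb/s line rate allows roughly 8.1 Mpps at this frame size, so this looks like only about half of what the link can carry at MTU 1518.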


Is this result expected?

Any suggestions?

A typical iperf run (iperf -c 192.168.1.1 -w 2m -P 32) between two nodes shows bandwidth fluctuating between 40 and 60 Gbps, well below what we would expect.

Thanks,

Hi Paolo,

Yes, we suggest reviewing the performance guides below:

https://community.mellanox.com/s/article/getting-started-with-performance-tuning-of-mellanox-adapters

and

https://community.mellanox.com/s/article/performance-tuning-for-mellanox-adapters

Once you have tuned the system, run the tests again. If the results are still not as expected, we suggest opening a new support ticket for further investigation by emailing Networking-support@nvidia.com.
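For a quick first pass, the mlnx_tune helper that ships with MLNX_OFED can apply a tuning profile (a sketch; HIGH_THROUGHPUT is one of the tool's built-in profiles):

mlnx_tune -p HIGH_THROUGHPUT # apply the high-throughput tuning profile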

Thanks,

Samer

Thanks. We did, together with a long list of tests per your and AMD's tuning whitepapers, cross-checking them against Red Hat's and SUSE's tuning whitepapers.

At the moment the only way we can get consistently good performance is by setting the IRQ affinity as per your guide, pinning the card's interrupts to the right NUMA node (using your configuration script), AND pinning the iperf server process to the same NUMA node the card is attached to. With that, I get a consistent 95 Gbps.
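For reference, the recipe that works looks roughly like this (the interface name is a placeholder for our ConnectX-6 Dx port; the IRQ script ships with NVIDIA's mlnx-tools):

cat /sys/class/net/ens1f0/device/numa_node # -> 2, the node the NIC hangs off
set_irq_affinity_bynode.sh 2 ens1f0 # steer the NIC's IRQs to node 2
numactl --cpunodebind=2 --membind=2 iperf -s # pin the iperf server to node 2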

If the iperf server process is not pinned to anything, performance varies quite a bit, averaging around 50 Gbps. If I force it onto NUMA node 7 (the card is on node 2), it drops to 35 Gbps.

Pinning the server process to a specific set of CPUs is not a workable solution for the production environment; we need to be able to use all the cores we have.

Things tested:

- assorted kernels (5.4.xx, 5.11.xx)
- various drivers (your latest, the inbox driver)
- assorted NUMA-per-socket (NPS) settings
- all sorts of OS network stack optimizations
- all sorts of power governor settings
- various BIOS parameters
- hardware configuration changes
- etc.

What really bugs me is that if I move one of these cards to an old spare server, I immediately get excellent performance. No tuning whatsoever.

I tried opening a support ticket, but it was closed right away because I do not have a direct support contract with Mellanox; all the hardware was bought through HPE (I have a case open with them). We also have a case open with AMD.

I’m really just looking for ideas on how to triage further.

Thanks,

PP