Experiencing low performance on Mellanox ConnectX-6 Dx

After optimizing the OpenShift cluster, I ran a performance evaluation with PROX version 22.11 and found that I am unable to exceed roughly 6 Gbit/s of throughput. Testing with a 64-byte frame size, I reached a maximum of 6.99 Mpps.

I’ve attempted to address this issue by adhering to the recommendations outlined in the DPDK 22.03 NVIDIA Mellanox NIC performance report available at https://fast.dpdk.org/doc/perf/DPDK_22_03_NVIDIA_Mellanox_NIC_performance_report.pdf. However, the problem persists.
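For context, the tunings I applied from that report follow the pattern below. This is a hedged sketch, not my exact command line: the PCI address (0000:d8:00.0) and interface come from my SUT, while the core list, queue counts, and mlx5 devargs (mprq_en, rxqs_min_mprq, txq_inline_mpw) are the style of parameters the report documents and would need adapting to a given setup.

```shell
# Sketch of the host tuning + testpmd invocation style from the NVIDIA
# DPDK performance reports (assumptions: core list, queue counts, values).

# Reserve 1G hugepages for the PMD:
echo 8 > /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages

# Confirm the polling cores sit on the NIC's NUMA node:
cat /sys/bus/pci/devices/0000:d8:00.0/numa_node

# testpmd with the mlx5 devargs used in the report
# (mprq_en enables Multi-Packet RQ on the receive side):
dpdk-testpmd -l 1-7 -n 4 \
  -a 0000:d8:00.0,mprq_en=1,rxqs_min_mprq=1,txq_inline_mpw=128 \
  -- --burst=64 --mbcache=512 --rxq=6 --txq=6 --nb-cores=6 \
  --forward-mode=macswap -i
```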

Additionally, I’ve investigated packet loss at the NIC interface level and found no anomalies. The bottleneck appears to be related to packet generation, but I’m uncertain about the underlying cause.
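To be concrete about how I checked for NIC-level loss: I looked at the mlx5 extended statistics. The counter names below are the ones I understand to distinguish wire-side drops from host-side drain problems; the interface name is from my SUT.

```shell
# rx_discards_phy: frames the NIC dropped on the wire side.
# rx_out_of_buffer: host could not drain the RX queues fast enough.
ethtool -S enp216s0f0np0 | grep -E 'discard|drop|out_of_buffer'
```

All of these stayed at zero during the test runs, which is why I suspect the bottleneck is on the packet-generation side.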

I’m seeking advice or references on potential solutions. Should I consider updating the firmware or driver? Any insights or recommendations would be greatly appreciated.

Below are the SUT details:

NIC model (lspci): Mellanox Technologies MT2892 Family [ConnectX-6 Dx]

uname -r

ethtool -i enp216s0f0np0
driver: mlx5_core
version: 5.14.0-284.54.1.rt14.339.el9_2.
firmware-version: 22.35.2000 (MT_0000000359)
bus-info: 0000:d8:00.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: no
supports-register-dump: no
supports-priv-flags: yes

CPU:
Architecture:            x86_64
  CPU op-mode(s):        32-bit, 64-bit
  Address sizes:         46 bits physical, 48 bits virtual
  Byte Order:            Little Endian
CPU(s):                  104
  On-line CPU(s) list:   0-103
Vendor ID:               GenuineIntel
  BIOS Vendor ID:        Intel
  Model name:            Intel(R) Xeon(R) Gold 6230R CPU @ 2.10GHz
    BIOS Model name:     Intel(R) Xeon(R) Gold 6230R CPU @ 2.10GHz

Operating System:

cat /etc/os-release
NAME="Red Hat Enterprise Linux CoreOS"
ID_LIKE="rhel fedora"
PRETTY_NAME="Red Hat Enterprise Linux CoreOS 415.92.202402201450-0 (Plow)"
REDHAT_BUGZILLA_PRODUCT="OpenShift Container Platform"
REDHAT_SUPPORT_PRODUCT="OpenShift Container Platform"

OCP Cluster

oc version
Client Version: 4.15.0-202402070507.p0.g48dcf59.assembly.stream-48dcf59
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3


Thank you for reaching out. I have reviewed your query; please see my recommendations below.

First, even though the ConnectX-6 Dx firmware version you are running is an LTS release, I would recommend upgrading to the latest stable LTS build, 22.35.3502.
You can download it from the NVIDIA firmware download page: Firmware for ConnectX®-6 Dx | NVIDIA
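As a sketch of the upgrade procedure, assuming the NVIDIA MFT tools are installed on the host (the PCI address is taken from the details you provided; adapt as needed):

```shell
# Query the firmware currently flashed on the device:
mlxfwmanager --query -d 0000:d8:00.0

# Update to the latest available image for this device
# (the --online form fetches the image; requires internet access):
mlxfwmanager --online -u -d 0000:d8:00.0
```

After the update, a cold reboot (or an mlxfwreset, where supported) is needed for the new firmware to take effect.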

Additionally, since you are referring to the DPDK 22.03 performance report, I assume this is the DPDK version you are using.
This version is rather old (released March 17, 2022).
I recommend upgrading DPDK as well, to the latest LTS version, 23.11, which can be found here: https://core.dpdk.org/download/

Once the versions are aligned to the latest, please try to evaluate the performance again.
If the issue still persists or any other issues arise, please open a case at: enterprisesupport@nvidia.com, and it will be handled according to entitlement.

Best Regards,

Thank you for reaching out.

While upgrading the firmware to the latest stable release sounds like a viable option, I’d like to understand the rationale behind moving to the new version. Are there any reported performance issues with the current firmware/driver/kernel version I’m using? If not, is it possible to achieve optimal performance with the existing setup? If you require further information to assess this, please don’t hesitate to let me know.

Regarding the DPDK version, I’ve noticed similar tuning parameters being used across different versions, albeit with variations in the parameters passed. Based on my understanding, I believe my current version should also deliver good performance.
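One way to test that belief, suggested here as an assumption rather than a confirmed procedure: measure raw TX capacity with testpmd in txonly mode, using the same cores and queue counts as the PROX run. If testpmd can push 64-byte frames well above 6.99 Mpps on this host, the limit is likely in the generator configuration rather than the NIC, driver, or firmware. The PCI address is from my SUT; the core and queue counts are placeholders to match the generator setup.

```shell
# Raw TX baseline: generate 64-byte frames with no RX processing.
dpdk-testpmd -l 1-5 -n 4 -a 0000:d8:00.0 \
  -- --forward-mode=txonly --txpkts=64 --stats-period=1 \
  --rxq=4 --txq=4 --nb-cores=4
```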

Please advise on the next steps to proceed.