Drastically low performance on ConnectX-6 adapters with RHEL 7.9 in comparison to older adapters

tomasz.kucharski1 · November 30, 2023, 10:26am

Hello,
We have a weird situation on our HPC clusters (more than one) whenever a node in the cluster has a ConnectX-6 adapter.
Performance is very low when running a multi-node job with at least one server equipped with a ConnectX-6 adapter.
It is worth mentioning that we rely on a RHEL 7.9 image, Lenovo SR630V1 servers, a Mellanox SR6036 switch, and only Intel processors, such as:
Intel(R) Xeon(R) Gold 6226R CPU @ 2.90GHz
Intel(R) Xeon(R) Gold 6142 CPU @ 2.60GHz
We have tried several driver versions:
MLNX_OFED_LINUX-5.5-1.0.3.2-rhel7.9-x86_64
MLNX_OFED_LINUX-5.8-3.0.7.0-rhel7.9-x86_64
MLNX_OFED_LINUX-23.07-0.5.1.2-rhel7.9-x86_64
with no improvement.
Even stranger, when I replace the ConnectX-6 adapter with an old ConnectX-3 and change the drivers to 4.6 or 4.9, the node immediately delivers very good performance.
Additionally, the tests were done with the LS-DYNA and Radioss solvers using Intel MPI, Platform MPI, and Open MPI.
Has a similar issue already been identified, or maybe even resolved?
Thanks
Tomek

spruitt · November 30, 2023, 8:38pm

Hello,

Comparing a CX3 on a 4.x driver with a CX6 on 5.x and 23.x drivers is not a like-for-like comparison.

As a starting point, install our latest MLNX_OFED 23.10 and make sure the adapter firmware (FW) matches a supported version.

https://docs.nvidia.com/networking/display/mlnxofedv23100550/general+support#src-2396914931_GeneralSupport-SupportedNICFirmwareVersions

(All participating nodes will need to be aligned).
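As a rough illustration of the alignment check, here is a minimal Python sketch that groups nodes by the firmware version they report and flags a mismatch. The node names and the version strings `FW-A` and `FW-B` are hypothetical placeholders; how the versions are collected from each node is left out and would depend on your tooling.

```python
from collections import defaultdict


def group_nodes_by_firmware(fw_by_node):
    """Map each firmware version string to the sorted list of nodes reporting it."""
    groups = defaultdict(list)
    for node, fw in fw_by_node.items():
        groups[fw.strip()].append(node)
    return {fw: sorted(nodes) for fw, nodes in groups.items()}


def firmware_aligned(fw_by_node):
    """Return True when every node reports the same firmware version."""
    return len(group_nodes_by_firmware(fw_by_node)) <= 1


# Hypothetical inventory: placeholder names and versions, not real data.
inventory = {"node01": "FW-A", "node02": "FW-A", "node03": "FW-B"}

for fw, nodes in sorted(group_nodes_by_firmware(inventory).items()):
    print(fw, "->", ", ".join(nodes))
print("aligned:", firmware_aligned(inventory))
```

Any node that ends up in its own group is a candidate for a firmware update before further testing.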

Validate that the BIOS and OS have been tuned according to the basic tuning guidelines (if you search for "performance" within our community, several articles are available).

Confirm that you can reach line rate using the RDMA tools bundled with our driver (e.g. ib_write_bw, ib_read_bw, ib_send_bw).
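To make "reaching line rate" concrete, the following Python sketch compares a measured bandwidth against the nominal link speed. The 90% threshold and the sample numbers are assumptions chosen for illustration, not official acceptance criteria; the measured value would come from the bandwidth tool's report (the perftest tools can print Gb/s, for example with `--report_gbits`).

```python
def line_rate_fraction(measured_gbps, link_gbps):
    """Return measured bandwidth as a fraction of the nominal link speed."""
    if link_gbps <= 0:
        raise ValueError("link_gbps must be positive")
    return measured_gbps / link_gbps


def reaches_line_rate(measured_gbps, link_gbps, threshold=0.9):
    """True if the measurement is at least `threshold` of the link speed.

    The default threshold of 0.9 is an assumed rule of thumb.
    """
    return line_rate_fraction(measured_gbps, link_gbps) >= threshold


# Hypothetical measurements on a 100 Gb/s link (illustrative numbers only).
for measured in (96.0, 40.0):
    verdict = "OK" if reaches_line_rate(measured, 100.0) else "below line rate"
    print(f"{measured:.1f} Gb/s of 100.0 Gb/s: {verdict}")
```

A result well below the link speed at this stage points to the fabric, firmware, or host tuning rather than to the MPI layer or the solver.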

If line rate is reached, use the OSU benchmarks shipped with our MLNX_OFED OpenMPI to evaluate performance (benchmarks location: /usr/mpi/gcc/openmpi-xxxxx/tests/osu-micro-benchmarks-xx).

Lastly, you have the option to open a support case if you have a contract in place with NVIDIA, and we will gladly assist you further.

https://enterprise-support.nvidia.com/s/article/NVIDIA-Enterprise-Support-Guide-for-New-Users

Sophie.
