We have a strange situation on our HPC clusters (more than one): performance is very low whenever a multi-node job includes at least one server equipped with a ConnectX-6 adapter.
It is worth mentioning that we rely on a RHEL 7.9 image, Lenovo SR630 V1 servers, a Mellanox SX6036 switch, and only Intel processors such as:
Intel(R) Xeon(R) Gold 6226R CPU @ 2.90GHz
Intel(R) Xeon(R) Gold 6142 CPU @ 2.60GHz
We have tried many different drivers, with no improvement.
Even more strangely, when I replace the ConnectX-6 adapter with an old ConnectX-3 and change the driver to 4.6 or 4.9, the node immediately delivers very good performance.
Additionally, the tests were run with the LS-DYNA and Radioss solvers using Intel MPI, Platform MPI, and Open MPI.
Has a similar issue already been identified, or perhaps even resolved?
Comparing a CX3 on a 4.x driver against a CX6 on 5.x/23.x drivers is not a valid reference point.
As a starting point, install our latest MLNX_OFED 23.10 and make sure the firmware is aligned with a supported version (all participating nodes must be aligned).
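A quick way to verify driver and firmware alignment across nodes is a few one-liners on each host. This is a sketch only: the device name mlx5_0 and the mst device path are assumptions, so substitute whatever ibstat and "mst status" report on your system.

```shell
# Report the installed MLNX_OFED version (every node should print the same string)
ofed_info -s

# Show the adapter's firmware version; mlx5_0 is an assumed device name,
# list your actual devices with plain "ibstat" first
ibstat mlx5_0 | grep -i "firmware"

# Alternatively, query firmware and PSID via the mst tools shipped with MLNX_OFED
mst start
flint -d /dev/mst/mt4123_pciconf0 query   # device path is illustrative; check "mst status"
```

Running the ofed_info line across all nodes (e.g., via pdsh or clush) makes version mismatches obvious at a glance.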
Validate that the BIOS and OS have been tuned according to the basic tuning deployment guidelines (searching for "performance" within our community turns up several articles).
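Two common OS-side checks from those tuning guides can be scripted as below. This is a hedged sketch, not the full tuning procedure: it assumes the cpufreq sysfs interface is available, and the mlnx_tune profile name is the one shipped with recent MLNX_OFED releases, so confirm it with "mlnx_tune -h" on your install.

```shell
# Pin the CPU frequency governor to "performance" on all cores
# (assumes intel_pstate or acpi-cpufreq exposes scaling_governor in sysfs)
for g in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do
    echo performance > "$g"
done

# mlnx_tune ships with MLNX_OFED; run it with no arguments first to get a
# report of detected issues, then optionally apply a profile
mlnx_tune
mlnx_tune -p HIGH_THROUGHPUT   # profile name assumed; list profiles with mlnx_tune -h
```

BIOS-side items (C-states, turbo, NUMA/sub-NUMA settings, PCIe settings) still need to be checked in the server setup utility; the script above only covers the OS layer.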
Confirm that you can reach line rate using the RDMA tools embedded in our driver (e.g., ib_write_bw, ib_read_bw, ib_send_bw).
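A minimal two-node line-rate check with ib_write_bw looks like the following; the device name mlx5_0 and the hostname node01 are placeholders for your own values.

```shell
# On the server node: listen for one bandwidth test
# -d selects the RDMA device, -a sweeps all message sizes, --report_gbits prints Gb/s
ib_write_bw -d mlx5_0 -a --report_gbits

# On the client node: connect to the server by hostname or IP
ib_write_bw -d mlx5_0 -a --report_gbits node01
```

For large message sizes a healthy link should report bandwidth close to the adapter's nominal speed; a result far below that points at the fabric/driver layer rather than the MPI or solver layer.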
If line rate is reached, use the OSU benchmarks shipped with the MLNX_OFED Open MPI build to evaluate performance (benchmark location: /usr/mpi/gcc/openmpi-xxxxx/tests/osu-micro-benchmarks-xx).
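A typical invocation of those OSU benchmarks between two nodes is sketched below; fill in the version placeholders (xxxxx/xx) from your actual install path, and replace node01/node02 with your hostnames.

```shell
# Use the Open MPI build that ships with MLNX_OFED
# (fill in the exact versions installed on your system)
MPI_HOME=/usr/mpi/gcc/openmpi-xxxxx
export PATH=$MPI_HOME/bin:$PATH

# Point-to-point bandwidth: 2 ranks, one per node, so traffic crosses the fabric
mpirun -np 2 --host node01,node02 \
    $MPI_HOME/tests/osu-micro-benchmarks-xx/osu_bw

# Point-to-point latency with the same placement
mpirun -np 2 --host node01,node02 \
    $MPI_HOME/tests/osu-micro-benchmarks-xx/osu_latency
```

If osu_bw tracks the ib_write_bw result but the solvers are still slow, the problem is more likely in the MPI/solver configuration than in the adapter or driver.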
Lastly, if you have a support contract in place with NVIDIA, you have the option to open a support case and we will gladly assist you further.