Building openMPI with UCX - General Advice

Hi

I am running into some issues compiling Open MPI and am looking for general advice on the setup described below.

The fabric I am working with has 3 Xeon 28-core workstations, each with a Mellanox ConnectX-3 VPI (MCX354A-FCBT) NIC, connected through a Mellanox SX6005 switch.

I am trying to build Open MPI so I can compile codes that use the hardware resources (CPU, RAM) of all 3 workstations.

All machines have ConnectX-3 cards running the latest firmware, 2.42.5000.

I want to use the Intel compilers (Intel oneAPI 2022.0.1) to compile Open MPI and my own codes.

I have “successfully” installed the MLNX_OFED_LINUX-4.9-4.1.7.0-ubuntu20.04-x86_64 drivers on all nodes (Linux 5.4.0-26-generic, Ubuntu 20.04).

I installed it with:

./mlnxofedinstall --force

Maybe this is the root of all my issues. I will try installing via apt-get with the mlnx-ofed-all package instead.
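If I go down that route, I expect the steps to look roughly like the following (the DEBS path is illustrative and would point at the directory inside the extracted MLNX_OFED tarball; mlnx-ofed-all is the metapackage mentioned above):

# point apt at the DEBS directory shipped inside the extracted MLNX_OFED tarball (path illustrative)
echo "deb [trusted=yes] file:/path/to/MLNX_OFED_LINUX-4.9-4.1.7.0-ubuntu20.04-x86_64/DEBS ./" | sudo tee /etc/apt/sources.list.d/mlnx_ofed.list
sudo apt-get update
sudo apt-get install mlnx-ofed-all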

1-) The HPC-X 2.8.1 overview (https://docs.nvidia.com/networking/display/HPCXv281/HPC-X+Overview) lists the following as requirements:

OFED / MLNX_OFED: OFED 1.5.3 and later, and MLNX_OFED 4.7-x.x.x.x and later.

In the HPC-X download center for HPCX-2.8.1 I see the options below.

hpcx" data-fileid="0691T00000GSSRyQAP

You can see in the picture I attached that there is no MLNX_OFED 4.9 listed for HPC-X 2.8.1 as suggested in the HPC-X 2.8.1 overview.

Is there such an HPC-X version? If not, which one of the available options should I use? I am currently using hpcx-v2.8.1-gcc-MLNX_OFED_LINUX-5.2-2.2.0.0-ubuntu20.04-x86_64.

2-) Options in OpenMPI compilation

--with-ucx=<dir>: Build support for the UCX library.

--with-mxm=<dir>: Build support for the Mellanox Messaging (MXM) library (starting with the v1.5 series).

--with-verbs=<dir>: Build support for OpenFabrics verbs (previously known as “Open IB”, for InfiniBand and iWARP networks).

https://www.open-mpi.org/faq/?category=openfabrics says:

“In the v4.0.x series, Mellanox InfiniBand devices default to the ucx PML. The use of InfiniBand over the openib BTL is officially deprecated in the v4.0.x series, and is scheduled to be removed in Open MPI v5.0.0.”

Also:

  1. Does Open MPI support MXM?

MXM support is currently deprecated and replaced by UCX.

Do I use all the options above, or can I configure the Open MPI build with just --with-ucx? I compiled the UCX 1.10 that is in hpcx-v2.8.1-gcc-MLNX_OFED_LINUX-5.2-2.2.0.0-ubuntu20.04-x86_64/sources/ucx-1.10.0.tar.gz - I have attached the config log for that build (ucx_1_10_config.log).
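For reference, the UCX build from that tarball was along these lines (a minimal sketch; the exact flags I used are in the attached config log, and the install prefix is illustrative):

tar xzf ucx-1.10.0.tar.gz && cd ucx-1.10.0
./configure --prefix=$PWD/build   # install UCX into a local build/ directory
make -j && make install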

3-) How about --with-fca?

open-mpi.org says: “You can find more information about FCA on the product web page. FCA is available for download here: http://www.mellanox.com/products/fca” but I was not able to find anything on that link.

In any case, I have tried to configure and compile the Open MPI source in hpcx-v2.8.1-gcc-MLNX_OFED_LINUX-5.2-2.2.0.0-ubuntu20.04-x86_64/sources/openmpi-gitclone.tar.gz:

./configure CC=icc CXX=icpc F77=ifort FC=ifort --prefix=${HPCX_HOME}/ompi-icc \
--with-ucx=/home/baird/infiniband/hpcx-v2.8.1-gcc-MLNX_OFED_LINUX-5.2-2.2.0.0-ubuntu20.04-x86_64/ucx-1.10.0/build \
--with-platform=contrib/platform/mellanox/optimized \
2>&1 | tee config-icc-output.log

make all

sudo make install

I have attached my config output log (config-icc-output.log)

I am receiving an error at make install as follows:


/usr/bin/mkdir -p ‘/ompi-icc/share/openmpi’

/usr/bin/install -c -m 644 help-mpi-common-sm.txt ‘/ompi-icc/share/openmpi’

/usr/bin/mkdir -p ‘/ompi-icc/include/openmpi/opal/mca/common/sm’

/usr/bin/install -c -m 644 common_sm.h common_sm_mpool.h ‘/ompi-icc/include/openmpi/opal/mca/common/sm’

make[3]: Leaving directory ‘/home/baird/infiniband/hpcx-v2.8.1-gcc-MLNX_OFED_LINUX-5.2-2.2.0.0-ubuntu20.04-x86_64/openmpi-gitclone/opal/mca/common/sm’

make[2]: Leaving directory ‘/home/baird/infiniband/hpcx-v2.8.1-gcc-MLNX_OFED_LINUX-5.2-2.2.0.0-ubuntu20.04-x86_64/openmpi-gitclone/opal/mca/common/sm’

Making install in mca/common/ucx

make[2]: Entering directory ‘/home/baird/infiniband/hpcx-v2.8.1-gcc-MLNX_OFED_LINUX-5.2-2.2.0.0-ubuntu20.04-x86_64/openmpi-gitclone/opal/mca/common/ucx’

LN_S libmca_common_ucx.la

make[3]: Entering directory ‘/home/baird/infiniband/hpcx-v2.8.1-gcc-MLNX_OFED_LINUX-5.2-2.2.0.0-ubuntu20.04-x86_64/openmpi-gitclone/opal/mca/common/ucx’

/usr/bin/mkdir -p ‘/ompi-icc/lib’

/bin/bash ../../../../libtool --mode=install /usr/bin/install -c libmca_common_ucx.la ‘/ompi-icc/lib’

libtool: warning: relinking ‘libmca_common_ucx.la’

libtool: install: (cd /home/baird/infiniband/hpcx-v2.8.1-gcc-MLNX_OFED_LINUX-5.2-2.2.0.0-ubuntu20.04-x86_64/openmpi-gitclone/opal/mca/common/ucx; /bin/bash “/home/baird/infiniband/hpcx-v2.8.1-gcc-MLNX_OFED_LINUX-5.2-2.2.0.0-ubuntu20.04-x86_64/openmpi-gitclone/libtool” --silent --tag CC --mode=relink icc -DNDEBUG -O3 -g -finline-functions -fno-strict-aliasing -restrict -Qoption,cpp,–extended_float_types -pthread -version-info 70:0:30 -L/home/baird/infiniband/hpcx-v2.8.1-gcc-MLNX_OFED_LINUX-5.2-2.2.0.0-ubuntu20.04-x86_64/ucx-1.10.0/build/lib -o libmca_common_ucx.la -rpath /ompi-icc/lib libmca_common_ucx_la-common_ucx.lo -lucp -luct -lucm -lucs /home/baird/infiniband/hpcx-v2.8.1-gcc-MLNX_OFED_LINUX-5.2-2.2.0.0-ubuntu20.04-x86_64/openmpi-gitclone/opal/libopen-pal.la /home/baird/infiniband/hpcx-v2.8.1-gcc-MLNX_OFED_LINUX-5.2-2.2.0.0-ubuntu20.04-x86_64/openmpi-gitclone/opal/mca/memory/libmca_memory.la -lrt -lz )

/home/baird/infiniband/hpcx-v2.8.1-gcc-MLNX_OFED_LINUX-5.2-2.2.0.0-ubuntu20.04-x86_64/openmpi-gitclone/libtool: line 10657: icc: command not found

libtool: error: error: relink ‘libmca_common_ucx.la’ with the above command before installing it

make[3]: *** [Makefile:1875: install-libLTLIBRARIES] Error 1

make[3]: Leaving directory ‘/home/baird/infiniband/hpcx-v2.8.1-gcc-MLNX_OFED_LINUX-5.2-2.2.0.0-ubuntu20.04-x86_64/openmpi-gitclone/opal/mca/common/ucx’

make[2]: *** [Makefile:2103: install-am] Error 2

make[2]: Leaving directory ‘/home/baird/infiniband/hpcx-v2.8.1-gcc-MLNX_OFED_LINUX-5.2-2.2.0.0-ubuntu20.04-x86_64/openmpi-gitclone/opal/mca/common/ucx’

make[1]: *** [Makefile:2420: install-recursive] Error 1

make[1]: Leaving directory ‘/home/baird/infiniband/hpcx-v2.8.1-gcc-MLNX_OFED_LINUX-5.2-2.2.0.0-ubuntu20.04-x86_64/openmpi-gitclone/opal’

make: *** [Makefile:1902: install-recursive] Error 1

Any help would be greatly appreciated, especially any advice on setting up the InfiniBand fabric I described above!

Cheers

Onur

config-icc-output.log (207 KB)

ucx_1_10_config.log (467 KB)

I have solved this problem by adding a .conf file to /etc/ld.so.conf.d/.

I just added the Intel compiler library folder to that conf file: /opt/intel/oneapi/compiler/2022.0.1/linux/compiler/lib/intel64_lin

and then ran ldconfig.
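For anyone hitting the same thing, the fix was roughly (the conf file name is arbitrary; the path must match your oneAPI install):

echo "/opt/intel/oneapi/compiler/2022.0.1/linux/compiler/lib/intel64_lin" | sudo tee /etc/ld.so.conf.d/intel-oneapi.conf
sudo ldconfig   # rebuild the shared-library cache so the Intel runtime libraries are found system-wide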

Hello,

1. There is no HPC-X package for the MLNX_OFED 4.9 branch. We support HPC-X only for the MLNX_OFED 5.x branch.

If there are ConnectX-3 cards in the system and you would like to use HPC-X, you can use the MLNX_OFED 5.0 branch (the latest MLNX_OFED that supports ConnectX-3) and then use the proper HPC-X 2.8.1 package for MLNX_OFED 5.0:

https://developer.nvidia.com/networking/hpc-x

under ARCHIVE VERSIONS → 2.8.1 → MLNX_OFED → 5.0-1.0.0.0

2. To use UCX you only need --with-ucx. MXM is deprecated and UCX replaces it.

3. FCA is deprecated; HCOLL replaces it.
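So a minimal configure would be something like the following (the paths are placeholders for your actual install locations; --with-hcoll is optional and only needed if you want the HCOLL collectives shipped with HPC-X):

./configure --prefix=<install-dir> \
    --with-ucx=<path-to-ucx-install> \
    --with-hcoll=<path-to-hcoll-install>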

Best Regards,

Viki

Hi Viki,

The reason I was using MLNX_OFED 4.9-4.1.7.0 LTS is that it is the latest released MLNX_OFED that supports mlx4 on Ubuntu 20.04, and it supports a newer Ubuntu kernel than the MLNX_OFED 5.0 versions do. Here are the release dates of each relevant release:

- 5.0-1.0.0.0 (released March 3, 2020): supports mlx4, UCX 1.8; Ubuntu 20.04 (beta) x86_64, kernel 5.4.0-12-generic

- 5.0-2.1.8.0 (released April 6, 2020): supports mlx4, UCX 1.8; Ubuntu 20.04 (beta) x86_64, kernel 5.4.0-18-generic

- 4.9-4.1.7.0 (released December 2021): supports mlx4, UCX 1.8; Ubuntu 20.04 x86_64, kernel 5.4.0-26-generic

So in my situation I used the latest release that supports my mlx4 cards; the MLNX_OFED 5.0-x releases support an older kernel and list Ubuntu 20.04 as beta. Can you please confirm that I need to be using 5.0-1? Maybe I should at least use 5.0-2.1.8.0, which seems to be an update over 5.0-1.0.0.0?

Hi Viki,

update:

I have now installed the following on all 3 of my nodes (Ubuntu 20.04, kernel 5.4.0-26-generic):

MLNX_OFED_LINUX-5.0-2.1.8.0-ubuntu20.04-x86_64

and

hpcx-v2.8.1-gcc-MLNX_OFED_LINUX-5.0-1.0.0.0-ubuntu18.04-x86_64

The MPI tests work fine on each individual node. My current issue is being able to run mpirun from my main node (with opensm running on that node) and use the CPUs in the other nodes.

Example:

I am on oak-rd0-linux (main node), opensm is running, ibdiagnet does not report any warnings or errors, and I am trying to test using the CPUs on oak-rd1-linux (host 1) and oak-rd2-linux (host 2) with:

mpirun -x LD_LIBRARY_PATH -np 2 -H oak-rd1-linux,oak-rd2-linux $HPCX_MPI_TESTS_DIR/examples/hello_c

Nothing happens - it seems to hang, and I am not sure where to go from here. What am I doing wrong at this step, and what can I check to identify the problem?
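To get more information, I can also try something along these lines (forcing the UCX PML and raising verbosity is just my attempt at debugging, not something taken from the HPC-X docs):

# force the UCX PML and ask Open MPI / UCX for more output
mpirun -np 2 -H oak-rd1-linux,oak-rd2-linux \
    -x LD_LIBRARY_PATH -x UCX_LOG_LEVEL=info \
    --mca pml ucx --mca pml_base_verbose 10 \
    $HPCX_MPI_TESTS_DIR/examples/hello_c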

sudo ibnetdiscover output:

Topology file: generated on Tue Jan 25 16:31:42 2022

Initiated from node 0010e00001885688 port 0010e0000188568a

vendid=0x2c9

devid=0xc738

sysimgguid=0xe41d2d0300b39ee0

switchguid=0xe41d2d0300b39ee0(e41d2d0300b39ee0)

Switch 12 “S-e41d2d0300b39ee0” # “SwitchX - Mellanox Technologies” base port 0 lid 3 lmc 0

[1] "H-0010e00001885688"2 # “oak-rd0-linux HCA-1” lid 1 4xQDR

[2] "H-0010e000018d08e0"1 # “oak-rd1-linux HCA-1” lid 4 4xQDR

[3] "H-0010e00001885908"1 # “oak-rd2-linux HCA-1” lid 2 4xQDR

vendid=0x2c9

devid=0x1003

sysimgguid=0x10e0000188590b

caguid=0x10e00001885908

Ca 2 “H-0010e00001885908” # “oak-rd2-linux HCA-1”

1 “S-e41d2d0300b39ee0”[3] # lid 2 lmc 0 “SwitchX - Mellanox Technologies” lid 3 4xQDR

vendid=0x2c9

devid=0x1003

sysimgguid=0x10e000018d08e3

caguid=0x10e000018d08e0

Ca 2 “H-0010e000018d08e0” # “oak-rd1-linux HCA-1”

1 “S-e41d2d0300b39ee0”[2] # lid 4 lmc 0 “SwitchX - Mellanox Technologies” lid 3 4xQDR

vendid=0x2c9

devid=0x1003

sysimgguid=0x10e0000188568b

caguid=0x10e00001885688

Ca 2 “H-0010e00001885688” # “oak-rd0-linux HCA-1”

2 “S-e41d2d0300b39ee0”[1] # lid 1 lmc 0 “SwitchX - Mellanox Technologies” lid 3 4xQDR