Hi
I am running into some issues compiling Open MPI and am looking for general advice on the setup described below.
The fabric I am working with has 3 Xeon 28-core workstations, each housing a Mellanox ConnectX-3 VPI MCX354A-FCBT NIC, connected through a Mellanox SX6005 switch.
I am trying to build Open MPI so I can compile codes that will use the hardware resources (CPU, RAM) of all 3 workstations.
On all machines, the ConnectX-3 cards have the latest firmware, 2.42.5000.
I want to use the Intel compilers (Intel oneAPI 2022.01) to compile Open MPI and my own codes.
I have “successfully” installed the MLNX_OFED_LINUX-4.9-4.1.7.0-ubuntu20.04-x86_64 drivers on all nodes (Linux 5.4.0-26-generic, Ubuntu 20.04), using ./mlnxofedinstall --force.
Maybe this is the root of all the issues. I will try installing with apt-get and the mlnx-ofed-all option instead.
1-) The HPC-X 2.8.1 overview (https://docs.nvidia.com/networking/display/HPCXv281/HPC-X+Overview) lists the following as requirements:
OFED / MLNX_OFED: OFED 1.5.3 and later, and MLNX_OFED 4.7-x.x.x.x and later.
In the HPC-X download center for HPCX-2.8.1 I see the options below.
[Attached screenshot: HPC-X 2.8.1 download options]
You can see in the attached picture that no MLNX_OFED 4.9 build is listed for HPC-X 2.8.1, even though the HPC-X 2.8.1 overview suggests one should exist.
Is there such an HPC-X version? If not, which of the available options should I use? I am currently using hpcx-v2.8.1-gcc-MLNX_OFED_LINUX-5.2-2.2.0.0-ubuntu20.04-x86_64.
2-) Options in Open MPI compilation
--with-ucx=: Build support for the UCX library.
--with-mxm=: Build support for the Mellanox Messaging (MXM) library (starting with the v1.5 series).
--with-verbs=: Build support for OpenFabrics verbs (previously known as “Open IB”, for Infiniband and iWARP networks).
https://www.open-mpi.org/faq/?category=openfabrics says:
“In the v4.0.x series, Mellanox InfiniBand devices default to the ucx PML. The use of InfiniBand over the openib BTL is officially deprecated in the v4.0.x series, and is scheduled to be removed in Open MPI v5.0.0.”
Also:
- Does Open MPI support MXM?
MXM support is currently deprecated and replaced by UCX.
Do I need all of the options above, or can I configure the Open MPI build with just --with-ucx? I compiled UCX 1.10 from hpcx-v2.8.1-gcc-MLNX_OFED_LINUX-5.2-2.2.0.0-ubuntu20.04-x86_64/sources/ucx-1.10.0.tar.gz and have attached the config log for that build (ucx_1_10_config.log).
3-) How about --with-fca?
open-mpi.org says: “You can find more information about FCA on the product web page. FCA is available for download here: http://www.mellanox.com/products/fca” but I was not able to find anything on that link.
In any case, I have tried to configure and compile Open MPI from hpcx-v2.8.1-gcc-MLNX_OFED_LINUX-5.2-2.2.0.0-ubuntu20.04-x86_64/sources/openmpi-gitclone.tar.gz as follows:
./configure CC=icc CXX=icpc F77=ifort FC=ifort --prefix=${HPCX_HOME}/ompi-icc \
--with-ucx=/home/baird/infiniband/hpcx-v2.8.1-gcc-MLNX_OFED_LINUX-5.2-2.2.0.0-ubuntu20.04-x86_64/ucx-1.10.0/build \
--with-platform=contrib/platform/mellanox/optimized \
2>&1 | tee config-icc-output.log
make all
sudo make install
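One thing that may be worth checking before running configure: the install log further down shows files going to ‘/ompi-icc’, which is what --prefix=${HPCX_HOME}/ompi-icc expands to when HPCX_HOME is empty. A minimal sketch of a guard for that case follows; the helper name make_prefix is hypothetical, and it is an assumption (not confirmed by the logs) that HPCX_HOME was unset in the shell that ran configure.

```shell
#!/bin/sh
# Hypothetical helper: build the --prefix value from a base directory,
# refusing an empty base so the prefix cannot silently become /ompi-icc.
make_prefix() {
    base="$1"
    if [ -z "$base" ]; then
        echo "error: base directory is empty; set HPCX_HOME first" >&2
        return 1
    fi
    # Strip one trailing slash so the result has no double slash.
    printf '%s/ompi-icc\n' "${base%/}"
}

# Report what configure would receive in the current shell.
if prefix=$(make_prefix "${HPCX_HOME:-}" 2>/dev/null); then
    echo "configure would use --prefix=$prefix"
else
    echo "HPCX_HOME is empty; the prefix would expand to /ompi-icc"
fi
```

If the second message is printed, exporting HPCX_HOME (for example, to the unpacked HPC-X directory) before running configure should give the intended prefix.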
I have attached my config output log (config-icc-output.log)
I am receiving an error at make install as follows:
/usr/bin/mkdir -p ‘/ompi-icc/share/openmpi’
/usr/bin/install -c -m 644 help-mpi-common-sm.txt ‘/ompi-icc/share/openmpi’
/usr/bin/mkdir -p ‘/ompi-icc/include/openmpi/opal/mca/common/sm’
/usr/bin/install -c -m 644 common_sm.h common_sm_mpool.h ‘/ompi-icc/include/openmpi/opal/mca/common/sm’
make[3]: Leaving directory ‘/home/baird/infiniband/hpcx-v2.8.1-gcc-MLNX_OFED_LINUX-5.2-2.2.0.0-ubuntu20.04-x86_64/openmpi-gitclone/opal/mca/common/sm’
make[2]: Leaving directory ‘/home/baird/infiniband/hpcx-v2.8.1-gcc-MLNX_OFED_LINUX-5.2-2.2.0.0-ubuntu20.04-x86_64/openmpi-gitclone/opal/mca/common/sm’
Making install in mca/common/ucx
make[2]: Entering directory ‘/home/baird/infiniband/hpcx-v2.8.1-gcc-MLNX_OFED_LINUX-5.2-2.2.0.0-ubuntu20.04-x86_64/openmpi-gitclone/opal/mca/common/ucx’
LN_S libmca_common_ucx.la
make[3]: Entering directory ‘/home/baird/infiniband/hpcx-v2.8.1-gcc-MLNX_OFED_LINUX-5.2-2.2.0.0-ubuntu20.04-x86_64/openmpi-gitclone/opal/mca/common/ucx’
/usr/bin/mkdir -p ‘/ompi-icc/lib’
/bin/bash ../../../../libtool --mode=install /usr/bin/install -c libmca_common_ucx.la ‘/ompi-icc/lib’
libtool: warning: relinking ‘libmca_common_ucx.la’
libtool: install: (cd /home/baird/infiniband/hpcx-v2.8.1-gcc-MLNX_OFED_LINUX-5.2-2.2.0.0-ubuntu20.04-x86_64/openmpi-gitclone/opal/mca/common/ucx; /bin/bash “/home/baird/infiniband/hpcx-v2.8.1-gcc-MLNX_OFED_LINUX-5.2-2.2.0.0-ubuntu20.04-x86_64/openmpi-gitclone/libtool” --silent --tag CC --mode=relink icc -DNDEBUG -O3 -g -finline-functions -fno-strict-aliasing -restrict -Qoption,cpp,--extended_float_types -pthread -version-info 70:0:30 -L/home/baird/infiniband/hpcx-v2.8.1-gcc-MLNX_OFED_LINUX-5.2-2.2.0.0-ubuntu20.04-x86_64/ucx-1.10.0/build/lib -o libmca_common_ucx.la -rpath /ompi-icc/lib libmca_common_ucx_la-common_ucx.lo -lucp -luct -lucm -lucs /home/baird/infiniband/hpcx-v2.8.1-gcc-MLNX_OFED_LINUX-5.2-2.2.0.0-ubuntu20.04-x86_64/openmpi-gitclone/opal/libopen-pal.la /home/baird/infiniband/hpcx-v2.8.1-gcc-MLNX_OFED_LINUX-5.2-2.2.0.0-ubuntu20.04-x86_64/openmpi-gitclone/opal/mca/memory/libmca_memory.la -lrt -lz )
/home/baird/infiniband/hpcx-v2.8.1-gcc-MLNX_OFED_LINUX-5.2-2.2.0.0-ubuntu20.04-x86_64/openmpi-gitclone/libtool: line 10657: icc: command not found
libtool: error: error: relink ‘libmca_common_ucx.la’ with the above command before installing it
make[3]: *** [Makefile:1875: install-libLTLIBRARIES] Error 1
make[3]: Leaving directory ‘/home/baird/infiniband/hpcx-v2.8.1-gcc-MLNX_OFED_LINUX-5.2-2.2.0.0-ubuntu20.04-x86_64/openmpi-gitclone/opal/mca/common/ucx’
make[2]: *** [Makefile:2103: install-am] Error 2
make[2]: Leaving directory ‘/home/baird/infiniband/hpcx-v2.8.1-gcc-MLNX_OFED_LINUX-5.2-2.2.0.0-ubuntu20.04-x86_64/openmpi-gitclone/opal/mca/common/ucx’
make[1]: *** [Makefile:2420: install-recursive] Error 1
make[1]: Leaving directory ‘/home/baird/infiniband/hpcx-v2.8.1-gcc-MLNX_OFED_LINUX-5.2-2.2.0.0-ubuntu20.04-x86_64/openmpi-gitclone/opal’
make: *** [Makefile:1902: install-recursive] Error 1
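A possible reading of the log above: the relink step during make install calls icc again, and sudo usually replaces PATH with its own secure_path, so compilers that the oneAPI environment script added to the user's PATH may not be visible to root. The sketch below checks that, under the assumption that the machines use Ubuntu's default secure_path; the helper name tool_on_path is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: succeed if a command resolves on the given search path.
# The subshell body keeps the PATH change from leaking into the caller.
tool_on_path() (
    PATH="$2"
    command -v "$1" >/dev/null 2>&1
)

# Assumed root search path: Ubuntu's default sudo secure_path.
# Check /etc/sudoers on the actual machines; this value is an assumption.
sudo_path="/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"

for tool in icc icpc ifort; do
    if tool_on_path "$tool" "$sudo_path"; then
        echo "$tool: found on the assumed sudo PATH"
    else
        echo "$tool: not found on the assumed sudo PATH"
    fi
done

# Two commonly used workarounds (not run here):
#   1. Install into a prefix the build user can write to, so a plain
#      "make install" works without sudo.
#   2. Pass the current PATH through: sudo env "PATH=$PATH" make install
```

If the compilers are reported as not found, that would be consistent with the "icc: command not found" line, though it does not rule out other causes.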
Any help would be greatly appreciated, especially any advice on setting up the InfiniBand fabric I have described above!
Cheers
Onur
config-icc-output.log (207 KB)
ucx_1_10_config.log (467 KB)