Bug in MPI installation in HPC SDK 20.11?

When I try a simple mpif90 compilation with the MPI bundled with the NVIDIA HPC SDK 20.11, I get:

/opt/nvidia-20.11/hpc_sdk/Linux_x86_64/20.11/comm_libs/openmpi/openmpi-3.1.5/bin/.bin/mpif90: error while loading shared libraries: librdmacm.so.1: cannot open shared object file: No such file or directory

In /opt/nvidia-20.11/hpc_sdk/Linux_x86_64/20.11/comm_libs/mpi/lib, ‘ll | wc’ responds with 58 lines and librdmacm* is not there.

In the installation tarball in install_components/Linux_x86_64/20.11/comm_libs/mpi/lib, ‘ll | wc’ responds with 64 lines and librdmacm* (and libibverbs*) are there.

When I copy the missing files to the lib directory of MPI from the tarball by hand, mpif90 can compile my source.
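For anyone wanting to reproduce the comparison, here is a minimal sketch of how the two lib directories could be diffed by file name. The function name `missing_libs` is made up for illustration, and the directories in the usage comment are the ones quoted in this post; adjust them to your own unpack and install locations.

```shell
# List files present in the first directory but absent from the second.
# Usage (paths from the post above, relative to the unpacked tarball):
#   missing_libs install_components/Linux_x86_64/20.11/comm_libs/mpi/lib \
#                /opt/nvidia-20.11/hpc_sdk/Linux_x86_64/20.11/comm_libs/mpi/lib
missing_libs() {
    for f in "$1"/*; do
        # Skip the literal pattern when the source directory is empty or missing.
        [ -e "$f" ] || [ -L "$f" ] || continue
        name=$(basename "$f")
        # -e follows symlinks; -L also counts a dangling symlink as present.
        if [ ! -e "$2/$name" ] && [ ! -L "$2/$name" ]; then
            echo "$name"
        fi
    done
}
```

Anything it prints (librdmacm* and libibverbs* in my case) is a candidate to copy over, ideally with `cp -a` so that symlinks between the versioned .so files are preserved.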

On a machine with an Intel® Core™ i9-9920X CPU @ 3.50GHz (i9 Skylake-X generation), mpif90 failed differently when the lib directory was incomplete:

/opt/nvidia-20.11/hpc_sdk/Linux_x86_64/20.11/comm_libs/openmpi/openmpi-3.1.5/lib/libopen-pal.so.40: undefined reference to rdma_get_src_port@RDMACM_1.0
/opt/nvidia-20.11/hpc_sdk/Linux_x86_64/20.11/comm_libs/openmpi/openmpi-3.1.5/lib/libopen-pal.so.40: undefined reference to rdma_get_dst_port@RDMACM_1.0

These errors also disappeared after I copied the missing libraries in by hand.

Did I miss something during the installation, or did I make some other mistake?

Hi lahan,

Apologies for the late response. The person I needed to ask about this was on vacation until today.

It does appear that we missed shipping librdmacm with the 20.11 package, so we have opened a problem report (TPR #29418) to get it resolved.

In the meantime, you can install OFED separately from https://downloads.openfabrics.org/OFED/ which will include these libraries.

-Mat

Hi,

I think I have a similar issue because when I try to compile with mpif90 I get:

/opt/nvidia/hpc_sdk/Linux_x86_64/20.11/comm_libs/openmpi4/openmpi-4.0.5/bin/.bin/mpif90: error while loading shared libraries: libucp.so.0: cannot open shared object file: No such file or directory

Or:

/opt/nvidia/hpc_sdk/Linux_x86_64/20.11/comm_libs/openmpi/openmpi-3.1.5/bin/.bin/mpif90: error while loading shared libraries: libnvcpumath.so: cannot open shared object file: No such file or directory

Can you help me? Thank you

Anne

Hi Anne,

Did you install OFED from the link above? libucp.so should be part of that package.

Note that if you did install OFED but installed it in a non-standard directory, you may need to set your environment’s LD_LIBRARY_PATH to include the installation directory of the libraries.

For “libnvcpumath.so”, this library is located with the compilers in the “/opt/nvidia/hpc_sdk/Linux_x86_64/20.11/compilers/lib” directory. Please add this path to your LD_LIBRARY_PATH.
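As a rough sketch, this could look like the following in a POSIX shell. The variable name `NVCOMPILERS_LIB` is only a placeholder for this example, and the path assumes the default install location mentioned above.

```shell
# Directory holding libnvcpumath.so (path from this post; adjust if the SDK
# is installed elsewhere). NVCOMPILERS_LIB is an illustrative name only.
NVCOMPILERS_LIB=/opt/nvidia/hpc_sdk/Linux_x86_64/20.11/compilers/lib

# Prepend it, keeping any existing value and avoiding a stray trailing colon
# when LD_LIBRARY_PATH was previously empty or unset.
export LD_LIBRARY_PATH="$NVCOMPILERS_LIB${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
```

Putting these lines in your shell startup file (for example ~/.bashrc) makes the setting persist across sessions. The same pattern applies to an OFED install in a non-standard directory.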

-Mat

Sorry for the late response. Indeed, I forgot to add this path to my LD_LIBRARY_PATH… It works now, thank you!

Anne