Open MPI 3.1.5 mpirun bundled with HPC SDK 20.11 does not run between nodes

With the hostfile

geof70 slots=1
geof30 slots=1

I can run the system default Open MPI,

/usr/bin/mpirun -np 2 -hostfile hostfile hostname

with correct output:

geof70
geof30

I can also run the MPI bundled with PGI compilers successfully:

/opt/pgi/linux86-64/19.10/mpi/openmpi-3.1.3/bin/mpirun -np 2 -hostfile hostfile hostname

On a single node, I can also run the MPI bundled with the NVIDIA HPC SDK:

/opt/nvidia-20.11/hpc_sdk/Linux_x86_64/20.11/comm_libs/mpi/bin/mpirun -np 2 hostname

However, the same mpirun fails as soon as it has to start processes on another machine:

/opt/nvidia-20.11/hpc_sdk/Linux_x86_64/20.11/comm_libs/mpi/bin/mpirun -np 2 -hostfile hostfile hostname

Output:

/opt/nvidia-20.11/hpc_sdk/Linux_x86_64/20.11/comm_libs/openmpi/openmpi-3.1.5/bin/.bin/orted: error while loading shared libraries: libnvcpumath.so: cannot open shared object file: No such file or directory

libnvcpumath.so is installed on both nodes:

ll /opt/nvidia-20.11/hpc_sdk/Linux_x86_64/20.11/compilers/lib/libnvcpumath.so
-rwxr-xr-x 1 root root 2420888 Dec 4 01:35 /opt/nvidia-20.11/hpc_sdk/Linux_x86_64/20.11/compilers/lib/libnvcpumath.so*
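
A file existing on disk does not by itself mean the dynamic loader searches its directory. As a rough check, ldd can be run on orted through a non-interactive ssh session, which should be close to the environment that mpirun's remote launch sees (assuming the default ssh launcher is in use). The helper name missing_libs is made up for illustration; it only filters ldd output for unresolved entries.

```shell
# missing_libs: read `ldd` output on stdin and print the names of the
# libraries the loader could not resolve (lines marked "not found").
# The function name is hypothetical, not part of any NVIDIA or Open MPI tool.
missing_libs() {
    awk '/not found/ { print $1 }'
}

# Example (host name and orted path taken from the post):
#   ssh geof30 ldd /opt/nvidia-20.11/hpc_sdk/Linux_x86_64/20.11/comm_libs/openmpi/openmpi-3.1.5/bin/.bin/orted | missing_libs
```

If libnvcpumath.so appears in the output, the loader on that node does not search the directory holding it in that kind of session.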

Setting the environment as in /opt/nvidia-20.11/hpc_sdk/modulefiles/nvhpc/20.11 and exporting it through mpirun does not help either:

/opt/nvidia-20.11/hpc_sdk/Linux_x86_64/20.11/comm_libs/mpi/bin/mpirun -x LD_LIBRARY_PATH -x PATH -x OPAL_PREFIX -np 2 -hostfile hostfile hostname

To summarize: I can compile my sources with the mpif90 from HPC SDK 20.11, but I can only run the resulting executables between nodes with the mpirun from PGI 19.10. Is there a way to make the mpirun from HPC SDK 20.11 work between nodes?

SOLVED:

  • create a file: vi /etc/ld.so.conf.d/nvidia.conf
  • insert a line: /opt/nvidia/hpc_sdk/Linux_x86_64/20.11/REDIST/compilers/lib
  • save and run ldconfig

(all as root).
Passing the directory through mpirun -x LD_LIBRARY_PATH did not help.
Similar steps were needed with HPC SDK 21.2 as well.
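
The steps above could be scripted roughly as follows, for running as root on each node. This is only a sketch: the helper name add_lib_dir is invented here, and the directory is the one quoted above, which may need adjusting to the actual install prefix on the machine.

```shell
# add_lib_dir CONF_FILE LIB_DIR
# Append LIB_DIR to CONF_FILE unless that exact line is already present.
# The function name is hypothetical; it is just a wrapper around touch/grep/printf.
add_lib_dir() {
    conf_file=$1
    lib_dir=$2
    touch "$conf_file" || return 1
    grep -qxF -- "$lib_dir" "$conf_file" || printf '%s\n' "$lib_dir" >> "$conf_file"
}

# Example (as root, on every node in the hostfile; paths as quoted in the post):
#   add_lib_dir /etc/ld.so.conf.d/nvidia.conf /opt/nvidia/hpc_sdk/Linux_x86_64/20.11/REDIST/compilers/lib
#   ldconfig
```

Appending only when the line is missing keeps the script safe to rerun. Afterwards, ldconfig -p | grep libnvcpumath should show whether the loader cache now lists the library.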