NVSHMEM runtime error

I’m compiling a basic NVSHMEM code to test FFT on the Polaris cluster. The MPI installed there is Cray MPICH, so I built the NVSHMEM bootstrap plugin against the installed MPI. It compiled successfully, but when I run the code it shows this error:

/dvs/p4/build/sw/rel/gpgpu/toolkit/r12.0/main_nvshmem/src/host/bootstrap/bootstrap_loader.cpp:45: NULL value Bootstrap unable to load 'nvshmem_bootstrap_mpi.so'
	libmpi.so.40: cannot open shared object file: No such file or directory

/dvs/p4/build/sw/rel/gpgpu/toolkit/r12.0/main_nvshmem/src/host/bootstrap/bootstrap.cpp:29: non-zero status: -1 bootstrap_loader_init returned error

/dvs/p4/build/sw/rel/gpgpu/toolkit/r12.0/main_nvshmem/src/host/bootstrap/bootstrap_loader.cpp:45: NULL value Bootstrap unable to load 'nvshmem_bootstrap_mpi.so'
	libmpi.so.40: cannot open shared object file: No such file or directory

/dvs/p4/build/sw/rel/gpgpu/toolkit/r12.0/main_nvshmem/src/host/bootstrap/bootstrap.cpp:29: non-zero status: -1 bootstrap_loader_init returned error

/dvs/p4/build/sw/rel/gpgpu/toolkit/r12.0/main_nvshmem/src/host/init/init.cu:246: non-zero status: 7 bootstrap_init failed 

/dvs/p4/build/sw/rel/gpgpu/toolkit/r12.0/main_nvshmem/src/host/init/init.cu:978: non-zero status: 7 nvshmem_bootstrap failed 

/dvs/p4/build/sw/rel/gpgpu/toolkit/r12.0/main_nvshmem/src/device/init/init_device.cu:99: non-zero status: 7 nvshmem_internal_init_thread failed 

/opt/nvidia/hpc_sdk/Linux_x86_64/23.9/comm_libs/12.2/nvshmem/include/host/nvshmemx_api.h:57: non-zero status: 7: No such file or directory, exiting... aborting due to error in nvshmemi_init_thread 
/dvs/p4/build/sw/rel/gpgpu/toolkit/r12.0/main_nvshmem/src/host/bootstrap/bootstrap_loader.cpp:45: NULL value Bootstrap unable to load 'nvshmem_bootstrap_mpi.so'
	libmpi.so.40: cannot open shared object file: No such file or directory

/dvs/p4/build/sw/rel/gpgpu/toolkit/r12.0/main_nvshmem/src/host/bootstrap/bootstrap.cpp:29: non-zero status: -1 bootstrap_loader_init returned error

/dvs/p4/build/sw/rel/gpgpu/toolkit/r12.0/main_nvshmem/src/host/init/init.cu:246: non-zero status: 7 bootstrap_init failed 

/dvs/p4/build/sw/rel/gpgpu/toolkit/r12.0/main_nvshmem/src/host/init/init.cu:978: non-zero status: 7 nvshmem_bootstrap failed 

/dvs/p4/build/sw/rel/gpgpu/toolkit/r12.0/main_nvshmem/src/device/init/init_device.cu:99: non-zero status: 7 nvshmem_internal_init_thread failed 

/opt/nvidia/hpc_sdk/Linux_x86_64/23.9/comm_libs/12.2/nvshmem/include/host/nvshmemx_api.h:57: non-zero status: 7: No such file or directory, exiting... aborting due to error in nvshmemi_init_thread 
/dvs/p4/build/sw/rel/gpgpu/toolkit/r12.0/main_nvshmem/src/host/init/init.cu:246: non-zero status: 7 bootstrap_init failed 

/dvs/p4/build/sw/rel/gpgpu/toolkit/r12.0/main_nvshmem/src/host/init/init.cu:978: non-zero status: 7 nvshmem_bootstrap failed 

/dvs/p4/build/sw/rel/gpgpu/toolkit/r12.0/main_nvshmem/src/device/init/init_device.cu:99: non-zero status: 7 nvshmem_internal_init_thread failed 

/opt/nvidia/hpc_sdk/Linux_x86_64/23.9/comm_libs/12.2/nvshmem/include/host/nvshmemx_api.h:57: non-zero status: 7: No such file or directory, exiting... aborting due to error in nvshmemi_init_thread 
/dvs/p4/build/sw/rel/gpgpu/toolkit/r12.0/main_nvshmem/src/host/bootstrap/bootstrap_loader.cpp:45: NULL value Bootstrap unable to load 'nvshmem_bootstrap_mpi.so'
	libmpi.so.40: cannot open shared object file: No such file or directory

/dvs/p4/build/sw/rel/gpgpu/toolkit/r12.0/main_nvshmem/src/host/bootstrap/bootstrap.cpp:29: non-zero status: -1 bootstrap_loader_init returned error

/dvs/p4/build/sw/rel/gpgpu/toolkit/r12.0/main_nvshmem/src/host/init/init.cu:246: non-zero status: 7 bootstrap_init failed 

/dvs/p4/build/sw/rel/gpgpu/toolkit/r12.0/main_nvshmem/src/host/init/init.cu:978: non-zero status: 7 nvshmem_bootstrap failed 

/dvs/p4/build/sw/rel/gpgpu/toolkit/r12.0/main_nvshmem/src/device/init/init_device.cu:99: non-zero status: 7 nvshmem_internal_init_thread failed 

/opt/nvidia/hpc_sdk/Linux_x86_64/23.9/comm_libs/12.2/nvshmem/include/host/nvshmemx_api.h:57: non-zero status: 7: No such file or directory, exiting... aborting due to error in nvshmemi_init_thread 
x3004c0s13b1n0.hsn.cm.polaris.alcf.anl.gov: rank 2 exited with code 255

Also note that I have already set the NVSHMEM_BOOTSTRAP_PLUGIN variable so that NVSHMEM loads the correct bootstrap.
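
As far as the NVSHMEM environment-variable documentation describes it, NVSHMEM_BOOTSTRAP_PLUGIN is only consulted when NVSHMEM_BOOTSTRAP is set to plugin, and a misspelled variable name is silently ignored. The error above still names the default nvshmem_bootstrap_mpi.so, so it may be worth confirming what the job environment actually exports. The helper names below are hypothetical and this is only a sketch of such a check, not an official diagnostic:

```shell
# Print every exported variable whose name starts with NVSHMEM_BOO, so that
# correctly spelled names and near-miss spellings show up side by side.
show_bootstrap_env() {
    env | grep -E '^NVSHMEM_BOO[A-Z0-9_]*=' | sort
}

# Report whether the two exact variable names discussed here are exported.
check_bootstrap_env() {
    for name in NVSHMEM_BOOTSTRAP NVSHMEM_BOOTSTRAP_PLUGIN; do
        if printenv "$name" >/dev/null 2>&1; then
            echo "$name is set"
        else
            echo "$name is NOT set"
        fi
    done
}

# Usage, run inside the job script just before launching the application:
#   show_bootstrap_env
#   check_bootstrap_env
```

If NVSHMEM_BOOTSTRAP turns out to be unset or set to MPI, the custom plugin path in NVSHMEM_BOOTSTRAP_PLUGIN would presumably not be used, which would match the default plugin name appearing in the log.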

My compile line is:

nvcc  -std=c++14 -arch=sm_80 fft.cu  -I /opt/nvidia/hpc_sdk/Linux_x86_64/23.9/cuda/12.2//include/,/opt/cray/pe/mpich/8.1.28/ofi/nvidia/23.3/include/,/opt/nvidia/hpc_sdk/Linux_x86_64/23.9/comm_libs/12.2/nvshmem/include/  -L /opt/nvidia/hpc_sdk/Linux_x86_64/23.9/cuda/12.2//lib/,/opt/cray/pe/mpich/8.1.28/ofi/nvidia/23.3/lib/,/opt/nvidia/hpc_sdk/Linux_x86_64/23.9/comm_libs/12.2/nvshmem/lib/,/opt/cray/pe/pmi/6.1.13/lib -L/opt/cray/pe/mpich/8.1.28/gtl/lib -lmpi -lcufft -lnvshmem -lnvidia-ml -lcuda -lpmi -lmpi_gtl_cuda -Wno-deprecated-gpu-targets  -o FFT

I don’t know why it is still linking to Open MPI rather than MPICH, since libmpi.so.40 is the Open MPI library while libmpi.so.12 is the MPICH library.


Did you make sure that when you built the nvshmem_bootstrap_mpi.so plugin, you built it using Cray MPICH and not Open MPI? Can you run ldd nvshmem_bootstrap_mpi.so to confirm the link-time dependency on the correct MPI library?

libnvshmem.a will try to dlopen nvshmem_bootstrap_mpi.so, so as long as the bootstrap plugin that you built is linking against the correct MPI library, you shouldn’t see this problem.
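
The ldd check can be narrowed to the lines that matter. The sketch below keeps only MPI-related entries and unresolved dependencies; the function name is hypothetical and the path in the usage comment is a placeholder. Going by the sonames mentioned in this thread, a line with libmpi.so.40 would point to an Open MPI build, while libmpi.so.12 or libmpi_nvidia.so.12 resolved under /opt/cray/pe/mpich would point to Cray MPICH.

```shell
# Keep only the lines of ldd output that mention an MPI library or an
# unresolved dependency (ldd prints "not found" for those).
# Note: grep exits with status 1 when nothing matches.
filter_mpi_deps() {
    grep -E 'libmpi|not found'
}

# Usage (the path is a placeholder for wherever the plugin was built):
#   ldd /path/to/nvshmem_bootstrap_mpi.so | filter_mpi_deps
```

Running the same filter on every nvshmem_bootstrap_mpi.so present on the system makes it easier to tell the custom build apart from any copy that ships with the toolkit.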

        linux-vdso.so.1 (0x00007ffdd058f000)
        libmpi_gtl_cuda.so.0 => /opt/cray/pe/mpich/8.1.28/gtl/lib/libmpi_gtl_cuda.so.0 (0x00007f280098a000)
        libpmi.so.0 => /opt/cray/pe/pmi/6.1.13/lib/libpmi.so.0 (0x00007f2800967000)
        libmpi_nvidia.so.12 => /opt/cray/pe/mpich/8.1.28/ofi/nvidia/23.3/lib/libmpi_nvidia.so.12 (0x00007f27fe49e000)
        libnvomp.so => /opt/nvidia/hpc_sdk/Linux_x86_64/23.9/compilers/lib/libnvomp.so (0x00007f27fd400000)
        libdl.so.2 => /lib64/libdl.so.2 (0x00007f27fe47c000)
        libpthread.so.0 => /lib64/libpthread.so.0 (0x00007f27fe458000)
        libnvcpumath.so => /opt/nvidia/hpc_sdk/Linux_x86_64/23.9/compilers/lib/libnvcpumath.so (0x00007f27fce00000)
        libnvc.so => /opt/nvidia/hpc_sdk/Linux_x86_64/23.9/compilers/lib/libnvc.so (0x00007f27fca00000)
        libc.so.6 => /lib64/libc.so.6 (0x00007f27fc809000)
        libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x00007f27fe432000)
        libm.so.6 => /lib64/libm.so.6 (0x00007f27fd2b4000)
        libcudart.so.12 => /opt/nvidia/hpc_sdk/Linux_x86_64/23.9/cuda/lib64/libcudart.so.12 (0x00007f27fc400000)
        libcuda.so.1 => /usr/lib64/libcuda.so.1 (0x00007f27fa794000)
        libstdc++.so.6 => /usr/lib64/libstdc++.so.6 (0x00007f27fa54f000)
        /lib64/ld-linux-x86-64.so.2 (0x00007f28009d8000)
        libpals.so.0 => /opt/cray/pals/1.3.4/lib/libpals.so.0 (0x00007f27fe428000)
        libfabric.so.1 => /opt/cray/libfabric/1.15.2.0/lib64/libfabric.so.1 (0x00007f27fcd01000)
        libatomic.so.1 => /usr/lib64/libatomic.so.1 (0x00007f27fe41e000)
        librt.so.1 => /lib64/librt.so.1 (0x00007f27fe414000)
        libpmi2.so.0 => /opt/cray/pe/pmi/6.1.13/lib/libpmi2.so.0 (0x00007f27fd291000)
        libnvf.so => /opt/nvidia/hpc_sdk/Linux_x86_64/23.9/compilers/lib/libnvf.so (0x00007f27f9e00000)
        libjansson.so.4 => /usr/lib64/libjansson.so.4 (0x00007f27fe403000)
        libcxi.so.1 => /usr/lib64/libcxi.so.1 (0x00007f27fd26b000)
        libcurl.so.4 => /usr/lib64/libcurl.so.4 (0x00007f27fc75f000)
        libjson-c.so.3 => /usr/lib64/libjson-c.so.3 (0x00007f27f9a00000)
        libnl-3.so.200 => /usr/lib64/libnl-3.so.200 (0x00007f27f9600000)
        libnghttp2.so.14 => /usr/lib64/libnghttp2.so.14 (0x00007f27fd242000)
        libidn2.so.0 => /usr/lib64/libidn2.so.0 (0x00007f27f9200000)
        libssh.so.4 => /usr/lib64/libssh.so.4 (0x00007f27fcc92000)
        libpsl.so.5 => /usr/lib64/libpsl.so.5 (0x00007f27f8e00000)
        libssl.so.1.1 => /usr/lib64/libssl.so.1.1 (0x00007f27fc6c0000)
        libcrypto.so.1.1 => /usr/lib64/libcrypto.so.1.1 (0x00007f27f8ac1000)
        libgssapi_krb5.so.2 => /usr/lib64/libgssapi_krb5.so.2 (0x00007f27f9dae000)
        libldap_r-2.4.so.2 => /usr/lib64/libldap_r-2.4.so.2 (0x00007f27f9d59000)
        liblber-2.4.so.2 => /usr/lib64/liblber-2.4.so.2 (0x00007f27fcc82000)
        libzstd.so.1 => /usr/lib64/libzstd.so.1 (0x00007f27f9c28000)
        libbrotlidec.so.1 => /usr/lib64/libbrotlidec.so.1 (0x00007f27f8800000)
        libz.so.1 => /usr/lib64/libz.so.1 (0x00007f27fcc69000)
        libunistring.so.2 => /usr/lib64/libunistring.so.2 (0x00007f27f8400000)
        libjitterentropy.so.3 => /usr/lib64/libjitterentropy.so.3 (0x00007f27f8000000)
        libkrb5.so.3 => /usr/lib64/libkrb5.so.3 (0x00007f27f9926000)
        libk5crypto.so.3 => /usr/lib64/libk5crypto.so.3 (0x00007f27fc6a9000)
        libcom_err.so.2 => /lib64/libcom_err.so.2 (0x00007f27f7c00000)
        libkrb5support.so.0 => /usr/lib64/libkrb5support.so.0 (0x00007f27fa540000)
        libresolv.so.2 => /lib64/libresolv.so.2 (0x00007f27fa528000)
        libsasl2.so.3 => /usr/lib64/libsasl2.so.3 (0x00007f27fa50a000)
        libbrotlicommon.so.1 => /usr/lib64/libbrotlicommon.so.1 (0x00007f27f7800000)
        libkeyutils.so.1 => /usr/lib64/libkeyutils.so.1 (0x00007f27f7400000)
        libselinux.so.1 => /lib64/libselinux.so.1 (0x00007f27f7000000)
        libpcre.so.1 => /usr/lib64/libpcre.so.1 (0x00007f27f6c00000)

This is the output of ldd on the bootstrap plugin .so library. It links against the correct MPI library, but I still get this runtime error.
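
One possible explanation, not confirmed in this thread, is that a different file with the same name is picked up at run time. When dlopen is given a bare file name, the directories in LD_LIBRARY_PATH are among the places searched (alongside any RPATH/RUNPATH and the ld.so cache), so a copy of nvshmem_bootstrap_mpi.so built against Open MPI that appears earlier on the path would be loaded instead of the custom build. The sketch below, with a hypothetical function name, lists every copy on a search path in order:

```shell
# Print every copy of a library found on a colon-separated search path,
# in the order the directories are listed.
find_on_path() {
    lib=$1
    search_path=$2
    old_ifs=$IFS
    IFS=:
    for dir in $search_path; do
        if [ -n "$dir" ] && [ -e "$dir/$lib" ]; then
            echo "$dir/$lib"
        fi
    done
    IFS=$old_ifs
}

# Usage: list the candidates on the current LD_LIBRARY_PATH
# (prints nothing if no copy is found):
#   find_on_path nvshmem_bootstrap_mpi.so "${LD_LIBRARY_PATH:-}"
```

If the first path printed is not the plugin that was built against Cray MPICH, running ldd on that first file should show whether it is the one asking for libmpi.so.40.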