NVSHMEM runtime error

I’m compiling a basic NVSHMEM code to test FFT on the Polaris cluster. The MPI installed there is Cray MPICH, so I built the NVSHMEM MPI bootstrap plugin against the installed MPI. It compiled successfully, but when I run the code it shows this error:

/dvs/p4/build/sw/rel/gpgpu/toolkit/r12.0/main_nvshmem/src/host/bootstrap/bootstrap_loader.cpp:45: NULL value Bootstrap unable to load 'nvshmem_bootstrap_mpi.so'
	libmpi.so.40: cannot open shared object file: No such file or directory

/dvs/p4/build/sw/rel/gpgpu/toolkit/r12.0/main_nvshmem/src/host/bootstrap/bootstrap.cpp:29: non-zero status: -1 bootstrap_loader_init returned error

/dvs/p4/build/sw/rel/gpgpu/toolkit/r12.0/main_nvshmem/src/host/bootstrap/bootstrap_loader.cpp:45: NULL value Bootstrap unable to load 'nvshmem_bootstrap_mpi.so'
	libmpi.so.40: cannot open shared object file: No such file or directory

/dvs/p4/build/sw/rel/gpgpu/toolkit/r12.0/main_nvshmem/src/host/bootstrap/bootstrap.cpp:29: non-zero status: -1 bootstrap_loader_init returned error

/dvs/p4/build/sw/rel/gpgpu/toolkit/r12.0/main_nvshmem/src/host/init/init.cu:246: non-zero status: 7 bootstrap_init failed 

/dvs/p4/build/sw/rel/gpgpu/toolkit/r12.0/main_nvshmem/src/host/init/init.cu:978: non-zero status: 7 nvshmem_bootstrap failed 

/dvs/p4/build/sw/rel/gpgpu/toolkit/r12.0/main_nvshmem/src/device/init/init_device.cu:99: non-zero status: 7 nvshmem_internal_init_thread failed 

/opt/nvidia/hpc_sdk/Linux_x86_64/23.9/comm_libs/12.2/nvshmem/include/host/nvshmemx_api.h:57: non-zero status: 7: No such file or directory, exiting... aborting due to error in nvshmemi_init_thread 
/dvs/p4/build/sw/rel/gpgpu/toolkit/r12.0/main_nvshmem/src/host/bootstrap/bootstrap_loader.cpp:45: NULL value Bootstrap unable to load 'nvshmem_bootstrap_mpi.so'
	libmpi.so.40: cannot open shared object file: No such file or directory

/dvs/p4/build/sw/rel/gpgpu/toolkit/r12.0/main_nvshmem/src/host/bootstrap/bootstrap.cpp:29: non-zero status: -1 bootstrap_loader_init returned error

/dvs/p4/build/sw/rel/gpgpu/toolkit/r12.0/main_nvshmem/src/host/init/init.cu:246: non-zero status: 7 bootstrap_init failed 

/dvs/p4/build/sw/rel/gpgpu/toolkit/r12.0/main_nvshmem/src/host/init/init.cu:978: non-zero status: 7 nvshmem_bootstrap failed 

/dvs/p4/build/sw/rel/gpgpu/toolkit/r12.0/main_nvshmem/src/device/init/init_device.cu:99: non-zero status: 7 nvshmem_internal_init_thread failed 

/opt/nvidia/hpc_sdk/Linux_x86_64/23.9/comm_libs/12.2/nvshmem/include/host/nvshmemx_api.h:57: non-zero status: 7: No such file or directory, exiting... aborting due to error in nvshmemi_init_thread 
/dvs/p4/build/sw/rel/gpgpu/toolkit/r12.0/main_nvshmem/src/host/init/init.cu:246: non-zero status: 7 bootstrap_init failed 

/dvs/p4/build/sw/rel/gpgpu/toolkit/r12.0/main_nvshmem/src/host/init/init.cu:978: non-zero status: 7 nvshmem_bootstrap failed 

/dvs/p4/build/sw/rel/gpgpu/toolkit/r12.0/main_nvshmem/src/device/init/init_device.cu:99: non-zero status: 7 nvshmem_internal_init_thread failed 

/opt/nvidia/hpc_sdk/Linux_x86_64/23.9/comm_libs/12.2/nvshmem/include/host/nvshmemx_api.h:57: non-zero status: 7: No such file or directory, exiting... aborting due to error in nvshmemi_init_thread 
/dvs/p4/build/sw/rel/gpgpu/toolkit/r12.0/main_nvshmem/src/host/bootstrap/bootstrap_loader.cpp:45: NULL value Bootstrap unable to load 'nvshmem_bootstrap_mpi.so'
	libmpi.so.40: cannot open shared object file: No such file or directory

/dvs/p4/build/sw/rel/gpgpu/toolkit/r12.0/main_nvshmem/src/host/bootstrap/bootstrap.cpp:29: non-zero status: -1 bootstrap_loader_init returned error

/dvs/p4/build/sw/rel/gpgpu/toolkit/r12.0/main_nvshmem/src/host/init/init.cu:246: non-zero status: 7 bootstrap_init failed 

/dvs/p4/build/sw/rel/gpgpu/toolkit/r12.0/main_nvshmem/src/host/init/init.cu:978: non-zero status: 7 nvshmem_bootstrap failed 

/dvs/p4/build/sw/rel/gpgpu/toolkit/r12.0/main_nvshmem/src/device/init/init_device.cu:99: non-zero status: 7 nvshmem_internal_init_thread failed 

/opt/nvidia/hpc_sdk/Linux_x86_64/23.9/comm_libs/12.2/nvshmem/include/host/nvshmemx_api.h:57: non-zero status: 7: No such file or directory, exiting... aborting due to error in nvshmemi_init_thread 
x3004c0s13b1n0.hsn.cm.polaris.alcf.anl.gov: rank 2 exited with code 255

Also note that I have already set the NVSHMEM_BOOTSTRAP_PLUGIN variable so that NVSHMEM loads the correct bootstrap.

My compile line is:

nvcc  -std=c++14 -arch=sm_80 fft.cu  -I /opt/nvidia/hpc_sdk/Linux_x86_64/23.9/cuda/12.2//include/,/opt/cray/pe/mpich/8.1.28/ofi/nvidia/23.3/include/,/opt/nvidia/hpc_sdk/Linux_x86_64/23.9/comm_libs/12.2/nvshmem/include/  -L /opt/nvidia/hpc_sdk/Linux_x86_64/23.9/cuda/12.2//lib/,/opt/cray/pe/mpich/8.1.28/ofi/nvidia/23.3/lib/,/opt/nvidia/hpc_sdk/Linux_x86_64/23.9/comm_libs/12.2/nvshmem/lib/,/opt/cray/pe/pmi/6.1.13/lib -L/opt/cray/pe/mpich/8.1.28/gtl/lib -lmpi -lcufft -lnvshmem -lnvidia-ml -lcuda -lpmi -lmpi_gtl_cuda -Wno-deprecated-gpu-targets  -o FFT

I don’t know why it is still linking to Open MPI rather than MPICH: libmpi.so.40 is the Open MPI library, while libmpi.so.12 is the MPICH library.

Did you make sure that when you built the nvshmem_bootstrap_mpi.so plugin, you built it against Cray MPICH and not Open MPI? Can you run ldd nvshmem_bootstrap_mpi.so to confirm the link-time dependency on the correct MPI library?

libnvshmem.a will try to dlopen nvshmem_bootstrap_mpi.so, so as long as the bootstrap plugin that you built is linking against the correct MPI library, you shouldn’t see this problem.
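The ldd check suggested above can be narrowed to just the MPI entries. This is a minimal sketch: `mpi_deps` is a hypothetical helper name, and the plugin path in the usage comment is a placeholder for wherever your plugin was built.

```shell
# Filter `ldd` output (read from stdin) down to the MPI-related entries.
# mpi_deps is a hypothetical helper name, not part of NVSHMEM or any MPI.
mpi_deps() {
    # `|| true` keeps the exit status at 0 when nothing matches.
    grep -E 'libmpi[^ ]*\.so' || true
}

# Usage (the plugin path is a placeholder):
#   ldd /path/to/nvshmem_bootstrap_mpi.so | mpi_deps
```

A line such as `libmpi.so.40 => not found` would point at a plugin built against Open MPI, whereas MPI libraries resolved under the Cray MPICH installation indicate the plugin was linked against the intended MPI.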

        linux-vdso.so.1 (0x00007ffdd058f000)
        libmpi_gtl_cuda.so.0 => /opt/cray/pe/mpich/8.1.28/gtl/lib/libmpi_gtl_cuda.so.0 (0x00007f280098a000)
        libpmi.so.0 => /opt/cray/pe/pmi/6.1.13/lib/libpmi.so.0 (0x00007f2800967000)
        libmpi_nvidia.so.12 => /opt/cray/pe/mpich/8.1.28/ofi/nvidia/23.3/lib/libmpi_nvidia.so.12 (0x00007f27fe49e000)
        libnvomp.so => /opt/nvidia/hpc_sdk/Linux_x86_64/23.9/compilers/lib/libnvomp.so (0x00007f27fd400000)
        libdl.so.2 => /lib64/libdl.so.2 (0x00007f27fe47c000)
        libpthread.so.0 => /lib64/libpthread.so.0 (0x00007f27fe458000)
        libnvcpumath.so => /opt/nvidia/hpc_sdk/Linux_x86_64/23.9/compilers/lib/libnvcpumath.so (0x00007f27fce00000)
        libnvc.so => /opt/nvidia/hpc_sdk/Linux_x86_64/23.9/compilers/lib/libnvc.so (0x00007f27fca00000)
        libc.so.6 => /lib64/libc.so.6 (0x00007f27fc809000)
        libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x00007f27fe432000)
        libm.so.6 => /lib64/libm.so.6 (0x00007f27fd2b4000)
        libcudart.so.12 => /opt/nvidia/hpc_sdk/Linux_x86_64/23.9/cuda/lib64/libcudart.so.12 (0x00007f27fc400000)
        libcuda.so.1 => /usr/lib64/libcuda.so.1 (0x00007f27fa794000)
        libstdc++.so.6 => /usr/lib64/libstdc++.so.6 (0x00007f27fa54f000)
        /lib64/ld-linux-x86-64.so.2 (0x00007f28009d8000)
        libpals.so.0 => /opt/cray/pals/1.3.4/lib/libpals.so.0 (0x00007f27fe428000)
        libfabric.so.1 => /opt/cray/libfabric/1.15.2.0/lib64/libfabric.so.1 (0x00007f27fcd01000)
        libatomic.so.1 => /usr/lib64/libatomic.so.1 (0x00007f27fe41e000)
        librt.so.1 => /lib64/librt.so.1 (0x00007f27fe414000)
        libpmi2.so.0 => /opt/cray/pe/pmi/6.1.13/lib/libpmi2.so.0 (0x00007f27fd291000)
        libnvf.so => /opt/nvidia/hpc_sdk/Linux_x86_64/23.9/compilers/lib/libnvf.so (0x00007f27f9e00000)
        libjansson.so.4 => /usr/lib64/libjansson.so.4 (0x00007f27fe403000)
        libcxi.so.1 => /usr/lib64/libcxi.so.1 (0x00007f27fd26b000)
        libcurl.so.4 => /usr/lib64/libcurl.so.4 (0x00007f27fc75f000)
        libjson-c.so.3 => /usr/lib64/libjson-c.so.3 (0x00007f27f9a00000)
        libnl-3.so.200 => /usr/lib64/libnl-3.so.200 (0x00007f27f9600000)
        libnghttp2.so.14 => /usr/lib64/libnghttp2.so.14 (0x00007f27fd242000)
        libidn2.so.0 => /usr/lib64/libidn2.so.0 (0x00007f27f9200000)
        libssh.so.4 => /usr/lib64/libssh.so.4 (0x00007f27fcc92000)
        libpsl.so.5 => /usr/lib64/libpsl.so.5 (0x00007f27f8e00000)
        libssl.so.1.1 => /usr/lib64/libssl.so.1.1 (0x00007f27fc6c0000)
        libcrypto.so.1.1 => /usr/lib64/libcrypto.so.1.1 (0x00007f27f8ac1000)
        libgssapi_krb5.so.2 => /usr/lib64/libgssapi_krb5.so.2 (0x00007f27f9dae000)
        libldap_r-2.4.so.2 => /usr/lib64/libldap_r-2.4.so.2 (0x00007f27f9d59000)
        liblber-2.4.so.2 => /usr/lib64/liblber-2.4.so.2 (0x00007f27fcc82000)
        libzstd.so.1 => /usr/lib64/libzstd.so.1 (0x00007f27f9c28000)
        libbrotlidec.so.1 => /usr/lib64/libbrotlidec.so.1 (0x00007f27f8800000)
        libz.so.1 => /usr/lib64/libz.so.1 (0x00007f27fcc69000)
        libunistring.so.2 => /usr/lib64/libunistring.so.2 (0x00007f27f8400000)
        libjitterentropy.so.3 => /usr/lib64/libjitterentropy.so.3 (0x00007f27f8000000)
        libkrb5.so.3 => /usr/lib64/libkrb5.so.3 (0x00007f27f9926000)
        libk5crypto.so.3 => /usr/lib64/libk5crypto.so.3 (0x00007f27fc6a9000)
        libcom_err.so.2 => /lib64/libcom_err.so.2 (0x00007f27f7c00000)
        libkrb5support.so.0 => /usr/lib64/libkrb5support.so.0 (0x00007f27fa540000)
        libresolv.so.2 => /lib64/libresolv.so.2 (0x00007f27fa528000)
        libsasl2.so.3 => /usr/lib64/libsasl2.so.3 (0x00007f27fa50a000)
        libbrotlicommon.so.1 => /usr/lib64/libbrotlicommon.so.1 (0x00007f27f7800000)
        libkeyutils.so.1 => /usr/lib64/libkeyutils.so.1 (0x00007f27f7400000)
        libselinux.so.1 => /lib64/libselinux.so.1 (0x00007f27f7000000)
        libpcre.so.1 => /usr/lib64/libpcre.so.1 (0x00007f27f6c00000)

This is the ldd output for the bootstrap plugin .so library. It links against the correct MPI library, but I still get the same runtime error.

Hi,

It seems like the old plugin is still getting picked up by dlopen. Which file dlopen finds depends on your LD_LIBRARY_PATH setting (see man dlopen). It’s likely that the directory containing the old bootstrap is in your LD_LIBRARY_PATH because of the modules you have loaded.
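To see which copies of the plugin sit on LD_LIBRARY_PATH, and in what order, something like the following may help. It is only a sketch: `find_on_path` is a hypothetical helper, it assumes a POSIX-style shell (sh or bash), and it covers LD_LIBRARY_PATH only, not the other locations dlopen consults (RPATH/RUNPATH, the ld.so cache, the default system directories).

```shell
# List every copy of a library found on a colon-separated search path,
# in the order the directories appear on that path.
# find_on_path is a hypothetical helper name.
find_on_path() {
    lib_name=$1
    search_path=$2
    old_ifs=$IFS
    IFS=:
    for dir in $search_path; do
        # An empty entry is treated as the current directory.
        [ -z "$dir" ] && dir=.
        if [ -e "$dir/$lib_name" ]; then
            printf '%s\n' "$dir/$lib_name"
        fi
    done
    IFS=$old_ifs
}

find_on_path nvshmem_bootstrap_mpi.so "${LD_LIBRARY_PATH:-}"
```

If the first path printed is not the plugin you built, that earlier copy is the likely candidate for what dlopen loads; running ldd on it should show whether it depends on libmpi.so.40.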

Is the bootstrap plugin you built named nvshmem_bootstrap_mpi.so? If so, there are two things you can try to narrow this down.

  1. Set the value of NVSHMEM_BOOTSTRAP_MPI_PLUGIN to the absolute path of your custom built plugin
  2. Change the name of your plugin to something else and set the value of NVSHMEM_BOOTSTRAP_MPI_PLUGIN to the name (or preferably the absolute path) of your bootstrap plugin.

If your plugin is already named something else, it’s definitely not the one getting picked up by dlopen. Please try number two above.
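The two suggestions above could be applied along these lines. This is a minimal sketch under stated assumptions: the helper name `use_bootstrap_plugin`, the renamed plugin file, and the launch command in the usage comment are hypothetical, and exporting `NVSHMEM_BOOTSTRAP=MPI` assumes the job selects the MPI bootstrap through the environment (it may be unnecessary if the application already selects the MPI bootstrap at init time).

```shell
# Point NVSHMEM at a specific MPI bootstrap plugin by absolute path.
# use_bootstrap_plugin is a hypothetical helper name.
use_bootstrap_plugin() {
    plugin=$1
    # Require an absolute path so dlopen does not search LD_LIBRARY_PATH.
    case "$plugin" in
        /*) ;;
        *) echo "need an absolute path: $plugin" >&2; return 1 ;;
    esac
    if [ ! -f "$plugin" ]; then
        echo "plugin not found: $plugin" >&2
        return 1
    fi
    # Assumption: the MPI bootstrap is selected via the environment.
    export NVSHMEM_BOOTSTRAP=MPI
    export NVSHMEM_BOOTSTRAP_MPI_PLUGIN="$plugin"
}

# Usage (path, file name, and launch line are placeholders):
#   use_bootstrap_plugin "$HOME/plugins/nvshmem_bootstrap_craympich.so" \
#       && mpiexec -n 4 ./FFT
```

Renaming the plugin file, as in the second suggestion, makes it easy to tell from the error message whether your build or the stock nvshmem_bootstrap_mpi.so is the one being loaded.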