I’m compiling a basic NVSHMEM code to test FFT on the Polaris cluster. The MPI installed there is Cray MPICH, so I built the NVSHMEM bootstrap plugin against the installed MPI. It compiled successfully, but when I ran the code it showed this error:
/dvs/p4/build/sw/rel/gpgpu/toolkit/r12.0/main_nvshmem/src/host/bootstrap/bootstrap_loader.cpp:45: NULL value Bootstrap unable to load 'nvshmem_bootstrap_mpi.so'
libmpi.so.40: cannot open shared object file: No such file or directory
/dvs/p4/build/sw/rel/gpgpu/toolkit/r12.0/main_nvshmem/src/host/bootstrap/bootstrap.cpp:29: non-zero status: -1 bootstrap_loader_init returned error
/dvs/p4/build/sw/rel/gpgpu/toolkit/r12.0/main_nvshmem/src/host/bootstrap/bootstrap_loader.cpp:45: NULL value Bootstrap unable to load 'nvshmem_bootstrap_mpi.so'
libmpi.so.40: cannot open shared object file: No such file or directory
/dvs/p4/build/sw/rel/gpgpu/toolkit/r12.0/main_nvshmem/src/host/bootstrap/bootstrap.cpp:29: non-zero status: -1 bootstrap_loader_init returned error
/dvs/p4/build/sw/rel/gpgpu/toolkit/r12.0/main_nvshmem/src/host/init/init.cu:246: non-zero status: 7 bootstrap_init failed
/dvs/p4/build/sw/rel/gpgpu/toolkit/r12.0/main_nvshmem/src/host/init/init.cu:978: non-zero status: 7 nvshmem_bootstrap failed
/dvs/p4/build/sw/rel/gpgpu/toolkit/r12.0/main_nvshmem/src/device/init/init_device.cu:99: non-zero status: 7 nvshmem_internal_init_thread failed
/opt/nvidia/hpc_sdk/Linux_x86_64/23.9/comm_libs/12.2/nvshmem/include/host/nvshmemx_api.h:57: non-zero status: 7: No such file or directory, exiting... aborting due to error in nvshmemi_init_thread
/dvs/p4/build/sw/rel/gpgpu/toolkit/r12.0/main_nvshmem/src/host/bootstrap/bootstrap_loader.cpp:45: NULL value Bootstrap unable to load 'nvshmem_bootstrap_mpi.so'
libmpi.so.40: cannot open shared object file: No such file or directory
/dvs/p4/build/sw/rel/gpgpu/toolkit/r12.0/main_nvshmem/src/host/bootstrap/bootstrap.cpp:29: non-zero status: -1 bootstrap_loader_init returned error
/dvs/p4/build/sw/rel/gpgpu/toolkit/r12.0/main_nvshmem/src/host/init/init.cu:246: non-zero status: 7 bootstrap_init failed
/dvs/p4/build/sw/rel/gpgpu/toolkit/r12.0/main_nvshmem/src/host/init/init.cu:978: non-zero status: 7 nvshmem_bootstrap failed
/dvs/p4/build/sw/rel/gpgpu/toolkit/r12.0/main_nvshmem/src/device/init/init_device.cu:99: non-zero status: 7 nvshmem_internal_init_thread failed
/opt/nvidia/hpc_sdk/Linux_x86_64/23.9/comm_libs/12.2/nvshmem/include/host/nvshmemx_api.h:57: non-zero status: 7: No such file or directory, exiting... aborting due to error in nvshmemi_init_thread
/dvs/p4/build/sw/rel/gpgpu/toolkit/r12.0/main_nvshmem/src/host/init/init.cu:246: non-zero status: 7 bootstrap_init failed
/dvs/p4/build/sw/rel/gpgpu/toolkit/r12.0/main_nvshmem/src/host/init/init.cu:978: non-zero status: 7 nvshmem_bootstrap failed
/dvs/p4/build/sw/rel/gpgpu/toolkit/r12.0/main_nvshmem/src/device/init/init_device.cu:99: non-zero status: 7 nvshmem_internal_init_thread failed
/opt/nvidia/hpc_sdk/Linux_x86_64/23.9/comm_libs/12.2/nvshmem/include/host/nvshmemx_api.h:57: non-zero status: 7: No such file or directory, exiting... aborting due to error in nvshmemi_init_thread
/dvs/p4/build/sw/rel/gpgpu/toolkit/r12.0/main_nvshmem/src/host/bootstrap/bootstrap_loader.cpp:45: NULL value Bootstrap unable to load 'nvshmem_bootstrap_mpi.so'
libmpi.so.40: cannot open shared object file: No such file or directory
/dvs/p4/build/sw/rel/gpgpu/toolkit/r12.0/main_nvshmem/src/host/bootstrap/bootstrap.cpp:29: non-zero status: -1 bootstrap_loader_init returned error
/dvs/p4/build/sw/rel/gpgpu/toolkit/r12.0/main_nvshmem/src/host/init/init.cu:246: non-zero status: 7 bootstrap_init failed
/dvs/p4/build/sw/rel/gpgpu/toolkit/r12.0/main_nvshmem/src/host/init/init.cu:978: non-zero status: 7 nvshmem_bootstrap failed
/dvs/p4/build/sw/rel/gpgpu/toolkit/r12.0/main_nvshmem/src/device/init/init_device.cu:99: non-zero status: 7 nvshmem_internal_init_thread failed
/opt/nvidia/hpc_sdk/Linux_x86_64/23.9/comm_libs/12.2/nvshmem/include/host/nvshmemx_api.h:57: non-zero status: 7: No such file or directory, exiting... aborting due to error in nvshmemi_init_thread
x3004c0s13b1n0.hsn.cm.polaris.alcf.anl.gov: rank 2 exited with code 255
Also note that I already set the NVSHMEM_BOOTSTRAP_PLUGIN variable correctly so that NVSHMEM loads the correct bootstrap.
The MPI installed there is Cray MPICH, so I built the NVSHMEM bootstrap plugin against the installed MPI. It compiled successfully, but when I ran the code it showed this error
Did you make sure that the nvshmem_bootstrap_mpi.so plugin was built against Cray MPICH and not Open MPI? Can you run ldd nvshmem_bootstrap_mpi.so to confirm the link-time dependency on the correct MPI library?
libnvshmem.a will try to dlopen nvshmem_bootstrap_mpi.so, so as long as the bootstrap plugin that you built is linking against the correct MPI library, you shouldn’t see this problem.
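As a rough illustration of the ldd check, here is a minimal Python sketch that runs ldd on a plugin and lists any dependency reported as "not found". The function names and the plugin path you pass in are hypothetical. In the log above the unresolved library is libmpi.so.40, which as far as I know is the soname Open MPI 4.x ships, so seeing it here would suggest the plugin being loaded was not linked against Cray MPICH.

```python
import subprocess
from typing import List


def unresolved_dependencies(ldd_output: str) -> List[str]:
    """Return the library names that ldd reports as 'not found'."""
    missing = []
    for line in ldd_output.splitlines():
        # ldd prints unresolved entries as "libfoo.so.N => not found"
        if "=> not found" in line:
            missing.append(line.split("=>")[0].strip())
    return missing


def check_plugin(plugin_path: str) -> List[str]:
    """Run ldd on plugin_path (a path you supply) and return missing libraries."""
    result = subprocess.run(
        ["ldd", plugin_path], capture_output=True, text=True, check=True
    )
    return unresolved_dependencies(result.stdout)
```

Calling check_plugin with the absolute path of the plugin you built should return an empty list if every dependency resolves in the current environment. Run it on a compute node with the same modules loaded as the job, since the result depends on LD_LIBRARY_PATH.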
It seems like the old one is still getting picked up by dlopen. Which file gets loaded depends on your LD_LIBRARY_PATH setting (see man dlopen). It’s likely that the path where the old bootstrap exists is in your LD_LIBRARY_PATH because of the modules you have loaded.
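To see which copies of the plugin are visible through LD_LIBRARY_PATH, something like the following sketch may help. It only walks the directories in LD_LIBRARY_PATH in order; the real dlopen search also consults RPATH/RUNPATH, the ld.so cache, and the default system directories, so treat the result as a hint rather than the full picture. The function name is hypothetical.

```python
import os
from typing import List


def find_candidates(name: str, ld_library_path: str) -> List[str]:
    """List files called `name` in each LD_LIBRARY_PATH directory, in search order."""
    hits = []
    for directory in ld_library_path.split(":"):
        if not directory:
            # Empty entries are skipped here for simplicity.
            continue
        candidate = os.path.join(directory, name)
        if os.path.isfile(candidate):
            hits.append(candidate)
    return hits


if __name__ == "__main__":
    search_path = os.environ.get("LD_LIBRARY_PATH", "")
    for path in find_candidates("nvshmem_bootstrap_mpi.so", search_path):
        print(path)
```

If the first path printed is the copy under the HPC SDK install rather than your own build, that would be consistent with the old plugin being picked up.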
Is the bootstrap plugin you built named nvshmem_bootstrap_mpi.so? If so, there are two things you can try to narrow this down.
1. Set the value of NVSHMEM_BOOTSTRAP_MPI_PLUGIN to the absolute path of your custom-built plugin.
2. Change the name of your plugin to something else and set the value of NVSHMEM_BOOTSTRAP_MPI_PLUGIN to the name (or preferably the absolute path) of your bootstrap plugin.
If your plugin is already named something else, it’s definitely not the one getting picked up by dlopen. Please try number two above.
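Either suggestion comes down to pointing NVSHMEM_BOOTSTRAP_MPI_PLUGIN at your own file. A minimal sketch of building such an environment in Python is below. The function name and any plugin path you pass in are hypothetical, and setting NVSHMEM_BOOTSTRAP=MPI is my assumption about how the MPI bootstrap is selected at run time; drop that line if you initialize NVSHMEM with the MPI attribute in code.

```python
import os
from typing import Dict, Optional


def bootstrap_env(
    plugin_path: str, base_env: Optional[Dict[str, str]] = None
) -> Dict[str, str]:
    """Return a copy of the environment that points NVSHMEM at plugin_path."""
    env = dict(os.environ if base_env is None else base_env)
    # Assumption: select the MPI bootstrap through the environment.
    env["NVSHMEM_BOOTSTRAP"] = "MPI"
    # An absolute path avoids any lookup through LD_LIBRARY_PATH.
    env["NVSHMEM_BOOTSTRAP_MPI_PLUGIN"] = os.path.abspath(plugin_path)
    return env
```

The returned dictionary can be passed as env= to subprocess.run when starting your launcher, or you can simply export the same two variables in the job script before launching.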