I use NVSHMEM in my program, but it suddenly fails to finalize.
To reproduce it, I tested with the Attribute-Based Initialization Example from the Examples section of the NVSHMEM 2.10.1 documentation.
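For reference, this is roughly what I am running, condensed from that docs example as I remember it (a sketch rather than my exact file; the init/teardown sequence is the part that matters):

#include <stdio.h>
#include <mpi.h>
#include <cuda_runtime.h>
#include <nvshmem.h>
#include <nvshmemx.h>

__global__ void simple_shift(int *destination) {
    int mype = nvshmem_my_pe();
    int npes = nvshmem_n_pes();
    int peer = (mype + 1) % npes;
    nvshmem_int_p(destination, mype, peer);   // put my PE id on the next PE
}

int main(int argc, char *argv[]) {
    int msg;
    cudaStream_t stream;
    nvshmemx_init_attr_t attr;
    MPI_Comm mpi_comm = MPI_COMM_WORLD;

    MPI_Init(&argc, &argv);
    attr.mpi_comm = &mpi_comm;
    nvshmemx_init_attr(NVSHMEMX_INIT_WITH_MPI_COMM, &attr);  // attribute-based init

    int mype_node = nvshmem_team_my_pe(NVSHMEMX_TEAM_NODE);
    cudaSetDevice(mype_node);
    cudaStreamCreate(&stream);

    int *destination = (int *) nvshmem_malloc(sizeof(int));
    simple_shift<<<1, 1, 0, stream>>>(destination);
    nvshmemx_barrier_all_on_stream(stream);
    cudaMemcpyAsync(&msg, destination, sizeof(int), cudaMemcpyDeviceToHost, stream);
    cudaStreamSynchronize(stream);
    printf("%d: received message %d\n", nvshmem_my_pe(), msg);

    nvshmem_free(destination);   // removing this and the next call makes the error go away
    nvshmem_finalize();          // the "Invalid context pointer" abort happens here
    MPI_Finalize();
    return 0;
}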
It raises:
/dvs/p4/build/sw/rel/gpgpu/toolkit/r11.8/main_nvshmem/src/host/init/init.cu:register_state_ptr:93: Redundant common pointer registered, ignoring.
/dvs/p4/build/sw/rel/gpgpu/toolkit/r11.8/main_nvshmem/src/host/init/init.cu:register_state_ptr:93: Redundant common pointer registered, ignoring.
1: received message 0
0: received message 1
/dvs/p4/build/sw/rel/gpgpu/toolkit/r11.8/main_nvshmem/src/host/init/init.cu:1051: non-zero status: 1 Invalid context pointer passed to nvshmemx_host_finalize.
/dvs/p4/build/sw/rel/gpgpu/toolkit/r11.8/main_nvshmem/src/host/init/init.cu:nvshmemx_host_finalize:1128: /dvs/p4/build/sw/rel/gpgpu/toolkit/r11.8/main_nvshmem/src/host/init/init.cu:1051: non-zero status: 1 Invalid context pointer passed to nvshmemx_host_finalize.
/dvs/p4/build/sw/rel/gpgpu/toolkit/r11.8/main_nvshmem/src/host/init/init.cu:nvshmemx_host_finalize:1128: aborting due to error in nvshmem_finalize
aborting due to error in nvshmem_finalize
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:
Process name: [[2802,1],1]
Exit code: 255
--------------------------------------------------------------------------
If I do not call nvshmem_finalize and nvshmem_free, the error disappears.
How can I fix it?
For now, the only workaround is the one you mentioned in your comment: removing nvshmem_finalize will prevent the error during finalization.
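Concretely, that just means dropping the explicit teardown calls at the end of the example (a sketch of the tail of main; everything above stays the same):

    ...
    // workaround: skip the explicit NVSHMEM teardown for now
    // nvshmem_free(destination);
    // nvshmem_finalize();
    MPI_Finalize();
    return 0;
}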
Thanks for bringing this to our attention!
When I run
CUDA_VISIBLE_DEVICES=0,1 mpirun --allow-run-as-root -np 2 ./test
it raises:
mpirun --allow-run-as-root -np 2 ./mpi-based-init
/dgl/local/nvshmem_src_2.10.1-3/src/modules/transport/ibrc/ibrc.cpp:420: non-zero status: 19 ibv_modify_qp failed
/dgl/local/nvshmem_src_2.10.1-3/src/modules/transport/ibrc/ibrc.cpp:481: non-zero status: 7 ep_connect failed
/dgl/local/nvshmem_src_2.10.1-3/src/modules/transport/ibrc/ibrc.cpp:1397: non-zero status: 7 cst setup failed
/dgl/local/nvshmem_src_2.10.1-3/src/modules/transport/ibrc/ibrc.cpp:420: non-zero status: 19 ibv_modify_qp failed
/dgl/local/nvshmem_src_2.10.1-3/src/modules/transport/ibrc/ibrc.cpp:481: non-zero status: 7 ep_connect failed
/dgl/local/nvshmem_src_2.10.1-3/src/modules/transport/ibrc/ibrc.cpp:1397: non-zero status: 7 cst setup failed
/dgl/local/nvshmem_src_2.10.1-3/src/modules/transport/ibrc/ibrc.cpp:420: non-zero status: 19 ibv_modify_qp failed
/dgl/local/nvshmem_src_2.10.1-3/src/modules/transport/ibrc/ibrc.cpp:1430: non-zero status: 7 ep_connect failed
/dgl/local/nvshmem_src_2.10.1-3/src/modules/transport/ibrc/ibrc.cpp:1484: non-zero status: 7 transport create connect failed
/dgl/local/nvshmem_src_2.10.1-3/src/modules/transport/ibrc/ibrc.cpp:420: non-zero status: 19 ibv_modify_qp failed
/dgl/local/nvshmem_src_2.10.1-3/src/modules/transport/ibrc/ibrc.cpp:1430: non-zero status: 7 ep_connect failed
/dgl/local/nvshmem_src_2.10.1-3/src/modules/transport/ibrc/ibrc.cpp:1484: non-zero status: 7 transport create connect failed
/dgl/local/nvshmem_src_2.10.1-3/src/host/transport/transport.cpp:348: non-zero status: 7 endpoint connection failed
/dgl/local/nvshmem_src_2.10.1-3/src/host/transport/transport.cpp:348: non-zero status: 7 endpoint connection failed
/dgl/local/nvshmem_src_2.10.1-3/src/host/init/init.cu:852: non-zero status: 7 nvshmem setup connections failed
/dgl/local/nvshmem_src_2.10.1-3/src/host/init/init.cu:nvshmemi_check_state_and_init:933: nvshmem initialization failed, exiting
/dgl/local/nvshmem_src_2.10.1-3/src/host/init/init.cu:852: non-zero status: 7 nvshmem setup connections failed
/dgl/local/nvshmem_src_2.10.1-3/src/host/init/init.cu:nvshmemi_check_state_and_init:933: nvshmem initialization failed, exiting
/dgl/local/nvshmem_src_2.10.1-3/src/util/cs.cpp:23: non-zero status: 16: No such device, exiting... mutex destroy failed
/dgl/local/nvshmem_src_2.10.1-3/src/util/cs.cpp:23: non-zero status: 16: No such device, exiting... mutex destroy failed
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:
Process name: [[39538,1],1]
Exit code: 255
--------------------------------------------------------------------------
I installed it in /usr/local/nvshmem, but the paths in the error log point to my build directory.
Here is my build log:
-- NVSHMEM_PREFIX: /usr/local/nvshmem
-- NVSHMEM_DEVEL: OFF
-- NVSHMEM_DEBUG: OFF
-- NVSHMEM_DEFAULT_PMI2: OFF
-- NVSHMEM_DEFAULT_PMIX: OFF
-- NVSHMEM_DEFAULT_UCX: OFF
-- NVSHMEM_DISABLE_COLL_POLL: ON
-- NVSHMEM_ENABLE_ALL_DEVICE_INLINING: OFF
-- NVSHMEM_ENV_ALL: OFF
-- NVSHMEM_GPU_COLL_USE_LDST: OFF
-- NVSHMEM_IBGDA_SUPPORT: ON
-- NVSHMEM_IBDEVX_SUPPORT: OFF
-- NVSHMEM_IBRC_SUPPORT: ON
-- NVSHMEM_LIBFABRIC_SUPPORT: OFF
-- NVSHMEM_MPI_SUPPORT: ON
-- MPI_HOME: /usr/local/ompi
-- NVSHMEM_NVTX: ON
-- NVSHMEM_PMIX_SUPPORT: OFF
-- NVSHMEM_SHMEM_SUPPORT: OFF
-- NVSHMEM_TEST_STATIC_LIB: OFF
-- NVSHMEM_TIMEOUT_DEVICE_POLLING: OFF
-- NVSHMEM_TRACE: OFF
-- NVSHMEM_UCX_SUPPORT: OFF
-- NVSHMEM_USE_DLMALLOC: OFF
-- NVSHMEM_USE_NCCL: ON
-- NCCL_HOME: /opt/conda/envs/pytorch-ci
-- NVSHMEM_USE_GDRCOPY: ON
-- GDRCOPY_HOME: /usr/local/gdrcopy
-- NVSHMEM_VERBOSE: OFF
-- Setting build type to '' as none was specified.
-- CUDA_HOME: /usr/local/cuda
-- The CUDA compiler identification is NVIDIA 11.8.89
-- The CXX compiler identification is GNU 11.4.0
-- The C compiler identification is GNU 11.4.0
-- Detecting CUDA compiler ABI info
-- Detecting CUDA compiler ABI info - done
-- Check for working CUDA compiler: /usr/local/cuda/bin/nvcc - skipped
-- Detecting CUDA compile features
-- Detecting CUDA compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /usr/bin/c++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /usr/bin/cc - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Found CUDAToolkit: /usr/local/cuda/include (found version "11.8.89")
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Success
-- Found Threads: TRUE
-- CMAKE_CUDA_ARCHITECTURES: 70;80;90
-- Found MPI_C: /usr/lib/x86_64-linux-gnu/openmpi/lib/libmpi.so (found version "3.1")
-- Found MPI_CXX: /usr/lib/x86_64-linux-gnu/openmpi/lib/libmpi_cxx.so (found version "3.1")
-- Found MPI: TRUE (found version "3.1")
-- Performing Test NVCC_THREADS
-- Performing Test NVCC_THREADS - Success
-- Performing Test HAVE_MLX5DV_UAR_ALLOC_TYPE_NC_DEDICATED
-- Performing Test HAVE_MLX5DV_UAR_ALLOC_TYPE_NC_DEDICATED - Failed
-- Configuring done (7.4s)
-- Generating done (0.2s)
-- Build files have been written to: /dgl/local/nvshmem_src_2.10.1-3/build
How can I fix it?
My nvidia-smi output is:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.105.01 Driver Version: 515.105.01 CUDA Version: 11.8 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA A100 80G... Off | 00000000:31:00.0 Off | 0 |
| N/A 37C P0 76W / 300W | 8812MiB / 81920MiB | 100% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 1 NVIDIA A800 80G... Off | 00000000:4B:00.0 Off | 0 |
| N/A 63C P0 231W / 300W | 45337MiB / 81920MiB | 100% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
+-----------------------------------------------------------------------------+
Since you are using the IB transport inside your container environment, can you share some additional output from the following nvidia-smi and ibdev commands, run from inside the container?
nvidia-smi topo -m
ibdev2netdev
ibv_devinfo
PS: If you do not wish to use IB transports to communicate between the 2 GPUs and want to use the P2P transport (NVLink) instead, you can set NVSHMEM_REMOTE_TRANSPORT=None at runtime to bypass initialization of the IB transport.
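For example, for a single-node run something like this should force the P2P path (adapting your earlier launch line; shown here only as an illustration):

NVSHMEM_REMOTE_TRANSPORT=None CUDA_VISIBLE_DEVICES=0,1 mpirun --allow-run-as-root -np 2 ./mpi-based-init

If the variable does not reach the ranks in your setup, forwarding it with mpirun's -x option (e.g. -x NVSHMEM_REMOTE_TRANSPORT=None) is another way to pass it.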