NVSHMEM fails to finalize

I use NVSHMEM in my program, but it suddenly fails to finalize.
To test it, I use the Attribute-Based Initialization Example from Examples — NVSHMEM 2.10.1 documentation.
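For reference, the test program is essentially the documented attribute-based example; here is a condensed version from memory of the 2.10.1 docs, so minor details may differ from my exact copy:

#include <stdio.h>
#include <cuda_runtime.h>
#include <mpi.h>
#include <nvshmem.h>
#include <nvshmemx.h>

/* Each PE writes its own rank into the symmetric buffer of the next PE. */
__global__ void simple_shift(int *destination) {
    int mype = nvshmem_my_pe();
    int npes = nvshmem_n_pes();
    int peer = (mype + 1) % npes;
    nvshmem_int_p(destination, mype, peer);
}

int main(int argc, char *argv[]) {
    int rank, nranks, msg;
    cudaStream_t stream;
    MPI_Comm mpi_comm = MPI_COMM_WORLD;
    nvshmemx_init_attr_t attr;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    /* Attribute-based initialization on top of an existing MPI communicator. */
    attr.mpi_comm = &mpi_comm;
    nvshmemx_init_attr(NVSHMEMX_INIT_WITH_MPI_COMM, &attr);

    cudaSetDevice(nvshmem_team_my_pe(NVSHMEMX_TEAM_NODE));
    cudaStreamCreate(&stream);

    int *destination = (int *) nvshmem_malloc(sizeof(int));

    simple_shift<<<1, 1, 0, stream>>>(destination);
    nvshmemx_barrier_all_on_stream(stream);
    cudaMemcpyAsync(&msg, destination, sizeof(int), cudaMemcpyDeviceToHost, stream);
    cudaStreamSynchronize(stream);
    printf("%d: received message %d\n", nvshmem_my_pe(), msg);

    nvshmem_free(destination);
    nvshmem_finalize();   /* <-- this is the call that fails */
    MPI_Finalize();
    return 0;
}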
When I run it, it raises:

/dvs/p4/build/sw/rel/gpgpu/toolkit/r11.8/main_nvshmem/src/host/init/init.cu:register_state_ptr:93: Redundant common pointer registered, ignoring.
/dvs/p4/build/sw/rel/gpgpu/toolkit/r11.8/main_nvshmem/src/host/init/init.cu:register_state_ptr:93: Redundant common pointer registered, ignoring.
1: received message 0
0: received message 1
/dvs/p4/build/sw/rel/gpgpu/toolkit/r11.8/main_nvshmem/src/host/init/init.cu:1051: non-zero status: 1 Invalid context pointer passed to nvshmemx_host_finalize.
/dvs/p4/build/sw/rel/gpgpu/toolkit/r11.8/main_nvshmem/src/host/init/init.cu:nvshmemx_host_finalize:1128: /dvs/p4/build/sw/rel/gpgpu/toolkit/r11.8/main_nvshmem/src/host/init/init.cu:1051: non-zero status: 1 Invalid context pointer passed to nvshmemx_host_finalize.
/dvs/p4/build/sw/rel/gpgpu/toolkit/r11.8/main_nvshmem/src/host/init/init.cu:nvshmemx_host_finalize:1128: aborting due to error in nvshmem_finalize
aborting due to error in nvshmem_finalize

--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[2802,1],1]
  Exit code:    255
--------------------------------------------------------------------------

If I do not call nvshmem_finalize and nvshmem_free, the error disappears.
How can I fix it?

Hi,

This was a regression in 2.10.1-3 and is fixed in our repo. Our 2.11 release (TBA) will contain the fix.

Thanks, Seth

For now, the only workaround is the one you mentioned in your comment: removing the nvshmem_finalize call avoids the error at finalization.
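Concretely, against the example above, the tail of main() would look something like this for now (just a sketch; the skipped cleanup is reclaimed when the process exits):

    /* Workaround for the 2.10.1-3 finalize regression:
     * skip explicit NVSHMEM cleanup and let process teardown reclaim
     * the symmetric allocation and CUDA resources. */
    // nvshmem_free(destination);
    // nvshmem_finalize();
    MPI_Finalize();
    return 0;
}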
Thanks for bringing this to our attention!

Now I am using NVSHMEM in a Docker container.
I build and install NVSHMEM with this environment:

export CUDA_HOME=/usr/local/cuda
export GDRCOPY_HOME=/usr/local/gdrcopy
export NVSHMEM_MPI_SUPPORT=1
export MPI_HOME=/usr
# export MPICC=/usr/bin/mpicc
export NVSHMEM_USE_NCCL=1
export NCCL_HOME=/opt/conda/envs/pytorch-ci
export NVSHMEM_IBGDA_SUPPORT=1
export NVSHMEM_BOOTSTRAP_PMI=0
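
and then configure and build roughly like this (a sketch from memory; the exact cmake flags may differ, but NVSHMEM_PREFIX matches the build log below):

cd /dgl/local/nvshmem_src_2.10.1-3
mkdir -p build && cd build
cmake -DNVSHMEM_PREFIX=/usr/local/nvshmem ..
make -j$(nproc) install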

But when I run the official example with

CUDA_VISIBLE_DEVICES=0,1 mpirun --allow-run-as-root -np 2 ./test
it raises:

mpirun --allow-run-as-root -np 2 ./mpi-based-init
/dgl/local/nvshmem_src_2.10.1-3/src/modules/transport/ibrc/ibrc.cpp:420: non-zero status: 19 ibv_modify_qp failed
/dgl/local/nvshmem_src_2.10.1-3/src/modules/transport/ibrc/ibrc.cpp:481: non-zero status: 7 ep_connect failed
/dgl/local/nvshmem_src_2.10.1-3/src/modules/transport/ibrc/ibrc.cpp:1397: non-zero status: 7 cst setup failed
/dgl/local/nvshmem_src_2.10.1-3/src/modules/transport/ibrc/ibrc.cpp:420: non-zero status: 19 ibv_modify_qp failed
/dgl/local/nvshmem_src_2.10.1-3/src/modules/transport/ibrc/ibrc.cpp:481: non-zero status: 7 ep_connect failed
/dgl/local/nvshmem_src_2.10.1-3/src/modules/transport/ibrc/ibrc.cpp:1397: non-zero status: 7 cst setup failed
/dgl/local/nvshmem_src_2.10.1-3/src/modules/transport/ibrc/ibrc.cpp:420: non-zero status: 19 ibv_modify_qp failed
/dgl/local/nvshmem_src_2.10.1-3/src/modules/transport/ibrc/ibrc.cpp:1430: non-zero status: 7 ep_connect failed
/dgl/local/nvshmem_src_2.10.1-3/src/modules/transport/ibrc/ibrc.cpp:1484: non-zero status: 7 transport create connect failed
/dgl/local/nvshmem_src_2.10.1-3/src/modules/transport/ibrc/ibrc.cpp:420: non-zero status: 19 ibv_modify_qp failed
/dgl/local/nvshmem_src_2.10.1-3/src/modules/transport/ibrc/ibrc.cpp:1430: non-zero status: 7 ep_connect failed
/dgl/local/nvshmem_src_2.10.1-3/src/modules/transport/ibrc/ibrc.cpp:1484: non-zero status: 7 transport create connect failed
/dgl/local/nvshmem_src_2.10.1-3/src/host/transport/transport.cpp:348: non-zero status: 7 endpoint connection failed
/dgl/local/nvshmem_src_2.10.1-3/src/host/transport/transport.cpp:348: non-zero status: 7 endpoint connection failed
/dgl/local/nvshmem_src_2.10.1-3/src/host/init/init.cu:852: non-zero status: 7 nvshmem setup connections failed
/dgl/local/nvshmem_src_2.10.1-3/src/host/init/init.cu:nvshmemi_check_state_and_init:933: nvshmem initialization failed, exiting
/dgl/local/nvshmem_src_2.10.1-3/src/host/init/init.cu:852: non-zero status: 7 nvshmem setup connections failed
/dgl/local/nvshmem_src_2.10.1-3/src/host/init/init.cu:nvshmemi_check_state_and_init:933: nvshmem initialization failed, exiting
/dgl/local/nvshmem_src_2.10.1-3/src/util/cs.cpp:23: non-zero status: 16: No such device, exiting... mutex destroy failed
/dgl/local/nvshmem_src_2.10.1-3/src/util/cs.cpp:23: non-zero status: 16: No such device, exiting... mutex destroy failed

--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[39538,1],1]
  Exit code:    255
--------------------------------------------------------------------------

I installed it in /usr/local/nvshmem, but the error log refers to my build path.
Here is my build log:

-- NVSHMEM_PREFIX: /usr/local/nvshmem
-- NVSHMEM_DEVEL: OFF
-- NVSHMEM_DEBUG: OFF
-- NVSHMEM_DEFAULT_PMI2: OFF
-- NVSHMEM_DEFAULT_PMIX: OFF
-- NVSHMEM_DEFAULT_UCX: OFF
-- NVSHMEM_DISABLE_COLL_POLL: ON
-- NVSHMEM_ENABLE_ALL_DEVICE_INLINING: OFF
-- NVSHMEM_ENV_ALL: OFF
-- NVSHMEM_GPU_COLL_USE_LDST: OFF
-- NVSHMEM_IBGDA_SUPPORT: ON
-- NVSHMEM_IBDEVX_SUPPORT: OFF
-- NVSHMEM_IBRC_SUPPORT: ON
-- NVSHMEM_LIBFABRIC_SUPPORT: OFF
-- NVSHMEM_MPI_SUPPORT: ON
-- MPI_HOME: /usr/local/ompi
-- NVSHMEM_NVTX: ON
-- NVSHMEM_PMIX_SUPPORT: OFF
-- NVSHMEM_SHMEM_SUPPORT: OFF
-- NVSHMEM_TEST_STATIC_LIB: OFF
-- NVSHMEM_TIMEOUT_DEVICE_POLLING: OFF
-- NVSHMEM_TRACE: OFF
-- NVSHMEM_UCX_SUPPORT: OFF
-- NVSHMEM_USE_DLMALLOC: OFF
-- NVSHMEM_USE_NCCL: ON
-- NCCL_HOME: /opt/conda/envs/pytorch-ci
-- NVSHMEM_USE_GDRCOPY: ON
-- GDRCOPY_HOME: /usr/local/gdrcopy
-- NVSHMEM_VERBOSE: OFF
-- Setting build type to '' as none was specified.
-- CUDA_HOME: /usr/local/cuda
-- The CUDA compiler identification is NVIDIA 11.8.89
-- The CXX compiler identification is GNU 11.4.0
-- The C compiler identification is GNU 11.4.0
-- Detecting CUDA compiler ABI info
-- Detecting CUDA compiler ABI info - done
-- Check for working CUDA compiler: /usr/local/cuda/bin/nvcc - skipped
-- Detecting CUDA compile features
-- Detecting CUDA compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /usr/bin/c++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /usr/bin/cc - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Found CUDAToolkit: /usr/local/cuda/include (found version "11.8.89") 
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Success
-- Found Threads: TRUE  
-- CMAKE_CUDA_ARCHITECTURES: 70;80;90
-- Found MPI_C: /usr/lib/x86_64-linux-gnu/openmpi/lib/libmpi.so (found version "3.1") 
-- Found MPI_CXX: /usr/lib/x86_64-linux-gnu/openmpi/lib/libmpi_cxx.so (found version "3.1") 
-- Found MPI: TRUE (found version "3.1")  
-- Performing Test NVCC_THREADS
-- Performing Test NVCC_THREADS - Success
-- Performing Test HAVE_MLX5DV_UAR_ALLOC_TYPE_NC_DEDICATED
-- Performing Test HAVE_MLX5DV_UAR_ALLOC_TYPE_NC_DEDICATED - Failed
-- Configuring done (7.4s)
-- Generating done (0.2s)
-- Build files have been written to: /dgl/local/nvshmem_src_2.10.1-3/build

How can I fix this?
My nvidia-smi output is:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.105.01   Driver Version: 515.105.01   CUDA Version: 11.8     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100 80G...  Off  | 00000000:31:00.0 Off |                    0 |
| N/A   37C    P0    76W / 300W |   8812MiB / 81920MiB |    100%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A800 80G...  Off  | 00000000:4B:00.0 Off |                    0 |
| N/A   63C    P0   231W / 300W |  45337MiB / 81920MiB |    100%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+

Since you are using the IB transport inside your container environment, can you share some additional output from the following nvidia-smi and ibdev commands, run inside the container?

  • nvidia-smi topo -m
  • ibdev2netdev
  • ibv_devinfo

PS: If you do not wish to use the IB transport to communicate between the 2 GPUs and want to use the P2P transport (NVLink) instead, you can set NVSHMEM_REMOTE_TRANSPORT=None at runtime to bypass initialization of the IB transport.
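
For example, reusing your launch command from above:

NVSHMEM_REMOTE_TRANSPORT=None CUDA_VISIBLE_DEVICES=0,1 mpirun --allow-run-as-root -np 2 ./test

For a single-node run the local ranks inherit this environment from mpirun; on multi-node launches you may need to export it explicitly with mpirun -x NVSHMEM_REMOTE_TRANSPORT=None.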