NVSHMEM fails to finalize

I use NVSHMEM in my program, but it suddenly fails to finalize.
To test it, I use the Attribute-Based Initialization Example from Examples — NVSHMEM 2.10.1 documentation.
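For reference, the test program is essentially the documented attribute-based example; here is a condensed version from memory of the 2.10.1 docs, so minor details may differ from my exact copy:

#include <stdio.h>
#include <cuda_runtime.h>
#include <mpi.h>
#include <nvshmem.h>
#include <nvshmemx.h>

/* Each PE writes its own rank into the symmetric buffer of the next PE. */
__global__ void simple_shift(int *destination) {
    int mype = nvshmem_my_pe();
    int npes = nvshmem_n_pes();
    int peer = (mype + 1) % npes;
    nvshmem_int_p(destination, mype, peer);
}

int main(int argc, char *argv[]) {
    int rank, nranks, msg;
    cudaStream_t stream;
    MPI_Comm mpi_comm = MPI_COMM_WORLD;
    nvshmemx_init_attr_t attr;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    /* Attribute-based initialization on top of an existing MPI communicator. */
    attr.mpi_comm = &mpi_comm;
    nvshmemx_init_attr(NVSHMEMX_INIT_WITH_MPI_COMM, &attr);

    cudaSetDevice(nvshmem_team_my_pe(NVSHMEMX_TEAM_NODE));
    cudaStreamCreate(&stream);

    int *destination = (int *) nvshmem_malloc(sizeof(int));

    simple_shift<<<1, 1, 0, stream>>>(destination);
    nvshmemx_barrier_all_on_stream(stream);
    cudaMemcpyAsync(&msg, destination, sizeof(int), cudaMemcpyDeviceToHost, stream);
    cudaStreamSynchronize(stream);
    printf("%d: received message %d\n", nvshmem_my_pe(), msg);

    nvshmem_free(destination);
    nvshmem_finalize();   /* <-- this is the call that fails */
    MPI_Finalize();
    return 0;
}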
When I run it, it raises:

/dvs/p4/build/sw/rel/gpgpu/toolkit/r11.8/main_nvshmem/src/host/init/init.cu:register_state_ptr:93: Redundant common pointer registered, ignoring.
/dvs/p4/build/sw/rel/gpgpu/toolkit/r11.8/main_nvshmem/src/host/init/init.cu:register_state_ptr:93: Redundant common pointer registered, ignoring.
1: received message 0
0: received message 1
/dvs/p4/build/sw/rel/gpgpu/toolkit/r11.8/main_nvshmem/src/host/init/init.cu:1051: non-zero status: 1 Invalid context pointer passed to nvshmemx_host_finalize.
/dvs/p4/build/sw/rel/gpgpu/toolkit/r11.8/main_nvshmem/src/host/init/init.cu:nvshmemx_host_finalize:1128: /dvs/p4/build/sw/rel/gpgpu/toolkit/r11.8/main_nvshmem/src/host/init/init.cu:1051: non-zero status: 1 Invalid context pointer passed to nvshmemx_host_finalize.
/dvs/p4/build/sw/rel/gpgpu/toolkit/r11.8/main_nvshmem/src/host/init/init.cu:nvshmemx_host_finalize:1128: aborting due to error in nvshmem_finalize
aborting due to error in nvshmem_finalize

--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[2802,1],1]
  Exit code:    255
--------------------------------------------------------------------------

If I do not call nvshmem_finalize and nvshmem_free, the error disappears.
How can I fix it?

Hi,

This was a regression in 2.10.1-3 and is fixed in our repo. Our 2.11 release (TBA) will contain the fix.

Thanks, Seth

For now, the only workaround is the one you mentioned in your comment: removing the nvshmem_finalize call avoids the error at finalization.
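Concretely, against the example above, the tail of main() would look something like this for now (just a sketch; the skipped cleanup is reclaimed when the process exits):

    /* Workaround for the 2.10.1-3 finalize regression:
     * skip explicit NVSHMEM cleanup and let process teardown reclaim
     * the symmetric allocation and CUDA resources. */
    // nvshmem_free(destination);
    // nvshmem_finalize();
    MPI_Finalize();
    return 0;
}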
Thanks for bringing this to our attention!

Now I am using NVSHMEM in a Docker container.
I build and install NVSHMEM with this environment:

export CUDA_HOME=/usr/local/cuda
export GDRCOPY_HOME=/usr/local/gdrcopy
export NVSHMEM_MPI_SUPPORT=1
export MPI_HOME=/usr
# export MPICC=/usr/bin/mpicc
export NVSHMEM_USE_NCCL=1
export NCCL_HOME=/opt/conda/envs/pytorch-ci
export NVSHMEM_IBGDA_SUPPORT=1
export NVSHMEM_BOOTSTRAP_PMI=0
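
and then configure and build roughly like this (a sketch from memory; the exact cmake flags may differ, but NVSHMEM_PREFIX matches the build log below):

cd /dgl/local/nvshmem_src_2.10.1-3
mkdir -p build && cd build
cmake -DNVSHMEM_PREFIX=/usr/local/nvshmem ..
make -j$(nproc) install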

But when I run the official example with

CUDA_VISIBLE_DEVICES=0,1 mpirun --allow-run-as-root -np 2 ./test
it raises:

mpirun --allow-run-as-root -np 2 ./mpi-based-init
/dgl/local/nvshmem_src_2.10.1-3/src/modules/transport/ibrc/ibrc.cpp:420: non-zero status: 19 ibv_modify_qp failed
/dgl/local/nvshmem_src_2.10.1-3/src/modules/transport/ibrc/ibrc.cpp:481: non-zero status: 7 ep_connect failed
/dgl/local/nvshmem_src_2.10.1-3/src/modules/transport/ibrc/ibrc.cpp:1397: non-zero status: 7 cst setup failed
/dgl/local/nvshmem_src_2.10.1-3/src/modules/transport/ibrc/ibrc.cpp:420: non-zero status: 19 ibv_modify_qp failed
/dgl/local/nvshmem_src_2.10.1-3/src/modules/transport/ibrc/ibrc.cpp:481: non-zero status: 7 ep_connect failed
/dgl/local/nvshmem_src_2.10.1-3/src/modules/transport/ibrc/ibrc.cpp:1397: non-zero status: 7 cst setup failed
/dgl/local/nvshmem_src_2.10.1-3/src/modules/transport/ibrc/ibrc.cpp:420: non-zero status: 19 ibv_modify_qp failed
/dgl/local/nvshmem_src_2.10.1-3/src/modules/transport/ibrc/ibrc.cpp:1430: non-zero status: 7 ep_connect failed
/dgl/local/nvshmem_src_2.10.1-3/src/modules/transport/ibrc/ibrc.cpp:1484: non-zero status: 7 transport create connect failed
/dgl/local/nvshmem_src_2.10.1-3/src/modules/transport/ibrc/ibrc.cpp:420: non-zero status: 19 ibv_modify_qp failed
/dgl/local/nvshmem_src_2.10.1-3/src/modules/transport/ibrc/ibrc.cpp:1430: non-zero status: 7 ep_connect failed
/dgl/local/nvshmem_src_2.10.1-3/src/modules/transport/ibrc/ibrc.cpp:1484: non-zero status: 7 transport create connect failed
/dgl/local/nvshmem_src_2.10.1-3/src/host/transport/transport.cpp:348: non-zero status: 7 endpoint connection failed
/dgl/local/nvshmem_src_2.10.1-3/src/host/transport/transport.cpp:348: non-zero status: 7 endpoint connection failed
/dgl/local/nvshmem_src_2.10.1-3/src/host/init/init.cu:852: non-zero status: 7 nvshmem setup connections failed
/dgl/local/nvshmem_src_2.10.1-3/src/host/init/init.cu:nvshmemi_check_state_and_init:933: nvshmem initialization failed, exiting
/dgl/local/nvshmem_src_2.10.1-3/src/host/init/init.cu:852: non-zero status: 7 nvshmem setup connections failed
/dgl/local/nvshmem_src_2.10.1-3/src/host/init/init.cu:nvshmemi_check_state_and_init:933: nvshmem initialization failed, exiting
/dgl/local/nvshmem_src_2.10.1-3/src/util/cs.cpp:23: non-zero status: 16: No such device, exiting... mutex destroy failed
/dgl/local/nvshmem_src_2.10.1-3/src/util/cs.cpp:23: non-zero status: 16: No such device, exiting... mutex destroy failed

--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[39538,1],1]
  Exit code:    255
--------------------------------------------------------------------------

I installed it in /usr/local/nvshmem, but the error log refers to my build path.
Here is my build log:

-- NVSHMEM_PREFIX: /usr/local/nvshmem
-- NVSHMEM_DEVEL: OFF
-- NVSHMEM_DEBUG: OFF
-- NVSHMEM_DEFAULT_PMI2: OFF
-- NVSHMEM_DEFAULT_PMIX: OFF
-- NVSHMEM_DEFAULT_UCX: OFF
-- NVSHMEM_DISABLE_COLL_POLL: ON
-- NVSHMEM_ENABLE_ALL_DEVICE_INLINING: OFF
-- NVSHMEM_ENV_ALL: OFF
-- NVSHMEM_GPU_COLL_USE_LDST: OFF
-- NVSHMEM_IBGDA_SUPPORT: ON
-- NVSHMEM_IBDEVX_SUPPORT: OFF
-- NVSHMEM_IBRC_SUPPORT: ON
-- NVSHMEM_LIBFABRIC_SUPPORT: OFF
-- NVSHMEM_MPI_SUPPORT: ON
-- MPI_HOME: /usr/local/ompi
-- NVSHMEM_NVTX: ON
-- NVSHMEM_PMIX_SUPPORT: OFF
-- NVSHMEM_SHMEM_SUPPORT: OFF
-- NVSHMEM_TEST_STATIC_LIB: OFF
-- NVSHMEM_TIMEOUT_DEVICE_POLLING: OFF
-- NVSHMEM_TRACE: OFF
-- NVSHMEM_UCX_SUPPORT: OFF
-- NVSHMEM_USE_DLMALLOC: OFF
-- NVSHMEM_USE_NCCL: ON
-- NCCL_HOME: /opt/conda/envs/pytorch-ci
-- NVSHMEM_USE_GDRCOPY: ON
-- GDRCOPY_HOME: /usr/local/gdrcopy
-- NVSHMEM_VERBOSE: OFF
-- Setting build type to '' as none was specified.
-- CUDA_HOME: /usr/local/cuda
-- The CUDA compiler identification is NVIDIA 11.8.89
-- The CXX compiler identification is GNU 11.4.0
-- The C compiler identification is GNU 11.4.0
-- Detecting CUDA compiler ABI info
-- Detecting CUDA compiler ABI info - done
-- Check for working CUDA compiler: /usr/local/cuda/bin/nvcc - skipped
-- Detecting CUDA compile features
-- Detecting CUDA compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /usr/bin/c++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /usr/bin/cc - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Found CUDAToolkit: /usr/local/cuda/include (found version "11.8.89") 
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Success
-- Found Threads: TRUE  
-- CMAKE_CUDA_ARCHITECTURES: 70;80;90
-- Found MPI_C: /usr/lib/x86_64-linux-gnu/openmpi/lib/libmpi.so (found version "3.1") 
-- Found MPI_CXX: /usr/lib/x86_64-linux-gnu/openmpi/lib/libmpi_cxx.so (found version "3.1") 
-- Found MPI: TRUE (found version "3.1")  
-- Performing Test NVCC_THREADS
-- Performing Test NVCC_THREADS - Success
-- Performing Test HAVE_MLX5DV_UAR_ALLOC_TYPE_NC_DEDICATED
-- Performing Test HAVE_MLX5DV_UAR_ALLOC_TYPE_NC_DEDICATED - Failed
-- Configuring done (7.4s)
-- Generating done (0.2s)
-- Build files have been written to: /dgl/local/nvshmem_src_2.10.1-3/build

How can I fix this?
My nvidia-smi output is:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.105.01   Driver Version: 515.105.01   CUDA Version: 11.8     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100 80G...  Off  | 00000000:31:00.0 Off |                    0 |
| N/A   37C    P0    76W / 300W |   8812MiB / 81920MiB |    100%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A800 80G...  Off  | 00000000:4B:00.0 Off |                    0 |
| N/A   63C    P0   231W / 300W |  45337MiB / 81920MiB |    100%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+

Since you are using the IB transport inside your container environment, can you share some additional output from the following nvidia-smi and ibdev commands, run inside the container?

  • nvidia-smi topo -m
  • ibdev2netdev
  • ibv_devinfo

PS: If you do not wish to use the IB transport to communicate between the 2 GPUs and want to use the P2P transport (NVLink) instead, you can set NVSHMEM_REMOTE_TRANSPORT=None at runtime to bypass initialization of the IB transport.
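
For example, reusing your launch command from above:

NVSHMEM_REMOTE_TRANSPORT=None CUDA_VISIBLE_DEVICES=0,1 mpirun --allow-run-as-root -np 2 ./test

For a single-node run the local ranks inherit this environment from mpirun; on multi-node launches you may need to export it explicitly with mpirun -x NVSHMEM_REMOTE_TRANSPORT=None.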