3.0.6 libfabric EFA nvshmemi_get_local_mem_handle(): Assertion `*handle != NULL' failed

Hi, we are running nvshmem 3.0.6 with libfabric EFA on AWS EC2 p4d instances (2 instances of 8x A100). We are trying to run our project across GPU 0 on each of the two nodes (the project works with the same setup on a single node, using GPU 0 and GPU 1).

While the setup passes nvshmem examples such as put-on-block and dev-guide-ring-mpi (GPU 0 on each node), it runs into a runtime error in our project.

There isn’t much in the log beyond the following assertion failure:

python3: /home/ubuntu/packages/nvshmem_src_3.0.6-4/src/include/internal/host/nvshmem_internal.h:142: void nvshmemi_get_local_mem_handle(nvshmem_mem_handle_t**, size_t*, void*, int): Assertion `*handle != NULL' failed.

We are wondering what this error usually suggests.

Flags we used to run our script (the exact same flags run the simple nvshmem examples successfully):

mpirun -hostfile hostfile.txt -x NVSHMEMTEST_USE_MPI_LAUNCHER=1 -x LD_LIBRARY_PATH=$LD_LIBRARY_PATH -x FI_EFA_USE_DEVICE_RDMA=1 -x NVSHMEM_LIBFABRIC_PROVIDER=efa -x NVSHMEM_REMOTE_TRANSPORT=libfabric  -x NVSHMEM_DISABLE_CUDA_VMM=1 -x NVSHMEM_DEBUG=INFO -x PYTHONPATH=$PYTHONPATH  /home/ubuntu/miniconda3/bin/python3 test_script.py

Thank you!

Full output with NVSHMEM_DEBUG=INFO is below, with ip-A and ip-B denoting the two nodes:

ip-A:3935432:3935432 [0] NVSHMEM INFO PE distribution has been identified as NVSHMEMI_PE_DIST_ROUNDROBIN
NVSHMEM configuration:
  CUDA API                     12010
  CUDA Runtime                 12010
  CUDA Driver                  12020
  Build Timestamp              Jul 16 2024 17:12:43
  Build Variables             
        NVSHMEM_DEBUG=OFF NVSHMEM_DEVEL=OFF NVSHMEM_DEFAULT_PMI2=OFF
        NVSHMEM_DEFAULT_PMIX=OFF NVSHMEM_DEFAULT_UCX=OFF NVSHMEM_DISABLE_COLL_POLL=ON
        NVSHMEM_ENABLE_ALL_DEVICE_INLINING=OFF NVSHMEM_GPU_COLL_USE_LDST=OFF
        NVSHMEM_IBGDA_SUPPORT=OFF NVSHMEM_IBGDA_SUPPORT_GPUMEM_ONLY=OFF
        NVSHMEM_IBDEVX_SUPPORT=OFF NVSHMEM_IBRC_SUPPORT=ON NVSHMEM_LIBFABRIC_SUPPORT=ON
        NVSHMEM_MPI_SUPPORT=ON NVSHMEM_NVTX=ON NVSHMEM_PMIX_SUPPORT=OFF
        NVSHMEM_SHMEM_SUPPORT=OFF NVSHMEM_TEST_STATIC_LIB=OFF
        NVSHMEM_TIMEOUT_DEVICE_POLLING=OFF NVSHMEM_TRACE=OFF NVSHMEM_UCX_SUPPORT=OFF
        NVSHMEM_USE_DLMALLOC=OFF NVSHMEM_USE_NCCL=ON NVSHMEM_USE_GDRCOPY=ON
        NVSHMEM_VERBOSE=OFF CUDA_HOME=/usr/local/cuda GDRCOPY_HOME=/usr/local/gdrdrv
        LIBFABRIC_HOME=/opt/amazon/efa MPI_HOME=/opt/amazon/openmpi
        NCCL_HOME=/opt/nccl/build NVSHMEM_PREFIX=/home/ubuntu/nvshmem306 PMIX_HOME=/usr
        SHMEM_HOME=/opt/amazon/openmpi UCX_HOME=/usr/local/ucx

ip-A:3935432:3935432 [0] NVSHMEM INFO PE 0 (process) affinity to 6 CPUs:
     0  1  2 48 49 50 
ip-A:3935432:3935432 [0] NVSHMEM INFO cudaDriverVersion 12020
ip-A:3935432:3935432 [0] NVSHMEM INFO NVSHMEM symmetric heap kind = DEVICE selected
ip-A:3935432:3935432 [0] NVSHMEM INFO [0] nvshmemi_get_cucontext->cuCtxSynchronize->CUDA_SUCCESS) my_stream (nil)
ip-A:3935432:3935432 [0] NVSHMEM INFO in get_cucontext, queried and saved context for device: 0 context: 0x5675547b7660
ip-B:659861:659861 [0] NVSHMEM INFO PE distribution has been identified as NVSHMEMI_PE_DIST_ROUNDROBIN
ip-B:659861:659861 [0] NVSHMEM INFO PE 1 (process) affinity to 6 CPUs:
     0  1  2 48 49 50 
ip-B:659861:659861 [0] NVSHMEM INFO cudaDriverVersion 12020
ip-B:659861:659861 [0] NVSHMEM INFO NVSHMEM symmetric heap kind = DEVICE selected
ip-B:659861:659861 [0] NVSHMEM INFO [1] nvshmemi_get_cucontext->cuCtxSynchronize->CUDA_SUCCESS) my_stream (nil)
ip-B:659861:659861 [0] NVSHMEM INFO in get_cucontext, queried and saved context for device: 0 context: 0x61bae2073d50
ip-A:3935432:3935432 [0] NVSHMEM INFO [0] Created stream 0x567555f63e00 for device 0
ip-A:3935432:3935432 [0] NVSHMEM INFO NVML library found. libnvidia-ml.so.1
ip-B:659861:659861 [0] NVSHMEM INFO [1] Created stream 0x61bae38ed520 for device 0
ip-B:659861:659861 [0] NVSHMEM INFO NVML library found. libnvidia-ml.so.1
ip-A:3935432:3935432 [0] NVSHMEM INFO team psync mem req 499712 bytes, team mem total req 127926952 bytes, max teams 32

ip-B:659861:659861 [0] NVSHMEM INFO team psync mem req 499712 bytes, team mem total req 127926952 bytes, max teams 32

ip-B:659861:659861 [0] NVSHMEM INFO IBRC transport skipped in favor of: libfabric

/home/ubuntu/packages/nvshmem_src_3.0.6-4/src/modules/transport/common/transport_gdr_common.cpp 73 GDR driver version: (2, 4)
/home/ubuntu/packages/nvshmem_src_3.0.6-4/src/modules/transport/libfabric/libfabric.cpp 1549 Interface did not return NIC structure to fi_getinfo. Skipping.

ip-A:3935432:3935432 [0] NVSHMEM INFO IBRC transport skipped in favor of: libfabric

/home/ubuntu/packages/nvshmem_src_3.0.6-4/src/modules/transport/common/transport_gdr_common.cpp 73 GDR driver version: (2, 4)
/home/ubuntu/packages/nvshmem_src_3.0.6-4/src/modules/transport/libfabric/libfabric.cpp 1549 Interface did not return NIC structure to fi_getinfo. Skipping.

/home/ubuntu/packages/nvshmem_src_3.0.6-4/src/modules/transport/libfabric/libfabric.cpp 1163 Selected provider efa, fabric efa, nic rdmap16s27, hmem yes
/home/ubuntu/packages/nvshmem_src_3.0.6-4/src/modules/transport/libfabric/libfabric.cpp 1163 Selected provider efa, fabric efa, nic rdmap16s27, hmem yes
ip-A:3935432:3935432 [0] NVSHMEM INFO [0] status 0 cudaErrorInvalidValue 1 cudaErrorInvalidSymbol 13 cudaErrorInvalidMemcpyDirection 21 cudaErrorNoKernelImageForDevice 209
ip-B:659861:659861 [0] NVSHMEM INFO [1] status 0 cudaErrorInvalidValue 1 cudaErrorInvalidSymbol 13 cudaErrorInvalidMemcpyDirection 21 cudaErrorNoKernelImageForDevice 209
ip-A:3935432:3935432 [0] NVSHMEM INFO P2P list: 0 
ip-B:659861:659861 [0] NVSHMEM INFO P2P list: 1 
ip-A:3935432:3935432 [0] NVSHMEM INFO NVSHMEM_TEAM_SHARED: start=0, stride=1, size=1
ip-A:3935432:3935432 [0] NVSHMEM INFO NVSHMEMX_TEAM_NODE: start=0, stride=1, size=1
ip-B:659861:659861 [0] NVSHMEM INFO NVSHMEM_TEAM_SHARED: start=1, stride=1, size=1
ip-A:3935432:3935432 [0] NVSHMEM INFO NVSHMEMX_TEAM_SAME_MYPE_NODE: start=0, stride=1, size=2
ip-B:659861:659861 [0] NVSHMEM INFO NVSHMEMX_TEAM_NODE: start=1, stride=1, size=1
ip-B:659861:659861 [0] NVSHMEM INFO NVSHMEMX_TEAM_SAME_MYPE_NODE: start=0, stride=1, size=2
ip-A:3935432:3935432 [0] NVSHMEM INFO NVSHMEMI_TEAM_SAME_GPU: start=0, stride=1, size=1
ip-B:659861:659861 [0] NVSHMEM INFO NVSHMEMI_TEAM_SAME_GPU: start=1, stride=1, size=1
ip-A:3935432:3935432 [0] NVSHMEM INFO NVSHMEMI_TEAM_GPU_LEADERS: start=0, stride=1, size=2
ip-B:659861:659861 [0] NVSHMEM INFO NVSHMEMI_TEAM_GPU_LEADERS: start=0, stride=1, size=2
python3: /home/ubuntu/packages/nvshmem_src_3.0.6-4/src/include/internal/host/nvshmem_internal.h:142: void nvshmemi_get_local_mem_handle(nvshmem_mem_handle_t**, size_t*, void*, int): Assertion `*handle != NULL' failed.
[ip-A:3935432] *** Process received signal ***
[ip-A:3935432] Signal: Aborted (6)
[ip-A:3935432] Signal code:  (-6)
[ip-A:3935432] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x42520)[0x74634a042520]
[ip-A:3935432] [ 1] /lib/x86_64-linux-gnu/libc.so.6(pthread_kill+0x12c)[0x74634a0969fc]
[ip-A:3935432] [ 2] /lib/x86_64-linux-gnu/libc.so.6(raise+0x16)[0x74634a042476]
[ip-A:3935432] [ 3] /lib/x86_64-linux-gnu/libc.so.6(abort+0xd3)[0x74634a0287f3]
[ip-A:3935432] [ 4] /lib/x86_64-linux-gnu/libc.so.6(+0x2871b)[0x74634a02871b]
[ip-A:3935432] [ 5] /lib/x86_64-linux-gnu/libc.so.6(+0x39e96)[0x74634a039e96]
[ip-A:3935432] [ 6] /home/ubuntu/nvshmem306/lib/libnvshmem_host.so.3(+0x100c2c)[0x7462d5500c2c]
[ip-A:3935432] [ 7] /home/ubuntu/nvshmem306/lib/libnvshmem_host.so.3(+0x100d70)[0x7462d5500d70]
[ip-A:3935432] [ 8] /home/ubuntu/nvshmem306/lib/libnvshmem_host.so.3(+0x1001da)[0x7462d55001da]
[ip-A:3935432] [ 9] /lib/x86_64-linux-gnu/libc.so.6(+0x94ac3)[0x74634a094ac3]
[ip-A:3935432] [10] /lib/x86_64-linux-gnu/libc.so.6(+0x126850)[0x74634a126850]
[ip-A:3935432] *** End of error message ***

Edit: more information. Node A performs three NVSHMEMI_OP_PUT operations successfully (note that the remote pointer and local pointer are the same). However, on the first NVSHMEMI_OP_GET it fails (the local and remote pointers printed are different).

Logs we added to the nvshmem source code demonstrate this:

ip-A progress_channels NVSHMEMI_OP_PUT
ip-A process_channel_dma, rptr: 0x100310d8000, lptr: 0x100310d8000
ip-A nvshmemi_process_multisend_rma
ip-A transport_idx: 1, mype: 0
ip-A progress_channels NVSHMEMI_OP_PUT
ip-A process_channel_dma, rptr: 0x10031c48000, lptr: 0x10031c48000
ip-A nvshmemi_process_multisend_rma
ip-A transport_idx: 1, mype: 0
ip-A progress_channels NVSHMEMI_OP_PUT
ip-A process_channel_dma, rptr: 0x100323e8000, lptr: 0x100323e8000
ip-A nvshmemi_process_multisend_rma
ip-A transport_idx: 1, mype: 0

ip-A progress_channels NVSHMEMI_OP_GET
ip-A process_channel_dma, rptr: 0x10039801200, lptr: 0x7cfab5400c00
ip-A nvshmemi_process_multisend_rma
ip-A transport_idx: 1, mype: 0
python3: /home/ubuntu/packages/nvshmem_src_3.0.6-4/src/include/internal/host/nvshmem_internal.h:142: void nvshmemi_get_local_mem_handle(nvshmem_mem_handle_t**, size_t*, void*, int): Assertion `*handle != NULL' failed.

The remote pointer is an address we allocated with nvshmem_malloc(), and the local pointer is an address we allocated with cudaMalloc(). The error occurs when ip-A calls nvshmemx_getmem_nbi_block().
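For reference, here is a minimal CUDA sketch of the pattern that triggers the assertion for us. The buffer size, kernel name, and launch configuration are made up for illustration; this is not our actual code.

#include <cuda_runtime.h>
#include <nvshmem.h>
#include <nvshmemx.h>

// Block-scoped non-blocking GET: pulls bytes from the peer's symmetric buffer
// into a destination pointer that is local to this PE.
__global__ void pull_kernel(float *local_dst, const float *sym_src, size_t nbytes, int peer) {
    nvshmemx_getmem_nbi_block(local_dst, sym_src, nbytes, peer);
    nvshmem_quiet();  // wait for the non-blocking transfer to complete
}

int main() {
    nvshmem_init();
    int mype = nvshmem_my_pe();

    size_t n = 1 << 20;
    float *sym_src = (float *)nvshmem_malloc(n * sizeof(float));  // symmetric heap
    float *local_dst = NULL;
    cudaMalloc(&local_dst, n * sizeof(float));                    // NOT on the symmetric heap

    // Over a remote transport (libfabric/EFA here), the proxy needs a registered
    // memory handle for local_dst; without one, nvshmemi_get_local_mem_handle()
    // hits the `*handle != NULL' assertion.
    pull_kernel<<<1, 128>>>(local_dst, sym_src, n * sizeof(float), 1 - mype);
    cudaDeviceSynchronize();

    nvshmem_free(sym_src);
    cudaFree(local_dst);
    nvshmem_finalize();
    return 0;
}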

The issue seems to be resolved. We had not called nvshmemx_buffer_register() on the local operands that are cudaMalloc()'d. We had missed the following part of the FAQ:

Note that NVSHMEM requires the local and remote pointer to both be symmetric for communication with a remote peer connected by InfiniBand. If the remote peer is P2P accessible (PCI-E or NVLink), the local pointer can be obtained using cudaMalloc and is not required to be from the symmetric heap.
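For anyone hitting the same assertion: what fixed it for us, sketched below with hypothetical names and sizes, was registering the cudaMalloc()'d local buffer with NVSHMEM before using it as the local operand of a remote transfer, and unregistering it before freeing.

float *local_dst = NULL;
cudaMalloc(&local_dst, n * sizeof(float));

// Register the buffer so the libfabric/EFA transport can obtain a memory handle
// for it when it is used as the local side of a put/get to a remote peer.
if (nvshmemx_buffer_register(local_dst, n * sizeof(float)) != 0) {
    // handle registration failure
}

// ... kernels calling nvshmemx_getmem_nbi_block(local_dst, sym_src, nbytes, peer) ...

nvshmemx_buffer_unregister(local_dst);  // unregister before freeing
cudaFree(local_dst);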

