Hi, we are running NVSHMEM 3.0.6 with libfabric EFA on AWS EC2 p4d instances (two instances, 8×A100 each). We are trying to run our project across GPU0 on each of the two nodes; the project works with the same setup on a single node using GPU0 and GPU1.
While the setup passes NVSHMEM examples like put-on-block and dev-guide-ring-mpi (run on GPU0 of each node), our project hits a runtime error.
There isn't much in the log apart from the following assertion failure:
python3: /home/ubuntu/packages/nvshmem_src_3.0.6-4/src/include/internal/host/nvshmem_internal.h:142: void nvshmemi_get_local_mem_handle(nvshmem_mem_handle_t**, size_t*, void*, int): Assertion `*handle != NULL' failed.
We are wondering what this error usually suggests.
Flags we used to run our script (the exact same flags we used to run the simple NVSHMEM examples):
mpirun -hostfile hostfile.txt -x NVSHMEMTEST_USE_MPI_LAUNCHER=1 -x LD_LIBRARY_PATH=$LD_LIBRARY_PATH -x FI_EFA_USE_DEVICE_RDMA=1 -x NVSHMEM_LIBFABRIC_PROVIDER=efa -x NVSHMEM_REMOTE_TRANSPORT=libfabric -x NVSHMEM_DISABLE_CUDA_VMM=1 -x NVSHMEM_DEBUG=INFO -x PYTHONPATH=$PYTHONPATH /home/ubuntu/miniconda3/bin/python3 test_script.py
Thank you!
Full output with NVSHMEM_DEBUG=INFO is below, with ip-A and ip-B denoting the two nodes:
ip-A:3935432:3935432 [0] NVSHMEM INFO PE distribution has been identified as NVSHMEMI_PE_DIST_ROUNDROBIN
NVSHMEM configuration:
CUDA API 12010
CUDA Runtime 12010
CUDA Driver 12020
Build Timestamp Jul 16 2024 17:12:43
Build Variables
NVSHMEM_DEBUG=OFF NVSHMEM_DEVEL=OFF NVSHMEM_DEFAULT_PMI2=OFF
NVSHMEM_DEFAULT_PMIX=OFF NVSHMEM_DEFAULT_UCX=OFF NVSHMEM_DISABLE_COLL_POLL=ON
NVSHMEM_ENABLE_ALL_DEVICE_INLINING=OFF NVSHMEM_GPU_COLL_USE_LDST=OFF
NVSHMEM_IBGDA_SUPPORT=OFF NVSHMEM_IBGDA_SUPPORT_GPUMEM_ONLY=OFF
NVSHMEM_IBDEVX_SUPPORT=OFF NVSHMEM_IBRC_SUPPORT=ON NVSHMEM_LIBFABRIC_SUPPORT=ON
NVSHMEM_MPI_SUPPORT=ON NVSHMEM_NVTX=ON NVSHMEM_PMIX_SUPPORT=OFF
NVSHMEM_SHMEM_SUPPORT=OFF NVSHMEM_TEST_STATIC_LIB=OFF
NVSHMEM_TIMEOUT_DEVICE_POLLING=OFF NVSHMEM_TRACE=OFF NVSHMEM_UCX_SUPPORT=OFF
NVSHMEM_USE_DLMALLOC=OFF NVSHMEM_USE_NCCL=ON NVSHMEM_USE_GDRCOPY=ON
NVSHMEM_VERBOSE=OFF CUDA_HOME=/usr/local/cuda GDRCOPY_HOME=/usr/local/gdrdrv
LIBFABRIC_HOME=/opt/amazon/efa MPI_HOME=/opt/amazon/openmpi
NCCL_HOME=/opt/nccl/build NVSHMEM_PREFIX=/home/ubuntu/nvshmem306 PMIX_HOME=/usr
SHMEM_HOME=/opt/amazon/openmpi UCX_HOME=/usr/local/ucx
ip-A:3935432:3935432 [0] NVSHMEM INFO PE 0 (process) affinity to 6 CPUs:
0 1 2 48 49 50
ip-A:3935432:3935432 [0] NVSHMEM INFO cudaDriverVersion 12020
ip-A:3935432:3935432 [0] NVSHMEM INFO NVSHMEM symmetric heap kind = DEVICE selected
ip-A:3935432:3935432 [0] NVSHMEM INFO [0] nvshmemi_get_cucontext->cuCtxSynchronize->CUDA_SUCCESS) my_stream (nil)
ip-A:3935432:3935432 [0] NVSHMEM INFO in get_cucontext, queried and saved context for device: 0 context: 0x5675547b7660
ip-B:659861:659861 [0] NVSHMEM INFO PE distribution has been identified as NVSHMEMI_PE_DIST_ROUNDROBIN
ip-B:659861:659861 [0] NVSHMEM INFO PE 1 (process) affinity to 6 CPUs:
0 1 2 48 49 50
ip-B:659861:659861 [0] NVSHMEM INFO cudaDriverVersion 12020
ip-B:659861:659861 [0] NVSHMEM INFO NVSHMEM symmetric heap kind = DEVICE selected
ip-B:659861:659861 [0] NVSHMEM INFO [1] nvshmemi_get_cucontext->cuCtxSynchronize->CUDA_SUCCESS) my_stream (nil)
ip-B:659861:659861 [0] NVSHMEM INFO in get_cucontext, queried and saved context for device: 0 context: 0x61bae2073d50
ip-A:3935432:3935432 [0] NVSHMEM INFO [0] Created stream 0x567555f63e00 for device 0
ip-A:3935432:3935432 [0] NVSHMEM INFO NVML library found. libnvidia-ml.so.1
ip-B:659861:659861 [0] NVSHMEM INFO [1] Created stream 0x61bae38ed520 for device 0
ip-B:659861:659861 [0] NVSHMEM INFO NVML library found. libnvidia-ml.so.1
ip-A:3935432:3935432 [0] NVSHMEM INFO team psync mem req 499712 bytes, team mem total req 127926952 bytes, max teams 32
ip-B:659861:659861 [0] NVSHMEM INFO team psync mem req 499712 bytes, team mem total req 127926952 bytes, max teams 32
ip-B:659861:659861 [0] NVSHMEM INFO IBRC transport skipped in favor of: libfabric
/home/ubuntu/packages/nvshmem_src_3.0.6-4/src/modules/transport/common/transport_gdr_common.cpp 73 GDR driver version: (2, 4)
/home/ubuntu/packages/nvshmem_src_3.0.6-4/src/modules/transport/libfabric/libfabric.cpp 1549 Interface did not return NIC structure to fi_getinfo. Skipping.
ip-A:3935432:3935432 [0] NVSHMEM INFO IBRC transport skipped in favor of: libfabric
/home/ubuntu/packages/nvshmem_src_3.0.6-4/src/modules/transport/common/transport_gdr_common.cpp 73 GDR driver version: (2, 4)
/home/ubuntu/packages/nvshmem_src_3.0.6-4/src/modules/transport/libfabric/libfabric.cpp 1549 Interface did not return NIC structure to fi_getinfo. Skipping.
/home/ubuntu/packages/nvshmem_src_3.0.6-4/src/modules/transport/libfabric/libfabric.cpp 1163 Selected provider efa, fabric efa, nic rdmap16s27, hmem yes
/home/ubuntu/packages/nvshmem_src_3.0.6-4/src/modules/transport/libfabric/libfabric.cpp 1163 Selected provider efa, fabric efa, nic rdmap16s27, hmem yes
ip-A:3935432:3935432 [0] NVSHMEM INFO [0] status 0 cudaErrorInvalidValue 1 cudaErrorInvalidSymbol 13 cudaErrorInvalidMemcpyDirection 21 cudaErrorNoKernelImageForDevice 209
ip-B:659861:659861 [0] NVSHMEM INFO [1] status 0 cudaErrorInvalidValue 1 cudaErrorInvalidSymbol 13 cudaErrorInvalidMemcpyDirection 21 cudaErrorNoKernelImageForDevice 209
ip-A:3935432:3935432 [0] NVSHMEM INFO P2P list: 0
ip-B:659861:659861 [0] NVSHMEM INFO P2P list: 1
ip-A:3935432:3935432 [0] NVSHMEM INFO NVSHMEM_TEAM_SHARED: start=0, stride=1, size=1
ip-A:3935432:3935432 [0] NVSHMEM INFO NVSHMEMX_TEAM_NODE: start=0, stride=1, size=1
ip-B:659861:659861 [0] NVSHMEM INFO NVSHMEM_TEAM_SHARED: start=1, stride=1, size=1
ip-A:3935432:3935432 [0] NVSHMEM INFO NVSHMEMX_TEAM_SAME_MYPE_NODE: start=0, stride=1, size=2
ip-B:659861:659861 [0] NVSHMEM INFO NVSHMEMX_TEAM_NODE: start=1, stride=1, size=1
ip-B:659861:659861 [0] NVSHMEM INFO NVSHMEMX_TEAM_SAME_MYPE_NODE: start=0, stride=1, size=2
ip-A:3935432:3935432 [0] NVSHMEM INFO NVSHMEMI_TEAM_SAME_GPU: start=0, stride=1, size=1
ip-B:659861:659861 [0] NVSHMEM INFO NVSHMEMI_TEAM_SAME_GPU: start=1, stride=1, size=1
ip-A:3935432:3935432 [0] NVSHMEM INFO NVSHMEMI_TEAM_GPU_LEADERS: start=0, stride=1, size=2
ip-B:659861:659861 [0] NVSHMEM INFO NVSHMEMI_TEAM_GPU_LEADERS: start=0, stride=1, size=2
python3: /home/ubuntu/packages/nvshmem_src_3.0.6-4/src/include/internal/host/nvshmem_internal.h:142: void nvshmemi_get_local_mem_handle(nvshmem_mem_handle_t**, size_t*, void*, int): Assertion `*handle != NULL' failed.
[ip-A:3935432] *** Process received signal ***
[ip-A:3935432] Signal: Aborted (6)
[ip-A:3935432] Signal code: (-6)
[ip-A:3935432] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x42520)[0x74634a042520]
[ip-A:3935432] [ 1] /lib/x86_64-linux-gnu/libc.so.6(pthread_kill+0x12c)[0x74634a0969fc]
[ip-A:3935432] [ 2] /lib/x86_64-linux-gnu/libc.so.6(raise+0x16)[0x74634a042476]
[ip-A:3935432] [ 3] /lib/x86_64-linux-gnu/libc.so.6(abort+0xd3)[0x74634a0287f3]
[ip-A:3935432] [ 4] /lib/x86_64-linux-gnu/libc.so.6(+0x2871b)[0x74634a02871b]
[ip-A:3935432] [ 5] /lib/x86_64-linux-gnu/libc.so.6(+0x39e96)[0x74634a039e96]
[ip-A:3935432] [ 6] /home/ubuntu/nvshmem306/lib/libnvshmem_host.so.3(+0x100c2c)[0x7462d5500c2c]
[ip-A:3935432] [ 7] /home/ubuntu/nvshmem306/lib/libnvshmem_host.so.3(+0x100d70)[0x7462d5500d70]
[ip-A:3935432] [ 8] /home/ubuntu/nvshmem306/lib/libnvshmem_host.so.3(+0x1001da)[0x7462d55001da]
[ip-A:3935432] [ 9] /lib/x86_64-linux-gnu/libc.so.6(+0x94ac3)[0x74634a094ac3]
[ip-A:3935432] [10] /lib/x86_64-linux-gnu/libc.so.6(+0x126850)[0x74634a126850]
[ip-A:3935432] *** End of error message ***
Edit: more information. Node A performs three NVSHMEMI_OP_PUT operations successfully (note that the remote pointer and the local pointer are the same). However, the subsequent NVSHMEMI_OP_GET fails (the local and remote pointers printed are different). Below are logs from print statements we added to the NVSHMEM source code to demonstrate this:
ip-A progress_channels NVSHMEMI_OP_PUT
ip-A process_channel_dma, rptr: 0x100310d8000, lptr: 0x100310d8000
ip-A nvshmemi_process_multisend_rma
ip-A transport_idx: 1, mype: 0
ip-A progress_channels NVSHMEMI_OP_PUT
ip-A process_channel_dma, rptr: 0x10031c48000, lptr: 0x10031c48000
ip-A nvshmemi_process_multisend_rma
ip-A transport_idx: 1, mype: 0
ip-A progress_channels NVSHMEMI_OP_PUT
ip-A process_channel_dma, rptr: 0x100323e8000, lptr: 0x100323e8000
ip-A nvshmemi_process_multisend_rma
ip-A transport_idx: 1, mype: 0
ip-A progress_channels NVSHMEMI_OP_GET
ip-A process_channel_dma, rptr: 0x10039801200, lptr: 0x7cfab5400c00
ip-A nvshmemi_process_multisend_rma
ip-A transport_idx: 1, mype: 0
python3: /home/ubuntu/packages/nvshmem_src_3.0.6-4/src/include/internal/host/nvshmem_internal.h:142: void nvshmemi_get_local_mem_handle(nvshmem_mem_handle_t**, size_t*, void*, int): Assertion `*handle != NULL' failed.
The remote pointer is an address we allocated with nvshmem_malloc(), and the local pointer is an address we allocated with cudaMalloc(). The error occurs when ip-A calls nvshmemx_getmem_nbi_block().
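For reference, here is a minimal sketch (not our actual code; buffer names, sizes, and launch parameters are made up) of the call pattern that hits the assert: the GET's source comes from nvshmem_malloc(), while its destination is plain cudaMalloc() memory rather than the symmetric heap.

#include <mpi.h>
#include <cuda_runtime.h>
#include <nvshmem.h>
#include <nvshmemx.h>

__global__ void pull_from_peer(float *local_dst, const float *sym_src,
                               size_t nbytes, int peer) {
    // Block-scoped non-blocking GET: destination is local device memory,
    // source is a symmetric-heap address resolved on the remote PE.
    nvshmemx_getmem_nbi_block(local_dst, sym_src, nbytes, peer);
    nvshmem_quiet();
}

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    MPI_Comm comm = MPI_COMM_WORLD;
    nvshmemx_init_attr_t attr;
    attr.mpi_comm = &comm;

    cudaSetDevice(0);  // GPU0 on each node, as in our runs
    nvshmemx_init_attr(NVSHMEMX_INIT_WITH_MPI_COMM, &attr);

    const size_t nbytes = 1024 * sizeof(float);  // placeholder size

    // Symmetric buffer: the remote side of the GET (from nvshmem_malloc).
    float *sym = (float *)nvshmem_malloc(nbytes);
    // Local destination: ordinary device memory, NOT on the symmetric heap.
    float *local = nullptr;
    cudaMalloc(&local, nbytes);

    int peer = 1 - nvshmem_my_pe();  // two PEs, one per node
    pull_from_peer<<<1, 128>>>(local, sym, nbytes, peer);
    cudaDeviceSynchronize();

    cudaFree(local);
    nvshmem_free(sym);
    nvshmem_finalize();
    MPI_Finalize();
    return 0;
}

This sketch would be launched with the same mpirun flags as above (one PE per node).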