NVSHMEM on multi-node GPUs

I have built nvshmem 2.6.0 with CUDA 11.0 and tried to run some of the device performance tests. I was able to successfully execute pt-to-pt/shmem_put_bw on 2 GPUs when they are located on the same node. However, the same test fails when I try to run it on 2 GPUs located on two different nodes connected via InfiniBand, with the following error message:

mype: 0 mype_node: 0 device name: NVIDIA A100-PCIE-40GB bus id: 7
mype: 1 mype_node: 0 device name: NVIDIA A100-PCIE-40GB bus id: 7
src/topo/topo.cpp:68: [GPU 0] Peer GPU 1 is not accessible, exiting …
src/init/init.cu:714: non-zero status: 3 building transport map failed
src/init/init.cu:797: non-zero status: 7 nvshmemi_common_init failed …
src/init/init_device.cu:nvshmemi_check_state_and_init:44: nvshmem initialization failed, exiting

I also checked with the sample P2P test that the GPUs distributed over the 2 nodes support P2P memory access:

./simpleP2P
[./simpleP2P] - Starting…
Checking for multiple GPUs…
CUDA-capable device count: 2

Checking GPU(s) for support of peer to peer memory access…
Peer access from NVIDIA A100-PCIE-40GB (GPU0) → NVIDIA A100-PCIE-40GB (GPU1) : Yes
Peer access from NVIDIA A100-PCIE-40GB (GPU1) → NVIDIA A100-PCIE-40GB (GPU0) : Yes
Enabling peer access between GPU0 and GPU1…
Allocating buffers (64MB on GPU0, GPU1 and CPU Host)…
Creating event handles…
cudaMemcpyPeer / cudaMemcpy between GPU0 and GPU1: 13.52GB/s
Preparing host buffer and memcpy to GPU0…
Run kernel on GPU1, taking source data from GPU0 and writing to GPU1…
Run kernel on GPU0, taking source data from GPU1 and writing to GPU0…
Copy data back to host from GPU0 and verify results…
Disabling peer access…
Shutting down…
Test passed

The topology of the GPU connections is as follows:

        GPU0     GPU1    GPU2   GPU3   mlx5_0  CPU Affinity    NUMA Affinity
 GPU0     X      SYS     SYS    SYS     SYS        0-63            N/A
 GPU1    SYS      X      SYS    SYS     SYS        0-63            N/A
 GPU2    SYS     SYS      X     SYS     SYS        0-63            N/A
 GPU3    SYS     SYS     SYS      X     PHB        0-63            N/A 
 mlx5_0  SYS     SYS     SYS    PHB     X

Legend:

X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks

The NVIDIA driver version on the system is 515.06.01. Could you please help me identify why nvshmem fails over multiple nodes?

There’s not enough info here to diagnose the problem.

The P2P test you ran is not multi-node aware and says nothing about multi-node behavior. In fact, GPUs on different nodes do not support the kind of P2P access that sample code demonstrates and tests. (This is indeed one of the benefits of nvshmem: it extends peer-like activity across nodes.)

nvshmem in a multi-node use case requires a properly functioning launch framework using one of the following multi-node launch mechanisms:

  • the nvshmem launcher (nvshmrun), which is based on Hydra
  • a launch based on MPI (e.g., mpirun)
  • a Slurm launch based on PMI

You haven’t shown any details about how you are launching things. Simply having InfiniBand connectivity is not enough: you must also have one of these multi-node launcher frameworks installed and functioning properly, and your launch syntax must be correct for running on multiple nodes.
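For example, a two-node launch with the Hydra-based nvshmrun typically looks something like the sketch below (the hostnames are placeholders, and -n/-ppn/-hosts are the usual Hydra options; check which options your build actually accepts):

# 2 PEs total, 1 per node, on two hosts reachable from the launch node (hypothetical names)
$HYDRA_HOME/bin/nvshmrun -n 2 -ppn 1 -hosts node01,node02 \
    $PERFTEST_INSTALL/device/pt-to-pt/shmem_put_bw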

You may wish to review the install guide. If you skip all optional components, you almost certainly will not have a multi-node capable setup.

Thanks for this rapid response! And of course, sorry for omitting the details listed above. I followed the install instructions and installed Hydra using the provided install_hydra.sh script after building nvshmem. Then I launched the tests with the following syntax:

$HYDRA_HOME/bin/nvshmrun -n $SLURM_NTASKS $PERFTEST_INSTALL/device/pt-to-pt/shmem_put_bw

When I run it on a single node, I get the following output:

+------------------------+----------------------+
| shmem_put_bw           | None                 |
+------------------------+----------------------+
| size (Bytes)           | BW GB/sec            |
+------------------------+----------------------+
| 1024                   | 0.411650             |
| 2048                   | 0.831191             |
| 4096                   | 1.658434             |
| 8192                   | 3.285208             |
| 16384                  | 6.516386             |
| 32768                  | 12.574786            |
| 65536                  | 16.240811            |
| 131072                 | 15.957022            |
| 262144                 | 15.382930            |
| 524288                 | 15.305060            |
| 1048576                | 21.819743            |
| 2097152                | 18.278083            |
| 4194304                | 17.260912            |
| 8388608                | 17.136655            |
| 16777216               | 17.042504             |
| 33554432               | 18.645057            |
+------------------------+----------------------+

However, when launched across two nodes using Hydra and the same syntax, I receive the error from topo.cpp saying that the peer GPU is not accessible, as described in the original post. Upon checking the status of the optional components, I found that the nv_peer_mem and GDRCopy modules are not available on the nodes. Could this be responsible for the multi-node communication failure? If so, which of the optional modules listed in the install instructions are required for multi-node functionality?

I think the remote transport is not getting initialised properly. Can you show the output with the NVSHMEM_DEBUG=TRACE and NVSHMEM_DEBUG_SUBSYS=ALL environment variables set? Is libibverbs.so in LD_LIBRARY_PATH?

You would need nv_peer_mem, but this error is not caused by the lack of it…
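For completeness, a quick way to check both points on each node is sketched below (the GPUDirect RDMA module is named nv_peer_mem in the legacy Mellanox package and nvidia_peermem in newer driver releases; adjust as appropriate):

# is a GPUDirect RDMA kernel module loaded?
lsmod | grep -E 'nv_peer_mem|nvidia_peermem'
# is libibverbs visible to the dynamic linker?
ldconfig -p | grep libibverbs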

Thanks again for the fast response! I am also surprised that the test worked on a single GPU node despite the missing nv_peer_mem module. libibverbs.so is located under /usr/lib64 and has been included in LD_LIBRARY_PATH. When I set the debug environment variables as described above, I still get the same original error output when running over two nodes:

mype: 1 mype_node: 0 device name: NVIDIA A100-PCIE-40GB bus id: 7
mype: 0 mype_node: 0 device name: NVIDIA A100-PCIE-40GB bus id: 7
src/topo/topo.cpp:68: [GPU 0] Peer GPU 1 is not accessible, exiting …
src/init/init.cu:714: non-zero status: 3 building transport map failed
src/init/init.cu:797: non-zero status: 7 nvshmemi_common_init failed …
src/init/init_device.cu:nvshmemi_check_state_and_init:44: nvshmem initialization failed, exiting
src/topo/topo.cpp:68: [GPU 1] Peer GPU 0 is not accessible, exiting …
src/init/init.cu:714: non-zero status: 3 building transport map failed
src/init/init.cu:797: non-zero status: 7 nvshmemi_common_init failed …
src/init/init_device.cu:nvshmemi_check_state_and_init:44: nvshmem initialization failed, exiting

nv_peer_mem is required only when IB is being used, so it is expected that the error does not appear for single-node runs.

How are you setting the debug variables? I think they are not getting passed to the program.
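If they are only set as plain (unexported) shell variables, the launcher and the remote ranks will not see them. A sketch of two ways to forward them, assuming nvshmrun accepts the usual mpiexec.hydra-style options:

# either export them so the launcher can propagate the inherited environment ...
export NVSHMEM_DEBUG=TRACE
export NVSHMEM_DEBUG_SUBSYS=ALL
# ... or pass them explicitly per variable with the Hydra-style -genv flag
$HYDRA_HOME/bin/nvshmrun -genv NVSHMEM_DEBUG TRACE -genv NVSHMEM_DEBUG_SUBSYS ALL \
    -n $SLURM_NTASKS $PERFTEST_INSTALL/device/pt-to-pt/shmem_put_bw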

I set the debug variables before running the test with Hydra:

NVSHMEM_DEBUG=TRACE
NVSHMEM_DEBUG_SUBSYS=ALL
NVSHMEM_DEBUG_FILE=nvdebug

$HYDRA_HOME/bin/nvshmrun -n $SLURM_NTASKS $PERFTEST_INSTALL/device/pt-to-pt/shmem_put_bw

I also rebuilt nvshmem and Hydra after setting NVSHMEM_DEBUG=TRACE, but I still get the same output. At least it seems clear now that the multi-node error is due to the missing nv_peer_mem module.

Hello,

I have tried something similar: I compiled the nvshmem perf test ‘alltoall’ and tried to run it on two 8-A100 nodes, but I still get the following errors:

/data02/home/wenlei.bao/nvshmem/src/modules/transport/ibgda/ibgda.cpp:735: non-zero status: 800 cudaHostRegister failed.

/data02/home/wenlei.bao/nvshmem/src/modules/transport/ibgda/ibgda.cpp:1385: non-zero status: 7 ibgda_nic_mem_gpu_map failed.

/data02/home/wenlei.bao/nvshmem/src/modules/transport/ibgda/ibgda.cpp:2760: non-zero status: 7 ibgda_alloc_and_map_qp_uar failed.

I built with IBGDA enabled, and below are some of the environment variables I set:

export NVSHMEM_IBGDA_NUM_REQUESTS_IN_BATCH=32
export NVSHMEM_IBGDA_NUM_RC_PER_PE=1
export NVSHMEM_IBGDA_NUM_DCI=1
export NVSHMEM_IBGDA_SUPPORT_GPUMEM_ONLY=0
export NVSHMEM_IBGDA_DCI_MAP_BY=cta
export NVSHMEM_IBGDA_RC_MAP_BY=warp
export NVSHMEM_IB_ENABLE_IBGDA=1
export NVSHMEM_IBGDA_SUPPORT=1
export NVSHMEM_IB_GID_INDEX=3

Any suggestions would be helpful!
Thanks.

Hi,

My guess is that the “PeerMappingOverride” regkey is not set on your system. Can you check with the command below?

$ cat /proc/driver/nvidia/params
...
RegistryDwords: "PeerMappingOverride=1;"
...

Please look at the “(Optional) InfiniBand GPUDirect Async (IBGDA) transport” bullet in the NVSHMEM Installation Guide (nvshmem 2.10.1 documentation) for more detail.
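If it is not set, the guide describes enabling it through the NVIDIA driver’s module options; roughly something like the sketch below (the file name is just an example, and the nvidia module has to be reloaded, or the node rebooted, before the parameter takes effect):

# add the regkey to the nvidia kernel module options (example file name)
echo 'options nvidia NVreg_RegistryDwords="PeerMappingOverride=1;"' | \
    sudo tee /etc/modprobe.d/nvidia-peermapping.conf
# then reload the driver (or reboot) and re-check /proc/driver/nvidia/params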