What settings are needed to enable NVSHMEM on a multi-node, multi-GPU platform with IB?

I’m running my NVSHMEM application on an HPC cluster managed by SLURM, which has InfiniBand for fast cross-node communication. However, my application runs very slowly in multi-node cases with my current settings, much slower than with NCCL.
My current NVSHMEM setup is based on Open MPI and UCX.

openMPI configuration:
./configure --with-cuda --enable-mpi-cxx --with-slurm --with-ucx=~/ucx-1.17.0  --prefix=~/mylib/openmpi-4.1.1

UCX configuration:
../configure --prefix=~/mylib/ucx-1.17.0 --enable-mt --with-dm --with-cuda=/usr/local/cuda

NVSHMEM build script:

export NVSHMEM_USE_GDRCOPY=0
export NVSHMEM_IBRC_SUPPORT=1
export NVSHMEM_UCX_SUPPORT=1
export UCX_HOME=~/mylib/ucx-1.17.0
export CUDA_HOME=/usr/local/cuda/
export NVSHMEM_MPI_SUPPORT=1
export MPI_HOME=~/mylib/openmpi-4.1.1/
export NVSHMEM_DEFAULT_PMI2=1
export NVSHMEM_PREFIX=~/mylib/nvshmem-3.1.7

The SLURM script I use is:

#!/bin/bash  
#SBATCH -p g078t2  
#SBATCH -N 2  
#SBATCH --ntasks=4  
#SBATCH --ntasks-per-node=2  
#SBATCH --gres=gpu:2  
#SBATCH --time=120:00  
#SBATCH --comment=idmg_bupt  

mpirun -np $SLURM_NTASKS $APP_PATH

Is it because my UCX is configured incorrectly? Could NVSHMEM be falling back to TCP for cross-node data exchange?

As indicated by the figure below, I have three choices. Which is the fastest way to communicate?

[figure: the three communication choices (not reproduced)]

To clarify, I do have Mellanox OFED and nvidia-peermem enabled.
So, how should I build my openMPI, UCX, and NVSHMEM?
Or, how should I build my openMPI and NVSHMEM without UCX?

You say “slow”. Can you share performance numbers (e.g., NVSHMEM perf_test) and how they compare with the inter-node bandwidth of your NIC?

…And which is the fastest way to communicate?

Answering the above: libfabric has always reached peak inter-node bandwidth on the supercomputers I work with.
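For context, NVSHMEM can be built with a libfabric remote transport much like the UCX one. A rough sketch of the relevant knobs is below; variable names follow the NVSHMEM install guide (double-check for your version), and the paths are placeholders, not taken from this thread:

# Sketch only: libfabric remote transport for NVSHMEM (paths are placeholders)
export NVSHMEM_LIBFABRIC_SUPPORT=1
export LIBFABRIC_HOME=/path/to/libfabric
# at run time, select the transport explicitly
export NVSHMEM_REMOTE_TRANSPORT=libfabric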

Here is the perf test script I wrote myself.

#include <iostream>
#include <stdio.h>
#include <mpi.h>
#include <nvshmem.h>
#include <nvshmemx.h>
#include "utils.cuh"

__global__ void warp_copy_kernel(float *dst, float *src, int dim, int p)
{
    int idx = blockIdx.x;
    nvshmemx_float_get_warp((float *)&dst[idx * dim],
                            &src[idx * dim], dim, p);
}

void warp_copy(float *dst, float *src, int nrows, int dim, int p, cudaStream_t stream)
{
    dim3 blockDim(32, 1, 1);
    dim3 gridDim(nrows, 1, 1);
    nvshmemx_barrier_all_on_stream(stream);
    warp_copy_kernel<<<gridDim, blockDim, 0, stream>>>(dst, src, dim, p);
    nvshmemx_barrier_all_on_stream(stream);
}

int main(int argc, char *argv[])
{
    int dim = atoi(argv[1]);
    int rank, nranks;
    cudaStream_t stream;
    nvshmemx_init_attr_t attr;
    MPI_Comm mpi_comm = MPI_COMM_WORLD;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);
    attr.mpi_comm = &mpi_comm;
    printf("mpi_info: rank: %d, nranks: %d\n", rank, nranks);
    // Set up NVSHMEM device.
    nvshmemx_init_attr(NVSHMEMX_INIT_WITH_MPI_COMM, &attr);
    int mype_node = nvshmem_my_pe();
    int local_gpu_num = 0;
    cudaGetDeviceCount(&local_gpu_num);
    cudaSetDevice(mype_node % local_gpu_num);
    printf("PE-%d, local_gpu_num: %d, local_gpu_id: %d\n", mype_node, local_gpu_num, mype_node % local_gpu_num);
    cudaStreamCreate(&stream);

    // test bandwidth of using nvshmem for cross-GPU data copy
    float *buffer0 = nullptr, *buffer1 = nullptr;
    int iter = 2000, warmup = 100;
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    int pow = 15;
    for (int i = 5; i < pow; i++)
    {
        int nrows = 2 << i;
        int perf_size = nrows * dim;
        if (mype_node == 0)
            printf("nrows: %d, dim: %d, perf_size: %d\n", nrows, dim, perf_size);
        float ms = 0;
        nvshmemx_barrier_all_on_stream(stream);
        gpuErrchk(cudaStreamSynchronize(stream));
        buffer0 = (float *)nvshmem_malloc(perf_size * sizeof(float));
        buffer1 = (float *)nvshmem_malloc(perf_size * sizeof(float));
        gpuErrchk(cudaMemset(buffer0, 0, perf_size * sizeof(float)));
        gpuErrchk(cudaMemset(buffer1, 0, perf_size * sizeof(float)));
        gpuErrchk(cudaStreamSynchronize(stream));
        // copy data from buffer0 to buffer1
        // copy buffer0(PE0) to buffer1(PE1)
        // copy buffer0(PE1) to buffer1(PE0)
        for (int j = 0; j < warmup; j++)
        {
            nvshmemx_float_get_on_stream(buffer1, buffer0, perf_size, (mype_node + 1) % nranks, stream);
            nvshmemx_barrier_all_on_stream(stream);
        }
        gpuErrchk(cudaStreamSynchronize(stream));
        cudaEventRecord(start, stream);
        for (int j = 0; j < iter; j++)
        {
            nvshmemx_float_get_on_stream(buffer1, buffer0, perf_size, (mype_node + 1) % nranks, stream);
            nvshmemx_barrier_all_on_stream(stream);
        }
        nvshmemx_quiet_on_stream(stream);
        gpuErrchk(cudaStreamSynchronize(stream));
        gpuErrchk(cudaEventRecord(stop, stream));
        gpuErrchk(cudaEventSynchronize(stop));
        gpuErrchk(cudaEventElapsedTime(&ms, start, stop));
        // data_size is in MB (printed in KB below); bandwidth is in MB/s
        if (mype_node == 0)
        {
            float data_size = 2 * static_cast<float>(perf_size) * sizeof(float) / 1024 / 1024;
            float bandwidth = data_size * iter / ms * 1000;
            printf("nv_host_copy perf_size: %d elems, data_size: %.1f KB, bandwidth: %.3f MB/s\n", perf_size, data_size * 1024, bandwidth);
        }

        // nvshmem device-side warp copy
        // warmup
        for (int j = 0; j < warmup; j++)
        {
            warp_copy(buffer1, buffer0, nrows, dim, (mype_node + 1) % nranks, stream);
            nvshmemx_barrier_all_on_stream(stream);
        }
        gpuErrchk(cudaStreamSynchronize(stream));
        cudaEventRecord(start, stream);
        for (int j = 0; j < iter; j++)
        {
            warp_copy(buffer1, buffer0, nrows, dim, (mype_node + 1) % nranks, stream);
            nvshmemx_barrier_all_on_stream(stream);
        }
        nvshmemx_quiet_on_stream(stream);
        gpuErrchk(cudaStreamSynchronize(stream));
        gpuErrchk(cudaEventRecord(stop, stream));
        gpuErrchk(cudaEventSynchronize(stop));
        gpuErrchk(cudaEventElapsedTime(&ms, start, stop));
        if (mype_node == 0)
        {
            float data_size = 2 * static_cast<float>(perf_size) * sizeof(float) / 1024 / 1024;
            float bandwidth = data_size * iter / ms * 1000;
            printf("nv_device_copy perf_size: %d elems, data_size: %.1f KB, bandwidth: %.3f MB/s\n", perf_size, data_size * 1024, bandwidth);
        }

        nvshmem_free(buffer0);
        nvshmem_free(buffer1);
    }

    nvshmem_finalize();
    MPI_Finalize();

    if (mype_node == 0)
        printf("=================Done!=================\n");
    return 0;
}

When running this test on two nodes with two GPUs, the host-side nvshmemx_float_get_on_stream achieves the expected optimal bandwidth (16 GB/s), which matches my InfiniBand link. The GPU-side warp_copy, however, cannot reach this bandwidth; it is stuck at around 64 MB/s. Since my application relies on GPU-side data copies, it suffers severe performance degradation when running on two nodes with multiple GPUs. To clarify, the same application runs very fast on a single node with multiple GPUs. Is this normal behavior, or have I configured my environment incorrectly? Does GPU-side data copy over a cross-node interconnect (e.g., InfiniBand) inherently have high latency?

For reference, here is part of my test output:

nrows: 128, dim: 32, perf_size: 4096
nv_host_copy perf_size: 4096 elems, data_size: 32.0 KB, bandwidth: 622.998 MB/s
nv_device_copy perf_size: 4096 elems, data_size: 32.0 KB, bandwidth: 79.315 MB/s
nrows: 256, dim: 32, perf_size: 8192
nv_host_copy perf_size: 8192 elems, data_size: 64.0 KB, bandwidth: 1209.587 MB/s
nv_device_copy perf_size: 8192 elems, data_size: 64.0 KB, bandwidth: 87.977 MB/s
nrows: 512, dim: 32, perf_size: 16384
nv_host_copy perf_size: 16384 elems, data_size: 128.0 KB, bandwidth: 2047.404 MB/s
nv_device_copy perf_size: 16384 elems, data_size: 128.0 KB, bandwidth: 87.772 MB/s
nrows: 1024, dim: 32, perf_size: 32768
nv_host_copy perf_size: 32768 elems, data_size: 256.0 KB, bandwidth: 3633.369 MB/s
nv_device_copy perf_size: 32768 elems, data_size: 256.0 KB, bandwidth: 68.729 MB/s
nrows: 2048, dim: 32, perf_size: 65536
nv_host_copy perf_size: 65536 elems, data_size: 512.0 KB, bandwidth: 6358.160 MB/s
nv_device_copy perf_size: 65536 elems, data_size: 512.0 KB, bandwidth: 66.602 MB/s
nrows: 4096, dim: 32, perf_size: 131072
nv_host_copy perf_size: 131072 elems, data_size: 1024.0 KB, bandwidth: 8842.070 MB/s
nv_device_copy perf_size: 131072 elems, data_size: 1024.0 KB, bandwidth: 64.663 MB/s
nrows: 8192, dim: 32, perf_size: 262144
nv_host_copy perf_size: 262144 elems, data_size: 2048.0 KB, bandwidth: 12345.572 MB/s
nv_device_copy perf_size: 262144 elems, data_size: 2048.0 KB, bandwidth: 63.868 MB/s
nrows: 16384, dim: 32, perf_size: 524288
nv_host_copy perf_size: 524288 elems, data_size: 4096.0 KB, bandwidth: 14776.895 MB/s
nv_device_copy perf_size: 524288 elems, data_size: 4096.0 KB, bandwidth: 63.069 MB/s
nrows: 32768, dim: 32, perf_size: 1048576
nv_host_copy perf_size: 1048576 elems, data_size: 8192.0 KB, bandwidth: 16232.806 MB/s
nv_device_copy perf_size: 1048576 elems, data_size: 8192.0 KB, bandwidth: 62.359 MB/s

Of course, performance would be bad! Here’s why:

  1. Less important, but your warp_copy path includes two more barriers than the *_get_on_stream path.
  2. For inter-node communication, you need to batch requests to the NIC for good performance. The proxy thread that NVSHMEM uses processes all of the parallel requests you issue from your kernel serially; in your case, rather than one single request you send nrows requests (one per block), and the proxy thread has to process every single one. Note that NVSHMEM already recommends this batching in its Best Practices guide (see the sketch below).

Make these changes and feel free to report the performance results.
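To make point 2 concrete, here is a rough sketch of the kind of batching I mean, assuming the rows each block fetches are contiguous and live on the same remote PE (adapt it to your layout; this is my illustration, not code from your application):

#include <nvshmem.h>
#include <nvshmemx.h>

// Sketch only: each block issues ONE warp-level get covering a chunk of rows,
// so the proxy thread sees gridDim.x requests instead of nrows requests.
__global__ void chunked_copy_kernel(float *dst, const float *src,
                                    int nrows, int dim, int peer)
{
    int rows_per_block = (nrows + gridDim.x - 1) / gridDim.x;
    int first = blockIdx.x * rows_per_block;
    int count = min(rows_per_block, nrows - first);
    if (count <= 0)
        return;
    // One request for count*dim contiguous floats (launch with 32 threads per block).
    nvshmemx_float_get_warp(&dst[(size_t)first * dim],
                            &src[(size_t)first * dim],
                            (size_t)count * dim, peer);
}

// e.g. chunked_copy_kernel<<<8, 32, 0, stream>>>(dst, src, nrows, dim, peer);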

What if I use GPUDirect RDMA and InfiniBand GPUDirect Async (IBGDA), as suggested in “Improving Network Performance of HPC Systems Using NVIDIA Magnum IO NVSHMEM and GPUDirect Async” on the NVIDIA Technical Blog, so that the CPU-side proxy is moved to the GPU? I have tried to configure my environment as suggested there.
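For reference, the gist of the IBGDA-related settings from that post is sketched below; this is only an illustrative summary of the documented knobs, not my exact build script.

# Illustrative only, per the blog post / NVSHMEM docs (not my exact script).
# The NVIDIA driver must also allow peer mappings, e.g. load nvidia.ko with
# NVreg_RegistryDwords="PeerMappingOverride=1;"
export NVSHMEM_IBGDA_SUPPORT=1      # build time: compile the IBGDA transport
export NVSHMEM_IB_ENABLE_IBGDA=1    # run time: use IBGDA instead of the CPU proxy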

However, the performance is still very slow, as demonstrated before.

Have you rewritten your kernel to submit only one message request? Do that, benchmark it, and report back.

Yes, as mentioned here, IBGDA will help over the proxy-based transport because you have more QPs. Yet NVSHMEM still recommends issuing fewer, larger requests even with this configuration; see here.

Recall that a GPU binds to at most one NIC; let’s assume your hardware has a 1:1 ratio with no oversubscription. How the NIC operates internally is beyond the scope of this reply, but intuitively, even with multiple QPs in IBGDA, they all still issue their requests to a single NIC, which I presume has limited capacity to process the thousands of requests that your kernel issues.

In other words, the transport is not what needs to be changed here, it is the data transfer logic of your kernel.

Thanks a lot for your kind reply!
As suggested, I modified my kernel:

void warp_copy(float *dst, float *src, int nrows, int dim, int p, cudaStream_t stream)
{
    dim3 blockDim(32, 1, 1);
    dim3 gridDim(nrows, 1, 1);
    // nvshmemx_barrier_all_on_stream(stream);
    // warp_copy_kernel<<<gridDim, blockDim, 0, stream>>>(dst, src, dim, p);
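    // Single block of one warp: the whole dim * nrows buffer is fetched with one get request.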
    warp_copy_kernel<<<1, blockDim, 0, stream>>>(dst, src, dim * nrows, p);
    // nvshmemx_barrier_all_on_stream(stream);
}

The bandwidth did go up:

nv_host_copy perf_size: 2048 elems, data_size: 16.0 KB, bandwidth: 206.335 MB/s
nv_device_copy perf_size: 2048 elems, data_size: 16.0 KB, bandwidth: 292.895 MB/s
nrows: 128, dim: 32, perf_size: 4096
nv_host_copy perf_size: 4096 elems, data_size: 32.0 KB, bandwidth: 501.225 MB/s
nv_device_copy perf_size: 4096 elems, data_size: 32.0 KB, bandwidth: 576.903 MB/s
nrows: 256, dim: 32, perf_size: 8192
nv_host_copy perf_size: 8192 elems, data_size: 64.0 KB, bandwidth: 919.156 MB/s
nv_device_copy perf_size: 8192 elems, data_size: 64.0 KB, bandwidth: 1078.684 MB/s
nrows: 512, dim: 32, perf_size: 16384
nv_host_copy perf_size: 16384 elems, data_size: 128.0 KB, bandwidth: 1745.919 MB/s
nv_device_copy perf_size: 16384 elems, data_size: 128.0 KB, bandwidth: 2095.916 MB/s
nrows: 1024, dim: 32, perf_size: 32768
nv_host_copy perf_size: 32768 elems, data_size: 256.0 KB, bandwidth: 3147.968 MB/s
nv_device_copy perf_size: 32768 elems, data_size: 256.0 KB, bandwidth: 3946.313 MB/s
nrows: 2048, dim: 32, perf_size: 65536
nv_host_copy perf_size: 65536 elems, data_size: 512.0 KB, bandwidth: 5290.957 MB/s
nv_device_copy perf_size: 65536 elems, data_size: 512.0 KB, bandwidth: 6207.373 MB/s
nrows: 4096, dim: 32, perf_size: 131072
nv_host_copy perf_size: 131072 elems, data_size: 1024.0 KB, bandwidth: 8357.475 MB/s
nv_device_copy perf_size: 131072 elems, data_size: 1024.0 KB, bandwidth: 8583.883 MB/s
nrows: 8192, dim: 32, perf_size: 262144
nv_host_copy perf_size: 262144 elems, data_size: 2048.0 KB, bandwidth: 10971.473 MB/s
nv_device_copy perf_size: 262144 elems, data_size: 2048.0 KB, bandwidth: 11906.105 MB/s
nrows: 16384, dim: 32, perf_size: 524288
nv_host_copy perf_size: 524288 elems, data_size: 4096.0 KB, bandwidth: 13540.871 MB/s
nv_device_copy perf_size: 524288 elems, data_size: 4096.0 KB, bandwidth: 13888.890 MB/s
nrows: 32768, dim: 32, perf_size: 1048576
nv_host_copy perf_size: 1048576 elems, data_size: 8192.0 KB, bandwidth: 14949.735 MB/s
nv_device_copy perf_size: 1048576 elems, data_size: 8192.0 KB, bandwidth: 15157.218 MB/s

However, my application involves random accesses to memory on remote PEs, and it is challenging to batch multiple random inter-node requests within a GPU kernel. For this reason, I first benchmarked nv_device_copy with each warp requesting dim remote elements. Even with IBGDA, such fine-grained inter-node transfers (dim elements per request) remain highly inefficient, far below the NIC bandwidth. Does this mean IBGDA cannot automatically batch these transfer requests to improve overall efficiency?

I once observed very fast fine-grained inter-node data transfer over IBGDA, but the log file was accidentally deleted. Unfortunately, this high-performance result has not been reproducible ever since. This makes me question whether it was due to an incorrect observation, errors in my kernel code, or improper hardware/software environment configuration.

I want to confirm whether fine-grained inter-node data transfer is inherently very slow, regardless of the transport method used (IBGDA or IBRC). If this is indeed the case, I can focus on optimizing my transfer logic.

In principle, yes, such fine-grained, non-uniform data transfer patterns will yield poor performance regardless of the transport.

You need to really think about the logic of your code and maybe consider storing data locally to avoid communicating over the network.

Also, maybe consider leveraging NVLink instead? The bandwidth of such fine-grained transfers will still be poor, but much better than over InfiniBand.
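For example, something along these lines. This is a sketch only, with a hypothetical index array rows and one warp (32 threads) per block; whether nvshmem_ptr returns a usable pointer depends on whether the peer is P2P-reachable on your system.

#include <nvshmem.h>
#include <nvshmemx.h>

// Sketch only: if the target PE is reachable over NVLink/PCIe P2P, nvshmem_ptr()
// returns a directly dereferenceable pointer and the warp can use plain loads,
// bypassing the NIC entirely; otherwise fall back to a warp-level get.
__global__ void gather_rows_kernel(float *dst, float *src_sym, const int *rows,
                                   int nrows, int dim, int peer)
{
    float *remote = (float *)nvshmem_ptr(src_sym, peer);
    for (int r = blockIdx.x; r < nrows; r += gridDim.x) {
        int row = rows[r];  // hypothetical random row index on the remote PE
        if (remote != NULL) {
            // Intra-node P2P/NVLink path: plain loads, no work request posted
            for (int k = threadIdx.x; k < dim; k += blockDim.x)
                dst[(size_t)r * dim + k] = remote[(size_t)row * dim + k];
        } else {
            // Inter-node path: still one request per row, i.e. the slow case
            nvshmemx_float_get_warp(&dst[(size_t)r * dim],
                                    &src_sym[(size_t)row * dim], dim, peer);
        }
    }
}

This only helps when the peer is inside the node; for truly remote peers you are back to one request per row, which is why keeping hot data local is the more robust fix.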

How would you propose doing this? If the data accesses are not amenable to coalescing, then there is no other automatic optimization that can be done.

I have not confirmed this from the implementation yet, but it would not be surprising if work queue items (transfer requests) are treated separately when the queue is processed, so there may not even be coalescing across queue items, even when those transfers would be amenable to it. All the more reason why batching into fewer requests is the best practice.