cudaMemcpyPeerAsync behavior for different hardware

I am seeing different behavior from cudaMemcpyPeerAsync depending on the hardware I am using (exact same code, hardware of the same major generation, the same executable built for architecture sm_80). The first system is a DGXA100, thus with NVLink. The second is a node with 8 RTX A5000 GPUs connected to an Intel CPU, without NVLink. The behavior is as expected on the former, while the latter fails to produce the correct result.

The code performs the equivalent of an MPI Alltoall across all devices, where each device sends (and receives) a fixed-size buffer to (from) every device. I first check and enable peer-to-peer access across devices using cudaDeviceCanAccessPeer and cudaDeviceEnablePeerAccess. The communication then looks as follows (a minimal reproducible example is provided below):

for (int srcDevice=0; srcDevice<num_devices; ++srcDevice) {
    cudaSetDevice(srcDevice);
    for (int dstDevice=0; dstDevice<num_devices; ++dstDevice) {
        if (dstDevice==srcDevice) {
            cudaMemcpyAsync(
                &d_b[dstDevice][n*srcDevice],
                &d_a[srcDevice][n*dstDevice],
                n*sizeof(int),
                cudaMemcpyDeviceToDevice,
                streams[srcDevice]);
        } else {
            cudaMemcpyPeerAsync(
                &d_b[dstDevice][n*srcDevice],
                dstDevice,
                &d_a[srcDevice][n*dstDevice],
                srcDevice,
                n*sizeof(int),
                streams[srcDevice]);
        }
    }
}

for (int dev_id=0; dev_id<num_devices; ++dev_id) {
    cudaStreamSynchronize(streams[dev_id]);
}

Note that I implicitly assume that cudaStreamSynchronize(streams[dev_id]) returns only once all destination devices have entirely received the buffers coming from device dev_id (which issued its copies on streams[dev_id]), not merely once the source device dev_id has finished sending. I don't know the details of how this works, since I am choosing the stream on the source device, not on the destination device.
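
If that assumption turns out to be wrong, an alternative I could switch to (just a sketch, not what the reproducer below does; doneEvents[] is a hypothetical per-device event array) would be to record an event on each source stream and make the destination devices' streams wait on those events before synchronizing:

// Sketch only: make each destination device explicitly wait for the copies
// issued by every source device, instead of relying on the source-side
// cudaStreamSynchronize alone. doneEvents[] is hypothetical.
cudaEvent_t doneEvents[num_devices];
for (int srcDevice=0; srcDevice<num_devices; ++srcDevice) {
    cudaSetDevice(srcDevice);
    cudaEventCreateWithFlags(&doneEvents[srcDevice], cudaEventDisableTiming);
    // ... issue the cudaMemcpyAsync / cudaMemcpyPeerAsync calls shown above ...
    cudaEventRecord(doneEvents[srcDevice], streams[srcDevice]);
}
for (int dstDevice=0; dstDevice<num_devices; ++dstDevice) {
    cudaSetDevice(dstDevice);
    for (int srcDevice=0; srcDevice<num_devices; ++srcDevice) {
        // cudaStreamWaitEvent works across devices: streams[dstDevice] will not
        // proceed until the copies recorded on streams[srcDevice] have completed.
        cudaStreamWaitEvent(streams[dstDevice], doneEvents[srcDevice], 0);
    }
    cudaStreamSynchronize(streams[dstDevice]);
}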

On the DGXA100 system, everything works perfectly. On the A5000 node, the code runs without errors but the results are wrong. In fact, I observe that the peer-to-peer copies don't actually happen (the buffers are never transferred between devices). Moreover, the profiler timeline looks strange.

DGXA100 profiler timeline (2 GPUs): [screenshot]

DGXA100 profiler timeline (3 GPUs): [screenshot]

A5000 profiler timeline (2 GPUs): [screenshot]

A5000 profiler timeline (3 GPUs): [screenshot]

Question

Is there anything special I should take care of on the A5000 node, making it different from the DGXA100 system, that could explain why the code works on the DGXA100 but fails to produce the correct answer on the A5000 node?

Sub-question

Why is there around 50 ms of “nothing” on the timeline of the A5000 system, between the peer-to-peer copies and the on-device copy (the big blank area(s) in the figures above)? From the CUDA API row of the profiler, it looks like cudaStreamSynchronize is waiting for work to finish, but this work (which here can only be memcopies) is not captured by the profiler.

What I tried

I tried putting several cudaDeviceSynchronize() calls around the communication steps. I also tried pausing the host right after issuing a memcopy, using std::this_thread::sleep_for(std::chrono::milliseconds(5000)), to give the CUDA calls time to complete (on both the send and the receive side). No effect.
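
Something else that could narrow it down (a sketch I have not folded into the reproducer; h_check would be a hypothetical host buffer of n ints) is a single blocking cudaMemcpyPeer between one pair of devices, followed by a device-to-host read-back, to take streams out of the picture entirely:

// Sketch only: one synchronous peer copy from device 0 to device 1,
// then copy the destination buffer back to the host for inspection.
// h_check is a hypothetical host array of n ints.
CUDA_CHECK_ERROR( cudaMemcpyPeer(d_b[1], 1, d_a[0], 0, n*sizeof(int)) );
CUDA_CHECK_ERROR( cudaSetDevice(1) );
CUDA_CHECK_ERROR( cudaMemcpy(h_check, d_b[1], n*sizeof(int), cudaMemcpyDeviceToHost) );
// if h_check does not contain the values written by device 0, the peer
// transfer itself (not the stream handling) is what fails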

Topology

Here is the output of nvidia-smi topo -m on the A5000 node (here on an allocation with 2 out of the 8 GPUs):

        GPU0    GPU1    NIC0    NIC1    CPU Affinity    NUMA Affinity
GPU0     X      PXB     SYS     SYS     0-1             0-1
GPU1    PXB      X      SYS     SYS     0-1             0-1
NIC0    SYS     SYS      X      PIX
NIC1    SYS     SYS     PIX      X

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

NIC Legend:

  NIC0: mlx5_0
  NIC1: mlx5_1

Full reproducible example:

/*
Compilation: nvcc -arch=sm_80 main.cu -o main.x
(nvcc version 12.1.1)
*/
#include <cuda_runtime.h>
#include <iostream>
#include <stdio.h>

inline void gpuCheck(cudaError_t code, const char *file, const int line, const bool abort=true) {
    if (code != cudaSuccess) 
    {
        fprintf(stderr, "GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);
        if (abort) {
            fprintf(stderr, "aborting\n");
            std::abort();
        }
    }
}
#define CUDA_CHECK_ERROR(code) { gpuCheck((code), __FILE__, __LINE__); }

inline void gpuCheckLast(const char *file, const int line) {
    gpuCheck(cudaPeekAtLastError(), file, line);
    gpuCheck(cudaDeviceSynchronize(), file, line);
}
#define CUDA_CHECK_KERNEL_ERROR { gpuCheckLast(__FILE__, __LINE__); }

__host__
void checkAndEnableP2P(int num_devices) {
    for (int i=0; i<num_devices; i++) {
        CUDA_CHECK_ERROR( cudaSetDevice(i) );
        for (int j=0; j<num_devices; j++) {
            int canAccessPeer;
            if (i!=j) {
                CUDA_CHECK_ERROR( cudaDeviceCanAccessPeer(&canAccessPeer, i, j) );
                if (canAccessPeer==0) {
                    std::cerr << "Problem: device " << i << " cannot access peer device " << j << std::endl;
                    std::abort();
                }
                std::cerr << "Enabling access to peer device " << j << " from device " << i << std::endl;
                CUDA_CHECK_ERROR( cudaDeviceEnablePeerAccess(j, 0) );
            }
        }
    }
}

__global__ 
void kernelInit(int *d_v, int const size, int const val) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i<size) {
        d_v[i] = val;
    }
}

__global__ 
void kernelA(int *d_a, int const size, int const num_devices, int const t) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i<size) {
        d_a[i] *= (t + num_devices*i/size);
    }
}

__global__
void kernelB(int *d_b, int const size, int const u) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i<size) {
        d_b[i] += u;
    }
}

__host__
void get_exact_values(int *checkVals, int const num_devices) {
    for (int dev_id=0; dev_id<num_devices; ++dev_id) {
        int base = 100;
        for (int did=0; did<dev_id; ++did) {base*=10;}
        for (int j=0; j<num_devices; ++j) {
            checkVals[dev_id*num_devices + j] = base + (j+1)*(2*(j+1)+dev_id);
        }
    }
}

__global__
void kernelCheckValue(int const *d_b, int const size, int const checkVal, int *d_cW) {
    // count the number of values in d_b which differ from checkVal. d_cW is the output count
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i<size) {
        if (d_b[i]!=checkVal) {
            atomicAdd(&d_cW[0], 1);
        }
    }
}

__global__
void printArray(int *d_v, int const size) {
    for (int i=0; i<size; ++i) {
        printf("array[%d] = %d\n", i, d_v[i]);
    }
}

int main() {
    int num_devices;
    CUDA_CHECK_ERROR( cudaGetDeviceCount(&num_devices) );
    checkAndEnableP2P(num_devices);

    const int n = 1<<26;
    const int N = num_devices * n;
    std::cout << "num_devices = " << num_devices << std::endl;
    std::cout << "n = " << n << std::endl;
    std::cout << "N = " << N << std::endl;
    const int blockSize = 256;
    const int numBlocks = (N + blockSize - 1)/blockSize;

    cudaStream_t streams[num_devices];
    int *d_a[num_devices];
    int *d_b[num_devices];
    int *d_countWrong[num_devices];
    for (int dev_id=0; dev_id<num_devices; ++dev_id) {
        CUDA_CHECK_ERROR( cudaSetDevice(dev_id) );
        CUDA_CHECK_ERROR( cudaStreamCreateWithFlags(&streams[dev_id], cudaStreamDefault) );
        CUDA_CHECK_ERROR( cudaMalloc((void**)&d_a[dev_id], N*sizeof(int)) );
        CUDA_CHECK_ERROR( cudaMalloc((void**)&d_b[dev_id], N*sizeof(int)) );
        CUDA_CHECK_ERROR( cudaMalloc((void**)&d_countWrong[dev_id], num_devices*sizeof(int)) );
    }

    // buffers initialization
    for (int dev_id=0; dev_id<num_devices; ++dev_id) {
        CUDA_CHECK_ERROR( cudaSetDevice(dev_id) );
        kernelInit<<<numBlocks, blockSize, 0, streams[dev_id]>>>(d_a[dev_id], N, dev_id+1);
    }

    // let's synchronize to start kernelA "at the same time" on all devices
    for (int dev_id=0; dev_id<num_devices; ++dev_id) {
        CUDA_CHECK_ERROR( cudaStreamSynchronize(streams[dev_id]) );
    }

    // kernelA
    for (int dev_id=0; dev_id<num_devices; ++dev_id) {
        CUDA_CHECK_ERROR( cudaSetDevice(dev_id) );
        kernelA<<<numBlocks, blockSize, 0, streams[dev_id]>>>(d_a[dev_id], N, num_devices, 2*(dev_id+1));
    }

    // Communication Device to Device
    for (int srcDevice=0; srcDevice<num_devices; ++srcDevice) {
        CUDA_CHECK_ERROR( cudaSetDevice(srcDevice) );
        for (int dstDevice=0; dstDevice<num_devices; ++dstDevice) {
            if (dstDevice==srcDevice) {
                CUDA_CHECK_ERROR( cudaMemcpyAsync(
                    &d_b[dstDevice][n*srcDevice],
                    &d_a[srcDevice][n*dstDevice],
                    n*sizeof(int),
                    cudaMemcpyDeviceToDevice,
                    streams[srcDevice]) );
            } else {
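                // cudaMemcpyPeerAsync transfers directly between the two GPUs when
                // peer access is enabled; if it is not, the driver falls back to
                // staging the copy through host memory.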
                CUDA_CHECK_ERROR( cudaMemcpyPeerAsync(
                    &d_b[dstDevice][n*srcDevice],
                    dstDevice,
                    &d_a[srcDevice][n*dstDevice],
                    srcDevice,
                    n*sizeof(int),
                    streams[srcDevice]) );
            }
        }
    }

    // make sure communication is over on all streams
    for (int dev_id=0; dev_id<num_devices; ++dev_id) {
        CUDA_CHECK_ERROR( cudaStreamSynchronize(streams[dev_id]) );
    }

    // kernelB
    for (int dev_id=0; dev_id<num_devices; ++dev_id) {
        CUDA_CHECK_ERROR( cudaSetDevice(dev_id) );
        int adder = 100;
        for (int did=0; did<dev_id; ++did) {adder*=10;} // adder <== 10**(2+dev_id)
        kernelB<<<numBlocks, blockSize, 0, streams[dev_id]>>>(d_b[dev_id], N, adder);
    }

    // Compute the true exact values that each device must have
    int *checkVals;
    CUDA_CHECK_ERROR( cudaMallocHost((void**)&checkVals, num_devices*num_devices*sizeof(int)) );
    get_exact_values(checkVals, num_devices);

    // synchronize before printing
    for (int dev_id=0; dev_id<num_devices; ++dev_id) {
        CUDA_CHECK_ERROR( cudaStreamSynchronize(streams[dev_id]) );
    }

    // Print values [from the default stream]
    for (int dev_id=0; dev_id<num_devices; ++dev_id) {
        CUDA_CHECK_ERROR( cudaSetDevice(dev_id) );
        for (int j=0; j<num_devices; ++j) {
            std::cout << "Printing d_b[" << dev_id << "] from index " << n*j << std::endl;
            std::cout << "expected result: " << checkVals[dev_id*num_devices+j] << std::endl;
            printArray<<<1, 1, 0, 0>>>(&d_b[dev_id][n*j], 4);
            CUDA_CHECK_ERROR( cudaDeviceSynchronize() );
        }
    }

    // check the values [on the default streams]
    for (int dev_id=0; dev_id<num_devices; ++dev_id) {
        CUDA_CHECK_ERROR( cudaSetDevice(dev_id) );
        CUDA_CHECK_ERROR( cudaMemset(d_countWrong[dev_id], 0, num_devices*sizeof(int)) );
        for (int j=0; j<num_devices; ++j) {
            const int numBlocksR = (n + blockSize - 1)/blockSize;
            kernelCheckValue<<<numBlocksR, blockSize, 0, 0>>>(
                &d_b[dev_id][n*j], 
                n, 
                checkVals[dev_id*num_devices+j], 
                &d_countWrong[dev_id][j]);
            CUDA_CHECK_KERNEL_ERROR
            int h_countWrong = 0;
            CUDA_CHECK_ERROR( cudaMemcpy(&h_countWrong, &d_countWrong[dev_id][j], sizeof(int), cudaMemcpyDeviceToHost) );
            if (h_countWrong>0) {
                std::cout << "device " << dev_id << ", part " << j << " has " << h_countWrong << " WRONG value(s)." << std::endl;
            } else {
                std::cout << "device " << dev_id << ", part " << j << " has all CORRECT values." << std::endl;
            }
        }
    }

    // All cleaning
    for (int dev_id=0; dev_id<num_devices; ++dev_id) {
        CUDA_CHECK_ERROR( cudaSetDevice(dev_id) );
        CUDA_CHECK_ERROR( cudaDeviceSynchronize() );
        CUDA_CHECK_ERROR( cudaStreamDestroy(streams[dev_id]) );
        CUDA_CHECK_ERROR( cudaFree(d_a[dev_id]) );
        CUDA_CHECK_ERROR( cudaFree(d_b[dev_id]) );
        CUDA_CHECK_ERROR( cudaFree(d_countWrong[dev_id]) );
    }
    CUDA_CHECK_ERROR( cudaFreeHost(checkVals) );

    return 0;
}

It’s possible that the system with the RTX GPUs is broken with respect to P2P, even though the cudaDeviceCanAccessPeer test returned a positive result.

Do you get good results running the p2pBandwidthLatencyTest CUDA sample code?

Interesting. Indeed, the last test (latency with P2P enabled) had been running for over 15 minutes, so I reduced the repeat count to 10 instead of 10k and obtained the results below (for 7 out of the 8 GPUs of the node). But again, all tests execute; I am just not sure that the communication really happens correctly.

[P2P (Peer-to-Peer) GPU Bandwidth Latency Test]
Device: 0, NVIDIA RTX A5000, pciBusID: 4f, pciDeviceID: 0, pciDomainID:0
Device: 1, NVIDIA RTX A5000, pciBusID: 52, pciDeviceID: 0, pciDomainID:0
Device: 2, NVIDIA RTX A5000, pciBusID: 57, pciDeviceID: 0, pciDomainID:0
Device: 3, NVIDIA RTX A5000, pciBusID: ce, pciDeviceID: 0, pciDomainID:0
Device: 4, NVIDIA RTX A5000, pciBusID: d1, pciDeviceID: 0, pciDomainID:0
Device: 5, NVIDIA RTX A5000, pciBusID: d5, pciDeviceID: 0, pciDomainID:0
Device: 6, NVIDIA RTX A5000, pciBusID: d6, pciDeviceID: 0, pciDomainID:0
Device=0 CAN Access Peer Device=1
Device=0 CAN Access Peer Device=2
Device=0 CAN Access Peer Device=3
Device=0 CAN Access Peer Device=4
Device=0 CAN Access Peer Device=5
Device=0 CAN Access Peer Device=6
Device=1 CAN Access Peer Device=0
Device=1 CAN Access Peer Device=2
Device=1 CAN Access Peer Device=3
Device=1 CAN Access Peer Device=4
Device=1 CAN Access Peer Device=5
Device=1 CAN Access Peer Device=6
Device=2 CAN Access Peer Device=0
Device=2 CAN Access Peer Device=1
Device=2 CAN Access Peer Device=3
Device=2 CAN Access Peer Device=4
Device=2 CAN Access Peer Device=5
Device=2 CAN Access Peer Device=6
Device=3 CAN Access Peer Device=0
Device=3 CAN Access Peer Device=1
Device=3 CAN Access Peer Device=2
Device=3 CAN Access Peer Device=4
Device=3 CAN Access Peer Device=5
Device=3 CAN Access Peer Device=6
Device=4 CAN Access Peer Device=0
Device=4 CAN Access Peer Device=1
Device=4 CAN Access Peer Device=2
Device=4 CAN Access Peer Device=3
Device=4 CAN Access Peer Device=5
Device=4 CAN Access Peer Device=6
Device=5 CAN Access Peer Device=0
Device=5 CAN Access Peer Device=1
Device=5 CAN Access Peer Device=2
Device=5 CAN Access Peer Device=3
Device=5 CAN Access Peer Device=4
Device=5 CAN Access Peer Device=6
Device=6 CAN Access Peer Device=0
Device=6 CAN Access Peer Device=1
Device=6 CAN Access Peer Device=2
Device=6 CAN Access Peer Device=3
Device=6 CAN Access Peer Device=4
Device=6 CAN Access Peer Device=5

***NOTE: In case a device doesn't have P2P access to other one, it falls back to normal memcopy procedure.
So you can see lesser Bandwidth (GB/s) in those cases.

P2P Cliques: 
[0 1 2 3 4 5 6]
Unidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D     0      1      2      3      4      5      6 
     0 304.70  17.32  17.30  20.87  20.77  20.76  20.77 
     1  17.93 320.81  17.96  20.87  20.94  20.69  20.94 
     2  17.88  17.90 320.86  20.90  21.00  20.90  20.98 
     3  21.10  20.94  20.98 323.03  17.33  17.27  17.37 
     4  21.00  20.92  20.96  17.89 322.76  17.83  17.90 
     5  20.87  20.95  20.96  17.81  17.88 321.40  17.87 
     6  20.84  20.85  20.95  17.96  17.93  17.96 323.48 
Unidirectional P2P=Enabled Bandwidth Matrix (GB/s)
   D\D     0      1      2      3      4      5      6 
     0 321.04   0.73   0.73  18.39  18.46  18.47  18.48 
     1   0.73 277.28   0.73  18.35  18.49  18.50  18.49 
     2   0.73   0.73 271.40  18.49  18.49  18.48  18.50 
     3  18.49  18.48  18.51 297.90   0.73   0.73   0.73 
     4  18.34  18.50  18.51   0.73 289.55   0.73   0.73 
     5  18.49  18.48  18.49   0.44   0.73 317.57   0.73 
     6  18.50  18.50  18.50   0.62   0.73   0.73 285.13 
Bidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D     0      1      2      3      4      5      6 
     0 329.09  18.63  20.32  35.99  36.25  36.39  36.48 
     1  19.04 328.53  19.46  36.08  36.29  36.38  36.50 
     2  19.36  20.17 329.09  36.10  36.35  36.23  36.32 
     3  36.34  36.35  36.24 327.98  18.18  20.01  19.38 
     4  36.25  36.34  36.10  20.14 330.20  19.58  19.48 
     5  36.48  36.55  36.53  20.01  20.43 331.88  19.82 
     6  36.18  36.46  36.43  20.76  19.81  19.81 331.88 
Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)
   D\D     0      1      2      3      4      5      6 
     0 329.36   1.34   0.80  36.95  36.96  36.96  36.97 
     1   1.34 314.01   1.34  36.96  36.97  36.97  36.97 
     2   1.34   1.34 323.63  36.94  36.93  36.94  36.94 
     3  36.74  36.96  36.96 327.16   1.19   1.16   1.16 
     4  36.96  36.56  36.80   0.86 330.48   0.79   0.80 
     5  36.93  36.94  36.94   0.87   0.79 329.64   1.22 
     6  36.97  36.97  36.97   1.34   1.18   1.34 330.48 
P2P=Disabled Latency Matrix (us)
   D\D     0      1      2      3      4      5      6 
     0   3.99  15.93  11.50  13.04  13.80  14.65  17.15 
     1  15.25   4.20  11.76  13.56  13.22  15.98  14.14 
     2  12.79  13.09   4.80  10.54  11.74  13.34  12.79 
     3  14.65  15.52  11.67   4.94  10.52  13.22  13.42 
     4  13.32  14.64  12.11  10.93   4.03  12.45  11.27 
     5  17.52  16.44  13.99  12.40  11.65   2.95  15.07 
     6  13.50  16.34  13.88  13.16  13.05  16.02   2.87 
P2P=Enabled Latency Matrix (us)
   D\D     0      1      2      3      4      5      6 
     0   3.36 49215.49 49216.41   6.55   6.66   8.09   8.09 
     1 49215.51   3.84 49214.97  15.97   5.63   7.58   7.68 
     2 49215.43 49212.05   4.89   5.63   5.73   7.58   7.37 
     3   6.04   5.37   6.18   4.11 49214.05 49214.57 49213.85 
     4   6.13  12.27   6.86 49215.44   4.78 49215.08 49215.18 
     5   6.85   6.32   7.16 49215.31 49215.23   3.52 49214.57 
     6   5.73  14.23   7.27 49214.57 49213.85 49214.47   3.38 

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.

That’s abnormal.

That should not be necessary. The test is designed to run in less than a minute, as-is, on a properly working modern platform (try it unmodified on your DGXA100 system if you like).

The results above showing ~0.73 GB/s bandwidth and ~49 ms latency with P2P enabled, for example, are a problem/unexpected. I think the platform is suspect. If you ordered it configured exactly like this from a reputable OEM, then you should take these results to them and ask for a resolution. That isn’t something we can do here, nor can NVIDIA fix your platform.

If this is a platform you built yourself, then I would suspect that you have started with a motherboard setup which was not properly designed for this activity.

I don’t think I will be able to help further with a platform issue, but this topic may be of interest.

Thanks a lot, I will report the issue to our cluster admin. Indeed, on the DGX system, the test executes in around 30 seconds.
I will post a message here once this is resolved.

In the end, it was just a matter of disabling ACS (PCIe Access Control Services) in the BIOS: with ACS enabled, peer-to-peer PCIe transactions are redirected through the root complex, which defeats direct GPU-to-GPU transfers.
