Device to Device cudaMemcpy performance

Can someone kindly explain why the GB/s rate for device-to-device cudaMemcpy shows an increasing trend as data size grows?

Conversely, a memcpy on the CPU gives the expected behavior: a step-wise decrease in GB/s as data size increases. Throughput starts high while the data fits in cache, then drops as larger data must be fetched from off-chip memory.

I can’t explain the behavior of device-to-device cudaMemcpy and hope that someone can offer a logical insight into its performance as the byte count increases. Thank you.

A table or graph accompanying the question would have been helpful. I have no idea what size of transfers you may be looking at, and what kind of timing data you are seeing.

I boldly claim that you will find a similar effect (copy throughput increases with increasing transfer size until some maximum is reached) when you run a benchmark like STREAM to assess the memory throughput of system memory.

A simple mental model of this is that, generally speaking, copy performance continues to increase until the number of cores / threads engaged fully utilizes all hardware involved in copying the data, and in particular maximizes the use of internal buffering mechanisms. There are also various fixed-size “startup” overheads when a processor talks to DRAM, and these impact small transfers more heavily. Full buffers aid throughput maximization, but have a negative impact on the latency of the transfer of any particular piece of data.

Modern high-end CPUs tend to have multiple layers of large caches which “mask” the lower copy performance of system memory for small transfer sizes. Compared to GPUs, CPUs also offer significantly fewer units working in parallel, allowing hardware resources to become “saturated” and operate at “steady state” at smaller transfer sizes compared to GPUs, resulting in a shallower curve.

I am not clear whether the different kinds of DRAM used (DDR4 for system memory vs. GDDR6(X) / HBM2 for the GPU) cause the effect of higher throughput at higher transfer sizes to be more pronounced for GPUs. I lack detailed knowledge of modern memory types; the last time I was familiar with all the details of their operation was when DDR2 was the dominant DRAM technology.

I am attaching a graph. Data size in KB is on the X-axis; the Y-axis is GB/s. I used CUDA events to time cudaMemcpy on a GeForce RTX 2080 Ti and gettimeofday() to time memcpy on an AMD 2950X 16-core processor. The memcpy trend is what I expected.
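For reference, a minimal sketch of the event-based timing approach (the buffer size `bytes` and the device pointers `d_src` / `d_dst` are assumed to be allocated elsewhere; error checking omitted for brevity):

```cuda
// Hedged sketch: timing a device-to-device cudaMemcpy with CUDA events.
// Assumes d_src and d_dst are valid device allocations of at least `bytes` bytes.
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);
cudaMemcpy(d_dst, d_src, bytes, cudaMemcpyDeviceToDevice);
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);          // wait until the copy has finished

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);   // elapsed time in milliseconds
// Each byte is read once and written once, hence the factor of 2.
double gbps = 2.0 * (double)bytes / (ms * 1.0e-3) / 1.0e9;

cudaEventDestroy(start);
cudaEventDestroy(stop);
```

Note that whether to count each byte once or twice (once read, once written) is a reporting convention; the factor of 2 here matches the convention used by the zcopy program later in this thread.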

For the cudaMemcpy trend: 1) Why is GB/s so low at smaller data sizes? It seems counterintuitive. 2) Why does GB/s increase with increasing data size? 3) Why is there a peak at 2048 KB? The curve appears to plateau afterwards.
[Graph: cudaMemcpy throughput (GB/s) vs. data size (KB)]

The CPU results seem way too low, considering this is a high-end CPU with four channels of DDR4-2933. That should yield a memory bandwidth of about 95 GB/sec, so the copy throughput should be about half that.

I would suggest running the STREAM benchmark. Given the CPU architecture, with multiple core complexes each with their own DRAM controllers, you will want to run it multi-threaded. STREAM comprises four different kernels, one of which is COPY. You might want to make a shmoo plot of STREAM COPY performance for block sizes of 8 MB to 512 MB or so.

I already spoke to questions (1) and (2). If you are looking for a more detailed explanation than that: hang around a bit, maybe someone more knowledgeable will chime in and fill in more details.

As for question (3): This could be an artifact caused by your timing methodology. Maybe the machine wasn’t idle. Maybe the test did not use best-out-of-ten reporting. It could also be an artifact of the sizing or policies of various memory-related hardware mechanisms interacting with specific address spacing or transfer sizes. Such interactions can be detrimental to performance (e.g. thrashing in page tables) or synergistic. When generated at high resolution, such performance curves are rarely smooth.

The general shape of the curve is predicted by a fairly simple model. cudaMemcpy is part of the runtime API. It is a host code library call that triggers the underlying transfer. The runtime API has an “overhead” that is roughly constant independent of the transfer size. Therefore we can model the transfer cost as a function combining the fixed API cost plus the actual duration of the transfer. When we do that, the prediction roughly lines up with your measurement.

The GPU memory bandwidth data looks low as well. My Quadro RTX 4000 has a theoretical memory bandwidth of 416 GB/sec (256-bit interface operating at 1625 MHz). Maximum bandwidth measured with my ZCOPY program below is 379 GB/sec:

C:\Users\Norbert\My Programs>zcopy -n100000000
zcopy: using vectors of 100000000 double2 elements (= 1.600e+09 bytes)
zcopy: using 128 threads per block, 781250 blocks
zcopy: mintime = 8.451 msec  memory bandwidth = 378.64 GB/sec

With zcopy -n16777216 it is still 360 GB/sec. The RTX 2080 Ti should have a memory bandwidth of 500+ GB/sec; the copy rate would be half that. Using memcpy() and cudaMemcpy() may not be suitable for determining memory bandwidth, but I haven’t looked into it, preferring to use code where I can see what it does.

#include <stdlib.h>
#include <stdio.h>
#include <string.h>   // memset
#include <math.h>     // fabs, log

#define ZCOPY_THREADS  128
#define ZCOPY_DEFLEN   30000000
#define ZCOPY_ITER     10           // as in STREAM benchmark

// Macro to catch CUDA errors in CUDA runtime calls
#define CUDA_SAFE_CALL(call)                                          \
do {                                                                  \
    cudaError_t err = call;                                           \
    if (cudaSuccess != err) {                                         \
        fprintf (stderr, "Cuda error in file '%s' in line %i : %s.\n",\
                 __FILE__, __LINE__, cudaGetErrorString(err) );       \
        exit(EXIT_FAILURE);                                           \
    }                                                                 \
} while (0)

// Macro to catch CUDA errors in kernel launches
#define CHECK_LAUNCH_ERROR()                                          \
do {                                                                  \
    /* Check synchronous errors, i.e. pre-launch */                   \
    cudaError_t err = cudaGetLastError();                             \
    if (cudaSuccess != err) {                                         \
        fprintf (stderr, "Cuda error in file '%s' in line %i : %s.\n",\
                 __FILE__, __LINE__, cudaGetErrorString(err) );       \
        exit(EXIT_FAILURE);                                           \
    }                                                                 \
    /* Check asynchronous errors, i.e. kernel failed (ULF) */         \
    err = cudaDeviceSynchronize();                                    \
    if (cudaSuccess != err) {                                         \
        fprintf (stderr, "Cuda error in file '%s' in line %i : %s.\n",\
                 __FILE__, __LINE__, cudaGetErrorString( err) );      \
        exit(EXIT_FAILURE);                                           \
    }                                                                 \
} while (0)

#if defined(_WIN32)
#if !defined(WIN32_LEAN_AND_MEAN)
#define WIN32_LEAN_AND_MEAN
#endif
#include <windows.h>
double second (void)
{
    LARGE_INTEGER t;
    static double oofreq;
    static int checkedForHighResTimer;
    static BOOL hasHighResTimer;

    if (!checkedForHighResTimer) {
        hasHighResTimer = QueryPerformanceFrequency (&t);
        oofreq = 1.0 / (double)t.QuadPart;
        checkedForHighResTimer = 1;
    }
    if (hasHighResTimer) {
        QueryPerformanceCounter (&t);
        return (double)t.QuadPart * oofreq;
    } else {
        return (double)GetTickCount() * 1.0e-3;
    }
}
#elif defined(__linux__) || defined(__APPLE__)
#include <stddef.h>
#include <sys/time.h>
double second (void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return (double)tv.tv_sec + (double)tv.tv_usec * 1.0e-6;
}
#else
#error unsupported platform
#endif

__global__ void zcopy (const double2 * __restrict__ src, 
                       double2 * __restrict__ dst, int len)
{
    int stride = gridDim.x * blockDim.x;
    int tid = blockDim.x * blockIdx.x + threadIdx.x;
    for (int i = tid; i < len; i += stride) {
        dst[i] = src[i];
    }
}    

struct zcopyOpts {
    int len;
};

static int processArgs (int argc, char *argv[], struct zcopyOpts *opts)
{
    int error = 0;
    memset (opts, 0, sizeof(*opts));
    while (argc) {
        if (*argv[0] == '-') {
            switch (*(argv[0]+1)) {
            case 'n':
                opts->len = atol(argv[0]+2);
                break;
            default:
                fprintf (stderr, "Unknown switch '%c%s'\n", '-', argv[0]+1);
                error++;
                break;
            }
        }
        argc--;
        argv++;
    }
    return error;
}

int main (int argc, char *argv[])
{
    double start, stop, elapsed, mintime;
    double2 *d_a, *d_b;
    int errors;
    struct zcopyOpts opts;

    errors = processArgs (argc, argv, &opts);
    if (errors) {
        return EXIT_FAILURE;
    }
    opts.len = (opts.len) ? opts.len : ZCOPY_DEFLEN;

    /* Allocate memory on device */
    CUDA_SAFE_CALL (cudaMalloc((void**)&d_a, sizeof(d_a[0]) * opts.len));
    CUDA_SAFE_CALL (cudaMalloc((void**)&d_b, sizeof(d_b[0]) * opts.len));
    
    /* Initialize device memory */
    CUDA_SAFE_CALL (cudaMemset(d_a, 0x00, sizeof(d_a[0]) * opts.len)); // zero
    CUDA_SAFE_CALL (cudaMemset(d_b, 0xff, sizeof(d_b[0]) * opts.len)); // NaN

    /* Compute execution configuration */
    dim3 dimBlock(ZCOPY_THREADS);
    int threadBlocks = (opts.len + (dimBlock.x - 1)) / dimBlock.x;
    dim3 dimGrid(threadBlocks);
    
    printf ("zcopy: using vectors of %d double2 elements (= %.3e bytes)\n", 
            opts.len, (double)sizeof(d_a[0]) * opts.len);
    printf ("zcopy: using %d threads per block, %d blocks\n", 
            dimBlock.x, dimGrid.x);

    mintime = fabs(log(0.0)); /* infinity */
    for (int k = 0; k < ZCOPY_ITER; k++) {
        start = second();
        zcopy<<<dimGrid,dimBlock>>>(d_a, d_b, opts.len);
        CHECK_LAUNCH_ERROR();
        stop = second();
        elapsed = stop - start;
        if (elapsed < mintime) mintime = elapsed;
    }
    printf ("zcopy: mintime = %.3f msec  memory bandwidth = %.2f GB/sec\n",
            1.0e3 * mintime, (2 * sizeof(d_a[0]) * opts.len) / (1e9 * mintime));

    CUDA_SAFE_CALL (cudaFree(d_a));
    CUDA_SAFE_CALL (cudaFree(d_b));

    return EXIT_SUCCESS;
}