Device to Device cudaMemcpy performance

Can someone kindly explain why GB/s for device to device cudaMemcpy shows an increasing trend?

Conversely, doing a memcpy on CPU gives an expected behavior of step-wise decreasing GB/s as data size increases, initially giving higher GB/s as data can fit in cache and then decreasing as data gets bigger as it is fetched from off chip memory.

I can’t explain the behavior of device to device cudaMemcpy and hope that someone can give a logical insight into its performance with increasing bytes. Thank you.

A table or graph accompanying the question would have been helpful. I have no idea what size of transfers you may be looking at, and what kind of timing data you are seeing.

I boldly claim that you will find a similar effect (copy throughput increases with increasing transfer size until some maximum is reached) when you run a benchmark like STREAM to assess the memory throughput of system memory.

A simple mental model of this is that, generally speaking, copy performance continues to increase until the number of cores / threads engaged fully utilize all hardware engaged in the copying data, and in particular maximizes the use of internal buffering mechanisms. There are also various fixed-sized “startup” overheads when a processor talks to DRAM, and these will impact small transfers more heavily. Full buffers aid throughput maximization, but have a negative impact on the latency of the transfer of any particular piece of data.

Modern high-end CPUs tend to have multiple layers of large caches which “mask” the lower copy performance of system memory for small transfer sizes. Compared to GPUs, CPUs also offer significantly fewer units working in parallel, allowing hardware resources to become “saturated” and operate at “steady state” at smaller transfer sizes compared to GPUs, resulting in a shallower curve.

I am not clear whether the different kind of DRAM used (DDR4 for system memory vs GDDR6(X) / HBM2 for the GPU) cause the effect of higher throughput at higher transfer sizes to be more pronounced for GPUs. I lack the detailed knowledge of modern memory types; the last time I was familiar with all the details of operation was when DDR2 was the dominant DRAM technology.

I am attaching a graph. Data size in KB is on X-axis, while Y-axis is in GB/s. I used cuda events to time cudaMemcpy on GeForce RTX 2080Ti and gettimeofday() to time memcpy on AMD 2950X 16-Core Processor. I get the memcpy trend.

For cudaMemcpy trend: 1) Why GB/s so low at smaller data sizes, seem counter intuitive? 2) Why GB/s increasing with increasing data size? 3) Why there is a peak at 2048KB? It appears to plateau afterwards.

The CPU results seem way too low, considering the high-end CPU with four channels of DDR4-2933. That should yield a memory bandwidth of about 95 GB/sec, so the copy throughput would be half that.

I would suggest running the STREAM benchmark. Given the CPU architecture with multiple core complexes each with their own DRAM controllers, would will want to run multi-threaded. STREAM includes four different kernels, one of which is COPY. You might want to make a shmoo-plot of STREAM COPY performance for block sizes of 8 MB to 512 MB or so.

I already spoke to questions (1) and (2). If you are looking for a more detailed explanation than that: hang around a bit, maybe someone more knowledgeable will chime in and fill in more details.

As for question (3): This could be an artifact cause by your timing methodology. Maybe the machine wasn’t idle. Maybe the test did not use best-out-of-ten reporting. It could also be an artifact of the sizing or policies of various memory-related hardware mechanisms interacting with specific address spacing or transfer sizes. Such interactions could be detrimental to performance (e.g. thrashing in page tables) or synergistic. When generated with high resolution such performance curves are rarely smooth.

The general shape of the curve is predicted by a fairly simple model. cudaMemcpy is part of the runtime API. It is a host code library call that triggers the underlying transfer. The runtime API has an “overhead” that is roughly constant independent of the transfer size. Therefore we can model the transfer cost as a function combining the fixed API cost plus the actual duration of the transfer. When we do that, the prediction roughly lines up with your measurement.

The GPU memory bandwidth data looks low as well. My Quadro RTX 4000 has a theoretical memory bandwidth of 416 GB/sec (256-bit interface operating at 1625 MHz). Maximum bandwidth measured with my ZCOPY program below is 379 GB/sec:

C:\Users\Norbert\My Programs>zcopy -n100000000
zcopy: using vectors of 100000000 double2 elements (= 1.600e+09 bytes)
zcopy: using 128 threads per block, 781250 blocks
zcopy: mintime = 8.451 msec  memory bandwith = 378.64 GB/sec

With zcopy -n16777216 it is still 360 GB/sec. The RTX 2080 Ti should have a memory bandwidth of 500+ GB/sec; the copy rate would be half that. Using memcpy() and cudaMemcpy() may not be suitable for determining memory bandwidth, but I haven’t looked into it, preferring to use code where I can see what it does.

#include <stdlib.h>
#include <stdio.h>

#define ZCOPY_THREADS  128
#define ZCOPY_DEFLEN   30000000
#define ZCOPY_ITER     10           // as in STREAM benchmark

// Macro to catch CUDA errors in CUDA runtime calls
#define CUDA_SAFE_CALL(call)                                          \
do {                                                                  \
    cudaError_t err = call;                                           \
    if (cudaSuccess != err) {                                         \
        fprintf (stderr, "Cuda error in file '%s' in line %i : %s.\n",\
                 __FILE__, __LINE__, cudaGetErrorString(err) );       \
        exit(EXIT_FAILURE);                                           \
    }                                                                 \
} while (0)

// Macro to catch CUDA errors in kernel launches
#define CHECK_LAUNCH_ERROR()                                          \
do {                                                                  \
    /* Check synchronous errors, i.e. pre-launch */                   \
    cudaError_t err = cudaGetLastError();                             \
    if (cudaSuccess != err) {                                         \
        fprintf (stderr, "Cuda error in file '%s' in line %i : %s.\n",\
                 __FILE__, __LINE__, cudaGetErrorString(err) );       \
        exit(EXIT_FAILURE);                                           \
    }                                                                 \
    /* Check asynchronous errors, i.e. kernel failed (ULF) */         \
    err = cudaDeviceSynchronize();                                    \
    if (cudaSuccess != err) {                                         \
        fprintf (stderr, "Cuda error in file '%s' in line %i : %s.\n",\
                 __FILE__, __LINE__, cudaGetErrorString( err) );      \
        exit(EXIT_FAILURE);                                           \
    }                                                                 \
} while (0)

#if defined(_WIN32)
#if !defined(WIN32_LEAN_AND_MEAN)
#include <windows.h>
double second (void)
    static double oofreq;
    static int checkedForHighResTimer;
    static BOOL hasHighResTimer;

    if (!checkedForHighResTimer) {
        hasHighResTimer = QueryPerformanceFrequency (&t);
        oofreq = 1.0 / (double)t.QuadPart;
        checkedForHighResTimer = 1;
    if (hasHighResTimer) {
        QueryPerformanceCounter (&t);
        return (double)t.QuadPart * oofreq;
    } else {
        return (double)GetTickCount() * 1.0e-3;
#elif defined(__linux__) || defined(__APPLE__)
#include <stddef.h>
#include <sys/time.h>
double second (void)
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return (double)tv.tv_sec + (double)tv.tv_usec * 1.0e-6;
#error unsupported platform

__global__ void zcopy (const double2 * __restrict__ src, 
                       double2 * __restrict__ dst, int len)
    int stride = gridDim.x * blockDim.x;
    int tid = blockDim.x * blockIdx.x + threadIdx.x;
    for (int i = tid; i < len; i += stride) {
        dst[i] = src[i];

struct zcopyOpts {
    int len;

static int processArgs (int argc, char *argv[], struct zcopyOpts *opts)
    int error = 0;
    memset (opts, 0, sizeof(*opts));
    while (argc) {
        if (*argv[0] == '-') {
            switch (*(argv[0]+1)) {
            case 'n':
                opts->len = atol(argv[0]+2);
                fprintf (stderr, "Unknown switch '%c%s'\n", '-', argv[0]+1);
    return error;

int main (int argc, char *argv[])
    double start, stop, elapsed, mintime;
    double2 *d_a, *d_b;
    int errors;
    struct zcopyOpts opts;

    errors = processArgs (argc, argv, &opts);
    if (errors) {
        return EXIT_FAILURE;
    opts.len = (opts.len) ? opts.len : ZCOPY_DEFLEN;

    /* Allocate memory on device */
    CUDA_SAFE_CALL (cudaMalloc((void**)&d_a, sizeof(d_a[0]) * opts.len));
    CUDA_SAFE_CALL (cudaMalloc((void**)&d_b, sizeof(d_b[0]) * opts.len));
    /* Initialize device memory */
    CUDA_SAFE_CALL (cudaMemset(d_a, 0x00, sizeof(d_a[0]) * opts.len)); // zero
    CUDA_SAFE_CALL (cudaMemset(d_b, 0xff, sizeof(d_b[0]) * opts.len)); // NaN

    /* Compute execution configuration */
    dim3 dimBlock(ZCOPY_THREADS);
    int threadBlocks = (opts.len + (dimBlock.x - 1)) / dimBlock.x;
    dim3 dimGrid(threadBlocks);
    printf ("zcopy: using vectors of %d double2 elements (= %.3e bytes)\n", 
            opts.len, (double)sizeof(d_a[0]) * opts.len);
    printf ("zcopy: using %d threads per block, %d blocks\n", 
            dimBlock.x, dimGrid.x);

    mintime = fabs(log(0.0));
    for (int k = 0; k < ZCOPY_ITER; k++) {
        start = second();
        zcopy<<<dimGrid,dimBlock>>>(d_a, d_b, opts.len);
        stop = second();
        elapsed = stop - start;
        if (elapsed < mintime) mintime = elapsed;
    printf ("zcopy: mintime = %.3f msec  memory bandwith = %.2f GB/sec\n",
            1.0e3 * mintime, (2 * sizeof(d_a[0]) * opts.len) / (1e9 * mintime));

    CUDA_SAFE_CALL (cudaFree(d_a));
    CUDA_SAFE_CALL (cudaFree(d_b));

    return EXIT_SUCCESS;