Inconsistent cudaMemcpy execution time

Hi, I need to transfer 1920x1080 RGB frames to the GPU for processing. I am using cudaMemcpyAsync on a non-default stream to copy the data to the GPU. When I profile the application with the Visual Profiler, I can see that the time taken by the memcpy is inconsistent: sometimes it takes 3-4 ms and sometimes 40-50 ms. I also observed that the reported transfer throughput varies. What might I be doing wrong here?

I need to process the frames in real time and achieve at least 60 FPS at 1920x1080. Any suggestions are welcome.
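
For reference, here is a stripped-down sketch of the kind of copy path I mean (names and sizes are illustrative, error checking omitted). I am not 100% sure whether the host buffer has to be pinned; in this sketch it is allocated with cudaHostAlloc, since as far as I understand cudaMemcpyAsync from pageable memory is staged through an internal buffer and is not fully asynchronous:

#include <cuda_runtime.h>
#include <cstdio>

#define WIDTH    1920
#define HEIGHT   1080
#define CHANNELS 3

int main()
{
    const size_t frameBytes = WIDTH * HEIGHT * CHANNELS;  // 8-bit RGB frame
    unsigned char *h_frame, *d_frame;
    cudaStream_t stream;

    cudaStreamCreate(&stream);
    // Pinned (page-locked) host buffer: needed for cudaMemcpyAsync to be
    // truly asynchronous; with pageable memory the copy is staged internally.
    cudaHostAlloc((void **) &h_frame, frameBytes, cudaHostAllocDefault);
    cudaMalloc((void **) &d_frame, frameBytes);

    // ... fill h_frame with the next captured frame ...

    cudaMemcpyAsync(d_frame, h_frame, frameBytes,
                    cudaMemcpyHostToDevice, stream);
    // ProcessFrame<<<grid, block, 0, stream>>>(d_frame, ...);  // placeholder for the processing kernel
    cudaStreamSynchronize(stream);

    cudaFree(d_frame);
    cudaFreeHost(h_frame);
    cudaStreamDestroy(stream);
    return 0;
}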

Hi,
Wouldn’t it be better to use CUDA graphics interoperability? Register and map the resource (the frame data), process it in CUDA, unmap it, and pass it back for display. Or are you already using such a scheme?
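
In outline the OpenGL interop path would look something like this (just a sketch, error checking omitted; it assumes the frame already sits in a GL pixel buffer object and that ProcessFrame is your own kernel):

#include <GL/gl.h>
#include <cuda_gl_interop.h>

// __global__ void ProcessFrame(unsigned char *rgb, int width, int height);  // your kernel

void ProcessPboWithCuda(GLuint pbo, int width, int height, cudaStream_t stream)
{
    cudaGraphicsResource *res = 0;

    // One-time registration of the GL buffer with CUDA (flags can be tuned).
    cudaGraphicsGLRegisterBuffer(&res, pbo, cudaGraphicsRegisterFlagsNone);

    // Per frame: map, get a device pointer, run the kernel, unmap.
    cudaGraphicsMapResources(1, &res, stream);
    unsigned char *d_rgb = 0;
    size_t numBytes = 0;
    cudaGraphicsResourceGetMappedPointer((void **) &d_rgb, &numBytes, res);

    // ProcessFrame<<<grid, block, 0, stream>>>(d_rgb, width, height);

    cudaGraphicsUnmapResources(1, &res, stream);
    cudaGraphicsUnregisterResource(res);  // once the resource is no longer needed
}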

MK

Are you by chance using the cudaDeviceScheduleBlockingSync setting with a Kepler card? If so, I filed a similar bug with a test case earlier this year. That has now reportedly been fixed and the fix will be included in the next CUDA release. However, it did not seem to impact the default schedule setting (cudaDeviceScheduleSpin), so if you are not using BlockingSync then it is likely a different issue.
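
If you want to rule that out quickly, you can force the scheduling policy explicitly; the flag has to be set before the CUDA context is created, i.e. before the first runtime call that touches the device. A minimal sketch:

#include <cuda_runtime.h>
#include <cassert>

int main()
{
    // Must run before any call that creates the CUDA context.
    // Swap in cudaDeviceScheduleBlockingSync to compare against the spin policy.
    assert(cudaSetDeviceFlags(cudaDeviceScheduleSpin) == cudaSuccess);

    // ... the rest of the application (stream creation, memcpys, kernels) ...
    return 0;
}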

I think I’ve been tracking the same bug for quite some time (I reported it with a reproducible test case about this time last year). However, the bug I saw wasn’t restricted to Kepler; I could reproduce it on Fermi as well. The bug was introduced by the 3xx-series kernel drivers (I could reproduce it with CUDA 4.0 and a 3xx driver on Fermi). I’ve noticed that the variable delays are no longer present with recent kernel drivers. HOWEVER, that is because BlockingSync is ignored and the CPU thread spins. I’m glad to hear that a fix is imminent (CUDA 6.0?). It’s interesting that the bug would be fixed by a CUDA release, since my experience suggests that the problem lies within the kernel driver. I had hoped a fix might come in the form of a kernel driver update, since those are released more frequently than CUDA toolkits.
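
One way to see whether BlockingSync is actually being honored is to compare process CPU time against wall time across a long synchronize: if the driver blocks the thread as requested, CPU time stays near zero; if it spins, the two grow together. A rough sketch (timing style borrowed from my test case below):

#include <cuda_runtime.h>
#include <cassert>
#include <cstdio>
#include <time.h>

static double NowMs(clockid_t clk)
{
    struct timespec t;
    assert(clock_gettime(clk, &t) == 0);
    return 1000.0 * t.tv_sec + 1.0e-6 * t.tv_nsec;
}

__global__ void BusyKernel(double *x)
{
    // Long-running single-thread kernel, just to give the host something to wait on.
    double arg = 42.0;
    for (int k = 0; k < 1000000; ++k)
        arg = sin(arg);
    *x = arg;
}

int main()
{
    assert(cudaSetDeviceFlags(cudaDeviceScheduleBlockingSync) == cudaSuccess);

    double *dev_x;
    assert(cudaMalloc((void **) &dev_x, sizeof(double)) == cudaSuccess);

    double wall0 = NowMs(CLOCK_REALTIME);
    double cpu0  = NowMs(CLOCK_PROCESS_CPUTIME_ID);

    BusyKernel<<<1,1>>>(dev_x);
    assert(cudaDeviceSynchronize() == cudaSuccess);

    // If BlockingSync is honored, CPU time is a tiny fraction of wall time.
    printf("wall: %.1f ms, cpu: %.1f ms\n",
           NowMs(CLOCK_REALTIME) - wall0,
           NowMs(CLOCK_PROCESS_CPUTIME_ID) - cpu0);

    cudaFree(dev_x);
    return 0;
}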

caduc3us: If you’re using a Fermi GPU, try downgrading to a 27x or 29x kernel driver. You may have to reinstall CUDA.

Here is the test case that I had submitted:

#include <cuda_runtime.h>
#include <cassert>
#include <cstdio>
#include <cstring>
#include <time.h>
#include <unistd.h>

#define NITER 8

__global__ void SimpleKernel(double *x, double y)
{
    double arg = y;
    for (int k = 0; k < 10000; ++k)
    {
        arg = sin(arg);
    }
    *x = arg;
}

double ElapsedTimeMs(struct timespec initTime)
{
    struct timespec t;
    assert(clock_gettime(CLOCK_REALTIME, &t) == 0);
    double diff_ms = (1000.0 * t.tv_sec + 1.0e-6 * t.tv_nsec) -
        (1000.0 * initTime.tv_sec + 1.0e-6 * initTime.tv_nsec);
    return diff_ms;
}

int main(int argc, char **argv)
{
    double *dev_x, *x;
    cudaStream_t stream;

    assert(cudaSetDeviceFlags(cudaDeviceScheduleBlockingSync) == cudaSuccess);

    struct timespec initTime;
    assert(clock_gettime(CLOCK_REALTIME, &initTime) == 0);

    assert(cudaStreamCreate(&stream) == cudaSuccess);
    assert(cudaMalloc((void **) &dev_x, NITER*sizeof(double)) == cudaSuccess);
    assert(cudaMemset(dev_x, 0, sizeof(double)*NITER) == cudaSuccess);
    assert(cudaDeviceSynchronize() == cudaSuccess);

    x = new double[NITER];
    memset(x, 0, sizeof(double)*NITER);

    printf("Synchronizing between the kernel and async memcpy:\n");
    for (int k = 0; k < NITER; ++k)
    {
        double start = ElapsedTimeMs(initTime);
        SimpleKernel<<<1,1,0,stream>>>(dev_x+k,42);
        assert(cudaStreamSynchronize(stream) == cudaSuccess);
        assert(cudaMemcpyAsync(
            x+k, dev_x+k, sizeof(double), cudaMemcpyDeviceToHost, stream) == cudaSuccess);
        assert(cudaStreamSynchronize(stream) == cudaSuccess);
        double elapsed = ElapsedTimeMs(initTime)-start;
        printf("Iteration %d time: %.2f ms\n", k, elapsed);
    }
    printf("\nNot synchronizing between the kernel and async memcpy:\n");
    for (int k = 0; k < NITER; ++k)
    {
        double start = ElapsedTimeMs(initTime);
        SimpleKernel<<<1,1,0,stream>>>(dev_x+k,42);
        assert(cudaMemcpyAsync(
            x+k, dev_x+k, sizeof(double), cudaMemcpyDeviceToHost, stream) == cudaSuccess);
        assert(cudaStreamSynchronize(stream) == cudaSuccess);
        double elapsed = ElapsedTimeMs(initTime)-start;
        printf("Iteration %d time: %.2f ms\n", k, elapsed);
    }

    delete [] x;
}
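
For reference, I build and run it with something along these lines (file name and -arch are arbitrary; sm_35 matches the K20c, and -lrt is needed for clock_gettime with older glibc):

nvcc -arch=sm_35 -lrt -o blocking_sync_test blocking_sync_test.cu
./blocking_sync_test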

I got consistent run-times on an M2090 (under 5 ms), but on the K20c it would bounce between 6 ms and 100 ms for the non-synchronizing case (and be stable between 6 and 7 ms for the synchronizing case). That was with CUDA 5.0 and driver 310.44. It will reportedly be fixed in the next CUDA release, but I don’t have any additional details on the issue/fix.