Inconsistent cudaMemcpy execution time

Hi, I need to transfer 1920x1080 RGB frames to the GPU for processing. I am using cudaMemcpyAsync on a non-default stream to copy the data to the GPU. When I profile the application with the Visual Profiler, I can see that the time taken by the memcpy is inconsistent: sometimes it takes 3-4 ms and sometimes 40-50 ms. I also observed that the reported transfer throughput varies. What might I be doing wrong here?

I need to process the frames in real time and achieve at least 60 FPS at 1920x1080. Any suggestions are welcome.
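
For reference, here is a stripped-down sketch of the kind of copy path I mean (names and sizes are illustrative, error checking omitted). I am not 100% sure whether the host buffer has to be pinned; in this sketch it is allocated with cudaHostAlloc, since as far as I understand cudaMemcpyAsync from pageable memory is staged through an internal buffer and is not fully asynchronous:

#include <cuda_runtime.h>
#include <cstdio>

#define WIDTH    1920
#define HEIGHT   1080
#define CHANNELS 3

int main()
{
    const size_t frameBytes = WIDTH * HEIGHT * CHANNELS;  // 8-bit RGB frame
    unsigned char *h_frame, *d_frame;
    cudaStream_t stream;

    cudaStreamCreate(&stream);
    // Pinned (page-locked) host buffer: needed for cudaMemcpyAsync to be
    // truly asynchronous; with pageable memory the copy is staged internally.
    cudaHostAlloc((void **) &h_frame, frameBytes, cudaHostAllocDefault);
    cudaMalloc((void **) &d_frame, frameBytes);

    // ... fill h_frame with the next captured frame ...

    cudaMemcpyAsync(d_frame, h_frame, frameBytes,
                    cudaMemcpyHostToDevice, stream);
    // ProcessFrame<<<grid, block, 0, stream>>>(d_frame, ...);  // placeholder for the processing kernel
    cudaStreamSynchronize(stream);

    cudaFree(d_frame);
    cudaFreeHost(h_frame);
    cudaStreamDestroy(stream);
    return 0;
}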

Hi,
Wouldn’t it be better to use CUDA graphics interoperability? Register and map the resource (the frame data), process it in CUDA, unmap it, and pass it back for display. Or are you already using such a scheme?
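
In outline the OpenGL interop path would look something like this (just a sketch, error checking omitted; it assumes the frame already sits in a GL pixel buffer object and that ProcessFrame is your own kernel):

#include <GL/gl.h>
#include <cuda_gl_interop.h>

// __global__ void ProcessFrame(unsigned char *rgb, int width, int height);  // your kernel

void ProcessPboWithCuda(GLuint pbo, int width, int height, cudaStream_t stream)
{
    cudaGraphicsResource *res = 0;

    // One-time registration of the GL buffer with CUDA (flags can be tuned).
    cudaGraphicsGLRegisterBuffer(&res, pbo, cudaGraphicsRegisterFlagsNone);

    // Per frame: map, get a device pointer, run the kernel, unmap.
    cudaGraphicsMapResources(1, &res, stream);
    unsigned char *d_rgb = 0;
    size_t numBytes = 0;
    cudaGraphicsResourceGetMappedPointer((void **) &d_rgb, &numBytes, res);

    // ProcessFrame<<<grid, block, 0, stream>>>(d_rgb, width, height);

    cudaGraphicsUnmapResources(1, &res, stream);
    cudaGraphicsUnregisterResource(res);  // once the resource is no longer needed
}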

MK

Are you by chance using the cudaDeviceScheduleBlockingSync setting with a Kepler card? If so, I filed a similar bug with a test case earlier this year. That has now reportedly been fixed and the fix will be included in the next CUDA release. However, it did not seem to impact the default schedule setting (cudaDeviceScheduleSpin), so if you are not using BlockingSync then it is likely a different issue.
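
If you want to rule that out quickly, you can force the scheduling policy explicitly; the flag has to be set before the CUDA context is created, i.e. before the first runtime call that touches the device. A minimal sketch:

#include <cuda_runtime.h>
#include <cassert>

int main()
{
    // Must run before any call that creates the CUDA context.
    // Swap in cudaDeviceScheduleBlockingSync to compare against the spin policy.
    assert(cudaSetDeviceFlags(cudaDeviceScheduleSpin) == cudaSuccess);

    // ... the rest of the application (stream creation, memcpys, kernels) ...
    return 0;
}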

I think I’ve been tracking the same bug for quite some time (I reported it with a reproducible test case about this time last year). However, the bug I saw wasn’t restricted to Kepler; I could reproduce it on Fermi as well. The bug was introduced by the 3xx-series kernel drivers (I could reproduce it with CUDA 4.0 and a 3xx driver on Fermi). I’ve noticed that the variable delays are no longer present with recent kernel drivers. HOWEVER, that is because BlockingSync is ignored and the CPU thread spins. I’m glad to hear that a fix is imminent (CUDA 6.0?). It’s interesting that the bug would be fixed by a CUDA release, since my experience suggests that the problem lies within the kernel driver. I had hoped a fix might come in the form of a kernel driver update, since those are released more frequently than CUDA toolkits.
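
One way to see whether BlockingSync is actually being honored is to compare process CPU time against wall time across a long synchronize: if the driver blocks the thread as requested, CPU time stays near zero; if it spins, the two grow together. A rough sketch (timing style borrowed from my test case below):

#include <cuda_runtime.h>
#include <cassert>
#include <cstdio>
#include <time.h>

static double NowMs(clockid_t clk)
{
    struct timespec t;
    assert(clock_gettime(clk, &t) == 0);
    return 1000.0 * t.tv_sec + 1.0e-6 * t.tv_nsec;
}

__global__ void BusyKernel(double *x)
{
    // Long-running single-thread kernel, just to give the host something to wait on.
    double arg = 42.0;
    for (int k = 0; k < 1000000; ++k)
        arg = sin(arg);
    *x = arg;
}

int main()
{
    assert(cudaSetDeviceFlags(cudaDeviceScheduleBlockingSync) == cudaSuccess);

    double *dev_x;
    assert(cudaMalloc((void **) &dev_x, sizeof(double)) == cudaSuccess);

    double wall0 = NowMs(CLOCK_REALTIME);
    double cpu0  = NowMs(CLOCK_PROCESS_CPUTIME_ID);

    BusyKernel<<<1,1>>>(dev_x);
    assert(cudaDeviceSynchronize() == cudaSuccess);

    // If BlockingSync is honored, CPU time is a tiny fraction of wall time.
    printf("wall: %.1f ms, cpu: %.1f ms\n",
           NowMs(CLOCK_REALTIME) - wall0,
           NowMs(CLOCK_PROCESS_CPUTIME_ID) - cpu0);

    cudaFree(dev_x);
    return 0;
}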

caduc3us: If you’re using a Fermi GPU, try downgrading to a 27x or 29x kernel driver. You may have to reinstall CUDA.

Here is the test case that I had submitted:

#include <cuda_runtime.h>
#include <cassert>
#include <cstdio>
#include <cstring>
#include <time.h>
#include <unistd.h>

#define NITER 8

__global__ void SimpleKernel(double *x, double y)
{
    double arg = y;
    for (int k = 0; k < 10000; ++k)
    {
        arg = sin(arg);
    }
    *x = arg;
}

double ElapsedTimeMs(struct timespec initTime)
{
    struct timespec t;
    assert(clock_gettime(CLOCK_REALTIME, &t) == 0);
    double diff_ms = (1000.0 * t.tv_sec + 1.0e-6 * t.tv_nsec) -
        (1000.0 * initTime.tv_sec + 1.0e-6 * initTime.tv_nsec);
    return diff_ms;
}

int main(int argc, char **argv)
{
    double *dev_x, *x;
    cudaStream_t stream;

    assert(cudaSetDeviceFlags(cudaDeviceScheduleBlockingSync) == cudaSuccess);

    struct timespec initTime;
    assert(clock_gettime(CLOCK_REALTIME, &initTime) == 0);

    assert(cudaStreamCreate(&stream) == cudaSuccess);
    assert(cudaMalloc((void **) &dev_x, NITER*sizeof(double)) == cudaSuccess);
    assert(cudaMemset(dev_x, 0, sizeof(double)*NITER) == cudaSuccess);
    assert(cudaDeviceSynchronize() == cudaSuccess);

    x = new double[NITER];
    memset(x, 0, sizeof(double)*NITER);

    printf("Synchronizing between the kernel and async memcpy:\n");
    for (int k = 0; k < NITER; ++k)
    {
        double start = ElapsedTimeMs(initTime);
        SimpleKernel<<<1,1,0,stream>>>(dev_x+k,42);
        assert(cudaStreamSynchronize(stream) == cudaSuccess);
        assert(cudaMemcpyAsync(
            x+k, dev_x+k, sizeof(double), cudaMemcpyDeviceToHost, stream) == cudaSuccess);
        assert(cudaStreamSynchronize(stream) == cudaSuccess);
        double elapsed = ElapsedTimeMs(initTime)-start;
        printf("Iteration %d time: %.2f ms\n", k, elapsed);
    }
    printf("\nNot synchronizing between the kernel and async memcpy:\n");
    for (int k = 0; k < NITER; ++k)
    {
        double start = ElapsedTimeMs(initTime);
        SimpleKernel<<<1,1,0,stream>>>(dev_x+k,42);
        assert(cudaMemcpyAsync(
            x+k, dev_x+k, sizeof(double), cudaMemcpyDeviceToHost, stream) == cudaSuccess);
        assert(cudaStreamSynchronize(stream) == cudaSuccess);
        double elapsed = ElapsedTimeMs(initTime)-start;
        printf("Iteration %d time: %.2f ms\n", k, elapsed);
    }

    delete [] x;
}
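
For reference, I build and run it with something along these lines (file name and -arch are arbitrary; sm_35 matches the K20c, and -lrt is needed for clock_gettime with older glibc):

nvcc -arch=sm_35 -lrt -o blocking_sync_test blocking_sync_test.cu
./blocking_sync_test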

I got consistent run-times on an M2090 (under 5 ms), but on the K20c it would bounce between 6 ms and 100 ms for the non-synchronizing case (and be stable between 6 and 7 ms for the synchronizing case). That was with CUDA 5.0 and driver 310.44. It will reportedly be fixed in the next CUDA release, but I don’t have any additional details on the issue/fix.