cudaMemcpy takes 30% of my project time.

Hi All,

In my program, after doing all the calculations, I have to copy the resulting 2400 x 1800 image back from device to host using cudaMemcpy. The copy takes 21 ms, which is very expensive for my program because it accounts for 30% of the overall execution time.

Could anyone tell me why this happens?

And how can I reduce the cudaMemcpy execution time?

Try using pinned (page-locked) host memory.

You could also try to overlap copying the current result with the kernel launch of the next loop iteration (well, actually the other way round: launch the next kernel first, then copy).

Since kernel invocations are asynchronous, you could do something like this:

  // Prepare output space 1.

  // Run kernel iteration 1.

  // Run kernel iteration 2 (this is async).

  // Copy kernel iteration 1 output to output space 1.

  cudaThreadSynchronize();   // Very important.

  // Run kernel iteration 3 (this is async).

  // Copy kernel iteration 2 output...

  cudaThreadSynchronize();   // Very important.



Search Google for overlapping copy and kernel execution, or look in the SDK samples.
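
A note on the outline above: a plain cudaMemcpy issued in the default stream waits for kernels launched before it, so real overlap usually needs separate streams, cudaMemcpyAsync and pinned host buffers, on a device that supports concurrent copy and execution. Below is a minimal sketch of that idea, not a drop-in solution. The names processFrame, runOverlapped and CHECK are made up for illustration, and the kernel only fills the buffer with the iteration number so the result can be verified. (cudaThreadSynchronize has since been deprecated in favour of cudaDeviceSynchronize.)

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper: abort with a message if a CUDA call fails.
#define CHECK(call)                                                  \
    do {                                                             \
        cudaError_t err_ = (call);                                   \
        if (err_ != cudaSuccess) {                                   \
            fprintf(stderr, "%s failed: %s\n", #call,                \
                    cudaGetErrorString(err_));                       \
            exit(EXIT_FAILURE);                                      \
        }                                                            \
    } while (0)

// Hypothetical stand-in for the real per-iteration kernel.
__global__ void processFrame(float *out, int n, float value)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = value;
}

// Alternates between two streams and two buffer pairs so that the copy of
// iteration k can overlap the kernel of iteration k + 1. Assumes n > 0.
// Returns the number of results that did not match (0 means all good).
int runOverlapped(int n, int iterations)
{
    const size_t bytes = (size_t)n * sizeof(float);
    float *dev[2], *host[2];
    cudaStream_t stream[2];
    for (int b = 0; b < 2; ++b) {
        CHECK(cudaMalloc((void **)&dev[b], bytes));
        CHECK(cudaMallocHost((void **)&host[b], bytes)); // pinned host buffer
        CHECK(cudaStreamCreate(&stream[b]));
    }

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    int mismatches = 0;

    // Two extra passes at the end drain the results still in flight.
    for (int k = 0; k < iterations + 2; ++k) {
        int b = k % 2;
        // Wait only for this buffer pair's previous work (iteration k - 2);
        // iteration k - 1 in the other stream may still be running.
        CHECK(cudaStreamSynchronize(stream[b]));
        if (k >= 2) {
            float expected = (float)(k - 2);
            if (host[b][0] != expected || host[b][n - 1] != expected)
                ++mismatches;
        }
        if (k < iterations) {
            processFrame<<<blocks, threads, 0, stream[b]>>>(dev[b], n, (float)k);
            CHECK(cudaGetLastError());
            CHECK(cudaMemcpyAsync(host[b], dev[b], bytes,
                                  cudaMemcpyDeviceToHost, stream[b]));
        }
    }

    for (int b = 0; b < 2; ++b) {
        CHECK(cudaStreamDestroy(stream[b]));
        CHECK(cudaFreeHost(host[b]));
        CHECK(cudaFree(dev[b]));
    }
    return mismatches;
}
```

Calling runOverlapped(1 << 20, 6), for example, runs six iterations. Whether the copies really overlap the kernels depends on the hardware, so it is worth confirming with a profiler.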


I am using pinned memory in my kernel and declaring it like this:

When I write

I get the error cudaErrorPriorLaunchFailure.

What is the cause of that error? Please help.
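
Going by its name, cudaErrorPriorLaunchFailure points at an earlier kernel launch rather than at the call that reported it. Kernel launches are asynchronous, so a failure inside a kernel tends to surface at the next CUDA call, often the cudaMemcpy. One plausible cause, which the missing snippets make impossible to confirm, is a host pointer from cudaMallocHost being handed to the kernel as if it were device memory. The sketch below checks for errors right after the launch and keeps the kernel output in a cudaMalloc buffer. fillKernel, launchAndCheck and checkedRoundTrip are hypothetical names.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: writes each element's index into the output.
__global__ void fillKernel(int *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = i;
}

// Launches the kernel and reports errors before any copy is attempted.
cudaError_t launchAndCheck(int *devOut, int n)
{
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    fillKernel<<<blocks, threads>>>(devOut, n);

    cudaError_t err = cudaGetLastError();      // bad launch configuration etc.
    if (err != cudaSuccess) {
        fprintf(stderr, "launch failed: %s\n", cudaGetErrorString(err));
        return err;
    }
    err = cudaDeviceSynchronize();             // errors raised while running
    if (err != cudaSuccess)
        fprintf(stderr, "kernel failed: %s\n", cudaGetErrorString(err));
    return err;
}

// Device buffer from cudaMalloc, pinned host buffer from cudaMallocHost.
// Assumes n > 0. Returns 0 if the round trip produced the expected data.
int checkedRoundTrip(int n)
{
    const size_t bytes = (size_t)n * sizeof(int);
    int *dev = NULL, *host = NULL;
    int status = 1;

    if (cudaMalloc((void **)&dev, bytes) != cudaSuccess) return 1;
    if (cudaMallocHost((void **)&host, bytes) != cudaSuccess) {
        cudaFree(dev);
        return 1;
    }
    if (launchAndCheck(dev, n) == cudaSuccess &&
        cudaMemcpy(host, dev, bytes, cudaMemcpyDeviceToHost) == cudaSuccess &&
        host[0] == 0 && host[n - 1] == n - 1) {
        status = 0;
    }
    cudaFreeHost(host);
    cudaFree(dev);
    return status;
}
```

With the check placed right after the launch, a failing kernel is reported where it happens instead of at a later, unrelated call.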

Actually, my cudaMemcpy is the last call of my program, i.e. it copies the final output image from device to host. So how can I overlap the copy with anything?

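Try using:

cudaMallocHost instead of malloc for the host buffer

cudaFreeHost instead of free

The calls look exactly like cudaMalloc and cudaFree; you just add “Host” to the name to call a different function. The device buffer itself still comes from cudaMalloc.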

It really sped up my cudaMemcpy operations, by 100x or more (3 ms -> ~0.018 ms; these are just rough figures). It should do wonders for you, as your operation takes 21 ms.
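
To see what pinned memory buys on a particular machine, something like the following sketch could time the same device-to-host copy into a malloc buffer and into a cudaMallocHost buffer. timeCopyMs and comparePinnedAndPageable are hypothetical names, and 4 bytes per pixel is an assumption, since the thread does not state the image format. The cudaDeviceSynchronize before the timer matters: without it, the measured copy time would also include waiting for any kernel that is still running.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Times one device-to-host copy in milliseconds using CUDA events.
// Returns a negative value if something went wrong.
float timeCopyMs(void *hostDst, const void *devSrc, size_t bytes)
{
    cudaEvent_t start, stop;
    float ms = -1.0f;
    if (cudaEventCreate(&start) != cudaSuccess) return -1.0f;
    if (cudaEventCreate(&stop) != cudaSuccess) {
        cudaEventDestroy(start);
        return -1.0f;
    }
    cudaDeviceSynchronize();   // do not count earlier kernels as copy time
    cudaEventRecord(start, 0);
    cudaError_t err = cudaMemcpy(hostDst, devSrc, bytes, cudaMemcpyDeviceToHost);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);
    if (err == cudaSuccess) cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

// Copies the same device image into pageable and pinned host memory and
// reports both timings. Returns 0 on success.
int comparePinnedAndPageable(int width, int height,
                             float *pageableMs, float *pinnedMs)
{
    const size_t bytes = (size_t)width * height * 4; // assumed 4 bytes/pixel
    unsigned char *dev = NULL, *pinned = NULL;
    unsigned char *pageable = (unsigned char *)malloc(bytes);
    if (pageable == NULL) return 1;
    if (cudaMalloc((void **)&dev, bytes) != cudaSuccess) {
        free(pageable);
        return 1;
    }
    if (cudaMallocHost((void **)&pinned, bytes) != cudaSuccess) {
        cudaFree(dev);
        free(pageable);
        return 1;
    }
    cudaMemset(dev, 0x5A, bytes);   // known pattern so the copies can be checked

    *pageableMs = timeCopyMs(pageable, dev, bytes);
    *pinnedMs = timeCopyMs(pinned, dev, bytes);
    int ok = *pageableMs >= 0.0f && *pinnedMs >= 0.0f &&
             pageable[bytes - 1] == 0x5A && pinned[bytes - 1] == 0x5A;

    cudaFreeHost(pinned);
    cudaFree(dev);
    free(pageable);
    return ok ? 0 : 1;
}
```

Calling comparePinnedAndPageable(2400, 1800, &a, &b) and printing a and b gives the two timings for an image of the size discussed here. The actual gain varies with the GPU, the bus and the driver, so measure rather than rely on anyone's rough figures.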