In my program, after doing all the calculations, I have to copy the resulting image of size 2400 * 1800 back from device to host using cudaMemcpy. But it takes 21 ms, which is very expensive for my program because it accounts for 30% of the overall execution time.
You could also try to overlap copying the current result with launching the next loop iteration’s kernel (well, actually the other way round).
Since kernel invocations are asynchronous, you could do something like this:
// Prepare output space 1.
// Run kernel iteration 1.
// Run kernel iteration 2 --> this is async.
// Copy kernel iteration 1 output to output space 1.
cudaDeviceSynchronize(); // very important (cudaThreadSynchronize is deprecated)
// Run kernel iteration 3 --> this is async.
// Copy kernel iteration 2 output to output space 2.
cudaDeviceSynchronize(); // very important
...
...
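The pattern above can be sketched as a double-buffered loop with two streams and `cudaMemcpyAsync`, so that copying one iteration’s result overlaps with the next iteration’s kernel. This is only a sketch under assumptions: the kernel name `processImage`, the one-byte-per-pixel buffers, and the iteration count are placeholders for your own code, not anything from the original post.

```cuda
#include <cuda_runtime.h>

#define W 2400
#define H 1800
#define N_ITER 4

// Dummy stand-in for the real per-iteration kernel.
__global__ void processImage(unsigned char *out) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < W * H) out[i] = (unsigned char)(i & 0xFF);
}

int main(void) {
    size_t bytes = (size_t)W * H;
    unsigned char *h_out[2], *d_out[2];
    cudaStream_t stream[2];

    for (int b = 0; b < 2; ++b) {
        cudaMallocHost(&h_out[b], bytes);  // pinned host memory: required for truly async copies
        cudaMalloc(&d_out[b], bytes);
        cudaStreamCreate(&stream[b]);
    }

    int threads = 256, blocks = (W * H + threads - 1) / threads;
    for (int it = 0; it < N_ITER; ++it) {
        int b = it & 1;  // alternate between the two buffers/streams
        // The kernel and copy for iteration `it` are queued in stream[b];
        // the copy for the previous iteration can overlap with this kernel
        // because it was queued in the other stream.
        processImage<<<blocks, threads, 0, stream[b]>>>(d_out[b]);
        cudaMemcpyAsync(h_out[b], d_out[b], bytes,
                        cudaMemcpyDeviceToHost, stream[b]);
    }
    cudaDeviceSynchronize();  // wait for all pending kernels and copies

    for (int b = 0; b < 2; ++b) {
        cudaStreamDestroy(stream[b]);
        cudaFree(d_out[b]);
        cudaFreeHost(h_out[b]);
    }
    return 0;
}
```

Note that the overlap only happens if the host buffers are pinned (allocated with `cudaMallocHost`); with pageable memory the async copies fall back to synchronous behaviour.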
Search for “overlapping transfers and computation” on Google, or look at the SDK samples.
The declarations are exactly the same; you just add “Host” to call a different function (i.e. cudaMallocHost instead of cudaMalloc for the host buffer).
It really sped up my cudaMemcpy operations, by more than 100x (3 ms → ~0.018 ms; these are just rough figures). It should do wonders for you, since your copy takes 21 ms.
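A minimal sketch of the swap being described: allocate the host buffer with `cudaMallocHost` (pinned memory) instead of `malloc`, and everything else stays the same. The buffer size and names here are illustrative, not from the original post.

```cuda
#include <cuda_runtime.h>

int main(void) {
    size_t bytes = 2400UL * 1800UL;  // one byte per pixel, for illustration
    unsigned char *h_img, *d_img;

    // Pageable (ordinary) allocation would be:
    //   h_img = (unsigned char *)malloc(bytes);
    // Pinned allocation: same usage afterwards, different allocator.
    cudaMallocHost(&h_img, bytes);

    cudaMalloc(&d_img, bytes);

    // cudaMemcpy from/to pinned memory avoids the driver's internal
    // staging copy, so the transfer runs at full PCIe bandwidth.
    cudaMemcpy(h_img, d_img, bytes, cudaMemcpyDeviceToHost);

    cudaFree(d_img);
    cudaFreeHost(h_img);  // pinned memory must be freed with cudaFreeHost
    return 0;
}
```

The speedup comes from the fact that pageable host memory forces the driver to stage the transfer through an internal pinned buffer, while pinned memory lets the DMA engine read and write your buffer directly.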