In my program, after doing all the calculations, I have to copy the resultant image of size 2400 × 1800 from device to host using cudaMemcpy. But it takes 21 ms, which is very expensive for my program because it is about 30% of the overall execution time.
You could also try to overlap copying the current result with launching the next loop iteration's kernel (well, actually the other way round).
Since kernel launches are asynchronous, you could do something like this:
// Prepare output space 1.
// Run kernel iteration 1.
// Run kernel iteration 2 --> async launch.
// Copy kernel iteration 1 output to output space 1.
cudaDeviceSynchronize(); // --> very important (cudaThreadSynchronize() is deprecated).
// Run kernel iteration 3 --> async launch.
// Copy kernel iteration 2 output...
cudaDeviceSynchronize(); // --> very important.
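The steps above could be sketched as follows. This is only an illustration, not your actual code: the kernel `process`, the double-buffering scheme, and the stream names are all assumptions. Note that real copy/compute overlap requires cudaMemcpyAsync in a separate stream plus page-locked (pinned) host memory; a plain cudaMemcpy would serialize with the kernel.

```cuda
#include <cuda_runtime.h>

// Hypothetical placeholder kernel standing in for your per-iteration work.
__global__ void process(unsigned char *out, int n, int iter)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = (unsigned char)(i + iter); // dummy computation
}

int main()
{
    const int n = 2400 * 1800; // image size from the question
    const int iters = 4;       // assumed number of loop iterations

    unsigned char *h_out;      // pinned host memory: required for async copies
    cudaMallocHost(&h_out, n);

    unsigned char *d_out[2];   // double-buffered device output
    cudaMalloc(&d_out[0], n);
    cudaMalloc(&d_out[1], n);

    cudaStream_t comp, copy;   // one stream for compute, one for copies
    cudaStreamCreate(&comp);
    cudaStreamCreate(&copy);

    dim3 block(256), grid((n + 255) / 256);

    for (int it = 0; it < iters; ++it) {
        int cur = it & 1;
        // Launch iteration `it` asynchronously in the compute stream.
        process<<<grid, block, 0, comp>>>(d_out[cur], n, it);
        if (it > 0) {
            // Meanwhile, copy the *previous* iteration's result back to the host.
            cudaMemcpyAsync(h_out, d_out[cur ^ 1], n,
                            cudaMemcpyDeviceToHost, copy);
        }
        cudaDeviceSynchronize(); // very important: both streams done before reuse
    }
    // Drain the last iteration's result.
    cudaMemcpyAsync(h_out, d_out[(iters - 1) & 1], n,
                    cudaMemcpyDeviceToHost, copy);
    cudaDeviceSynchronize();

    cudaStreamDestroy(comp);
    cudaStreamDestroy(copy);
    cudaFree(d_out[0]);
    cudaFree(d_out[1]);
    cudaFreeHost(h_out);
    return 0;
}
```

With this pattern the 21 ms transfer is hidden behind the next kernel's execution instead of adding to it, as long as each kernel iteration takes at least as long as the copy.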
Search for "overlapping data transfers" or look at the asyncAPI and simpleStreams examples in the CUDA SDK samples.