Why is cudaMemcpyDeviceToHost so slow?

Hi sir,
I have a question about cudaMemcpy with cudaMemcpyDeviceToHost. In figure 1, copying the image data from device to host is very slow. But in figure 2, copying from host to host is fast. I am confused. How can I improve the speed of the device-to-host copy?
Thanks.

Image 10.bmp (5.9 MB)
Image 10.bmp (5.9 MB)

This is a common question. CUDA kernel launches and various CUDA library calls are asynchronous. This means they return control to the host thread before the operation is complete.

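The subsequent cudaMemcpy operation, on the other hand, blocks the host thread until the previously issued CUDA operations are complete, and only then performs the copy. So you are not timing what you think you are timing: the interval you measure around cudaMemcpy includes the time the preceding operation still needs to finish, not just the transfer itself. To “fix” this, one possible approach would be like this: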

...
cudaDeviceSynchronize();  // add this line
start = clock();
cudaMemcpy(dstCuda, ..., cudaMemcpyDeviceToHost);
end = clock();
...

Note that to make your timing of the preceding nppiFilterBoxBorder function “correct”, you might actually want to put that cudaDeviceSynchronize(); call before the end=clock(); statement that closes the timing of the npp function. That way the measured interval covers the execution of the filter and not just the time needed to launch it.
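To see the effect in isolation, here is a minimal, self-contained sketch. It is not your code: busyKernel is a hypothetical stand-in for the nppiFilterBoxBorder call, and the buffer size and iteration count are arbitrary assumptions chosen only to make the kernel take a noticeable amount of time. It times the same launch and device-to-host copy twice, first without and then with cudaDeviceSynchronize() after the launch.

```cuda
#include <chrono>
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical stand-in for the NPP filter: a deliberately slow kernel.
__global__ void busyKernel(float *data, int n, int iters)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v = data[i];
        for (int k = 0; k < iters; ++k)
            v = v * 1.0001f + 0.5f;
        data[i] = v;
    }
}

// Host wall-clock milliseconds elapsed since t0.
static double msSince(std::chrono::steady_clock::time_point t0)
{
    return std::chrono::duration<double, std::milli>(
               std::chrono::steady_clock::now() - t0).count();
}

int main()
{
    const int n = 1 << 22;      // assumed size: about 16 MB of floats
    const int iters = 2000;     // assumed workload, tune as needed
    const size_t bytes = n * sizeof(float);
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;

    std::vector<float> host(n, 1.0f);
    float *dev = nullptr;
    if (cudaMalloc(&dev, bytes) != cudaSuccess) {
        std::printf("cudaMalloc failed\n");
        return 1;
    }
    cudaMemcpy(dev, host.data(), bytes, cudaMemcpyHostToDevice);

    // Warm-up launch so one-time startup costs are not measured below.
    busyKernel<<<blocks, threads>>>(dev, n, iters);
    cudaDeviceSynchronize();

    // Misleading timing: the launch returns before the kernel finishes,
    // so the blocking cudaMemcpy also waits for the kernel.
    auto t0 = std::chrono::steady_clock::now();
    busyKernel<<<blocks, threads>>>(dev, n, iters);
    double launchOnlyMs = msSince(t0);

    t0 = std::chrono::steady_clock::now();
    cudaMemcpy(host.data(), dev, bytes, cudaMemcpyDeviceToHost);
    double copyPlusWaitMs = msSince(t0);

    // Separated timing: synchronize before reading the kernel's end time.
    t0 = std::chrono::steady_clock::now();
    busyKernel<<<blocks, threads>>>(dev, n, iters);
    cudaDeviceSynchronize();
    double kernelMs = msSince(t0);

    t0 = std::chrono::steady_clock::now();
    cudaMemcpy(host.data(), dev, bytes, cudaMemcpyDeviceToHost);
    double copyMs = msSince(t0);

    std::printf("no sync:   launch %.3f ms, copy %.3f ms\n",
                launchOnlyMs, copyPlusWaitMs);
    std::printf("with sync: kernel %.3f ms, copy %.3f ms\n",
                kernelMs, copyMs);

    cudaFree(dev);
    return 0;
}
```

Built with nvcc, I would expect the "no sync" line to show a very short launch and a long copy, and the "with sync" line to move most of that time to the kernel. The exact numbers depend on your GPU and system. If the copy is still slower than you would like after this change, then the transfer itself is the cost and that is a separate question.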