I have a question about cudaMemcpyDeviceToHost. In Figure 1, copying the image data from device to host is very slow. But in Figure 2, a host-to-host copy is fast. I am confused. How can I improve the speed of the cudaMemcpyDeviceToHost copy?
This is a common question. CUDA kernel launches and various CUDA library calls are asynchronous. This means they return control to the host thread before the operation is complete.
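A cudaMemcpy operation, on the other hand, blocks the host thread until the previous CUDA operations are complete, and only then performs the copy. So you are not timing what you think you are timing: the time still needed by the earlier asynchronous work gets counted as part of the cudaMemcpy. To “fix” this, one possible approach would be like this: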
    ...
    cudaDeviceSynchronize(); // add this line
    start = clock();
    cudaMemcpy(dstCuda, ..., cudaMemcpyDeviceToHost);
    end = clock();
    ...
Note that to make your timing of the previous nppiFilterBoxBorder function “correct”, you might actually want to put that cudaDeviceSynchronize(); call before the previous end=clock(); statement associated with timing of the npp function.
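To make the effect concrete, here is a minimal self-contained sketch, not your actual code. It assumes a hypothetical kernel `busyKernel` as a stand-in for the asynchronous nppiFilterBoxBorder call, an arbitrary buffer size, and `std::chrono::steady_clock` instead of `clock()` (my choice, to get wall-clock time). It times the same work twice: once without the synchronize, and once with cudaDeviceSynchronize() placed before the end of the kernel timing.

```cuda
#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical stand-in for the asynchronous GPU work (the NPP filter in the question).
__global__ void busyKernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v = data[i];
        for (int k = 0; k < 2000; ++k) v = v * 1.0001f + 0.5f;  // artificial load
        data[i] = v;
    }
}

// Abort with a message if a CUDA runtime call failed.
static void check(cudaError_t err, const char *what)
{
    if (err != cudaSuccess) {
        fprintf(stderr, "%s: %s\n", what, cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }
}

// Wall-clock milliseconds elapsed since t0.
static double msSince(std::chrono::steady_clock::time_point t0)
{
    auto dt = std::chrono::steady_clock::now() - t0;
    return std::chrono::duration<double, std::milli>(dt).count();
}

int main()
{
    const int n = 1 << 22;                      // arbitrary size, assumption
    const size_t bytes = n * sizeof(float);
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;

    std::vector<float> host(n);
    float *dev = nullptr;
    check(cudaMalloc(&dev, bytes), "cudaMalloc");
    check(cudaMemset(dev, 0, bytes), "cudaMemset");

    // Warm-up launch so one-time startup cost is not measured below.
    busyKernel<<<blocks, threads>>>(dev, n);
    check(cudaDeviceSynchronize(), "warm-up");

    // Misleading: the launch returns immediately, so the kernel's run time
    // is charged to the cudaMemcpy that follows it.
    auto t0 = std::chrono::steady_clock::now();
    busyKernel<<<blocks, threads>>>(dev, n);
    double launchOnlyMs = msSince(t0);
    t0 = std::chrono::steady_clock::now();
    check(cudaMemcpy(host.data(), dev, bytes, cudaMemcpyDeviceToHost), "cudaMemcpy");
    double copyPlusKernelMs = msSince(t0);

    // Corrected: synchronize before stopping the kernel timer, so the copy
    // timer starts only after the GPU work has finished.
    t0 = std::chrono::steady_clock::now();
    busyKernel<<<blocks, threads>>>(dev, n);
    check(cudaDeviceSynchronize(), "cudaDeviceSynchronize");
    double kernelMs = msSince(t0);
    t0 = std::chrono::steady_clock::now();
    check(cudaMemcpy(host.data(), dev, bytes, cudaMemcpyDeviceToHost), "cudaMemcpy");
    double copyMs = msSince(t0);

    printf("without sync: kernel %.3f ms, copy %.3f ms\n", launchOnlyMs, copyPlusKernelMs);
    printf("with sync:    kernel %.3f ms, copy %.3f ms\n", kernelMs, copyMs);

    check(cudaFree(dev), "cudaFree");
    return 0;
}
```

This should build with something like `nvcc timing_sketch.cu -o timing_sketch` (the file name is just an example). I would expect the "without sync" copy figure to be noticeably larger than the "with sync" one, since it includes the kernel's run time, but the actual numbers depend on your GPU and system.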