cudaMemcpy takes a long time?

Hi All,

In my code:

cudaMemcpy(dst, src , dstSize, cudaMemcpyDeviceToHost);

Here
dst = host pointer
src = device pointer
dstSize = 2000 * 1000 * sizeof(unsigned char)

This cudaMemcpy takes 16 ms, which is very costly.

Can anyone help explain why it takes so much time?

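Probably because your time measurement includes the kernel execution time. Kernel launches are asynchronous, so unless you call an explicit host-side synchronization such as cudaDeviceSynchronize() (cudaThreadSynchronize() in older CUDA releases) after your kernel launch, the cudaMemcpy() will block until the kernel has finished running and only then perform the copy. Your host timer then attributes the kernel's run time to the copy. Note that __syncthreads() does not help here: it is a device-side barrier between the threads of one block, not a host-side wait.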
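To check whether this is what is happening, you can time the kernel and the copy separately. The following is only a minimal sketch: busyKernel, Timings, msSince and timeKernelAndCopy are made-up names, the kernel is a stand-in for whatever you actually launch before the copy, and error checking is left out for brevity.

```cuda
#include <chrono>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical stand-in for the kernel that runs before the copy.
__global__ void busyKernel(unsigned char *data, size_t n)
{
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) {
        unsigned char v = 0;
        for (int k = 0; k < 2000; ++k)
            v = (unsigned char)(v * 31 + k);
        data[i] = v;
    }
}

// Hypothetical result type: host-side wall-clock times in milliseconds.
struct Timings {
    double kernelMs;
    double copyMs;
};

// Milliseconds elapsed on the host since t0.
static double msSince(std::chrono::steady_clock::time_point t0)
{
    return std::chrono::duration<double, std::milli>(
               std::chrono::steady_clock::now() - t0).count();
}

Timings timeKernelAndCopy()
{
    const size_t dstSize = 2000 * 1000 * sizeof(unsigned char);
    unsigned char *dst = (unsigned char *)malloc(dstSize);  // pageable host memory
    unsigned char *src = NULL;
    cudaMalloc((void **)&src, dstSize);

    const int threads = 256;
    const int blocks = (int)((dstSize + threads - 1) / threads);

    Timings t;

    auto t0 = std::chrono::steady_clock::now();
    busyKernel<<<blocks, threads>>>(src, dstSize);  // launch returns immediately
    cudaDeviceSynchronize();                        // host waits here for the kernel
    t.kernelMs = msSince(t0);

    // The kernel has already finished, so this interval covers only the copy.
    auto t1 = std::chrono::steady_clock::now();
    cudaMemcpy(dst, src, dstSize, cudaMemcpyDeviceToHost);
    t.copyMs = msSince(t1);

    cudaFree(src);
    free(dst);
    return t;
}
```

If you remove the cudaDeviceSynchronize() call and time only the cudaMemcpy(), the kernel's run time should show up in the copy measurement instead, which is the effect described above.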

Also, is the host memory pinned (i.e. allocated via cudaMallocHost() rather than malloc())? If not, the copy can take roughly twice as long.
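If you want to see the pinned-memory effect on your own machine, something along these lines should work. Again this is just a sketch: timeDeviceToHostCopy, CopyTimes and comparePageableAndPinned are hypothetical helper names, error checking is omitted, and the actual ratio between the two timings depends on your system.

```cuda
#include <cstdlib>
#include <cuda_runtime.h>

// Times one device-to-host cudaMemcpy with CUDA events; returns milliseconds.
float timeDeviceToHostCopy(unsigned char *dst, const unsigned char *src, size_t bytes)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    cudaMemcpy(dst, src, bytes, cudaMemcpyDeviceToHost);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);  // wait until the stop event has been reached

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

// Hypothetical result type for the comparison.
struct CopyTimes {
    float pageableMs;
    float pinnedMs;
};

CopyTimes comparePageableAndPinned()
{
    const size_t dstSize = 2000 * 1000 * sizeof(unsigned char);

    unsigned char *src = NULL;
    cudaMalloc((void **)&src, dstSize);
    cudaMemset(src, 0, dstSize);

    unsigned char *pageable = (unsigned char *)malloc(dstSize);  // ordinary host memory
    unsigned char *pinned = NULL;
    cudaMallocHost((void **)&pinned, dstSize);                   // page-locked host memory

    // Warm-up copy so one-time startup costs are not charged to the first measurement.
    cudaMemcpy(pageable, src, dstSize, cudaMemcpyDeviceToHost);

    CopyTimes t;
    t.pageableMs = timeDeviceToHostCopy(pageable, src, dstSize);
    t.pinnedMs = timeDeviceToHostCopy(pinned, src, dstSize);

    cudaFreeHost(pinned);
    free(pageable);
    cudaFree(src);
    return t;
}
```

No kernel runs before these copies, so the two numbers reflect only the transfer itself. Pinned memory is a limited resource, so it is usually worth allocating it once and reusing the buffer rather than allocating it per copy.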