1st and 2nd Memcopy timing details

Dear Experts,

Consider the following code segment:

cudaMemcpy(…); // 1st Memcpy
CudaKernel <<< …, … >>> ( …);
cudaMemcpy( …); // 2nd Memcpy

Whenever I profile my code on the GPU, the 2nd memcopy takes more time than the first one. Both memcopies copy the same number of elements, and there are no uncoalesced memory accesses in either case. I am using a GTX 260.

Regards,

Sajid Anwar Khan

This is a sample output from the CUDA profiler:

First memcopy method=[ memcopy ] gputime=[ 13336.896 ] cputime=[ 23372.000 ]
and
Second memcopy method=[ memcopy ] gputime=[ 29527.936 ] cputime=[ 60769.000 ]

The same number of elements is copied in both cases. What could be the reason?

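Since you don't have a cudaThreadSynchronize() after the kernel, the time reported for the 2nd memcopy also includes the kernel execution time.

Kernel launches are asynchronous: the launch returns control to the host right away, and the cudaMemcpy that follows cannot start copying until the kernel has finished, so it ends up waiting for it.

Insert a cudaThreadSynchronize() between the kernel and the 2nd memcopy and you will see the difference.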
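
To make the suggestion concrete, here is a minimal sketch that times the 2nd memcopy from the host, once without and once with a synchronization after the kernel. Everything in it is an assumption for illustration, not the original poster's code: the kernel busyKernel, the helper timeSecondCopy, the buffer size, the loop count, and the copy directions (host to device first, device to host second). It calls cudaDeviceSynchronize(), which is the current name for the deprecated cudaThreadSynchronize(). Error checking is left out to keep it short.

```cuda
#include <chrono>
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical stand-in for CudaKernel: loops per element to burn GPU time.
__global__ void busyKernel(float *data, int n, int iters)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v = data[i];
        for (int k = 0; k < iters; ++k)
            v = v * 1.0001f + 0.5f;
        data[i] = v;
    }
}

// Returns the host wall-clock time, in milliseconds, spent in the 2nd memcopy.
static double timeSecondCopy(bool syncBeforeCopy)
{
    const int n = 1 << 20;                      // assumed element count
    const size_t bytes = n * sizeof(float);
    std::vector<float> host(n, 1.0f);
    float *dev = nullptr;
    cudaMalloc(&dev, bytes);

    cudaMemcpy(dev, host.data(), bytes, cudaMemcpyHostToDevice);  // 1st memcopy
    busyKernel<<<(n + 255) / 256, 256>>>(dev, n, 20000);          // launch returns immediately
    if (syncBeforeCopy)
        cudaDeviceSynchronize();                // wait for the kernel before the timer starts

    auto t0 = std::chrono::steady_clock::now();
    cudaMemcpy(host.data(), dev, bytes, cudaMemcpyDeviceToHost);  // 2nd memcopy
    auto t1 = std::chrono::steady_clock::now();

    cudaFree(dev);
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}

int main()
{
    cudaFree(0);  // create the CUDA context up front so it is not timed
    double unsynced = timeSecondCopy(false);
    double synced = timeSecondCopy(true);
    printf("2nd memcopy without sync: %.3f ms\n", unsynced);
    printf("2nd memcopy with sync:    %.3f ms\n", synced);
    return 0;
}
```

If the explanation above holds, the figure without the sync should come out larger, because the copy has to wait for the kernel inside the timed region. The exact numbers depend on the GPU and driver.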

Also note that the first CUDA call always takes more time because of CUDA initialization. Usually that first call is a cudaMalloc, so it is irrelevant to the scenario above. Just FYI.
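
A quick way to see the initialization cost for yourself is to time the same call twice in a fresh process. The sketch below is only an illustration under assumptions: the helper timeMallocFree and the 1 MB allocation size are made up for the example, and it measures host wall-clock time rather than profiler output.

```cuda
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

// Returns the host wall-clock time, in milliseconds, of one cudaMalloc/cudaFree pair.
static double timeMallocFree(size_t bytes)
{
    void *p = nullptr;
    auto t0 = std::chrono::steady_clock::now();
    cudaError_t err = cudaMalloc(&p, bytes);
    if (err == cudaSuccess)
        cudaFree(p);
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}

int main()
{
    const size_t bytes = 1 << 20;           // assumed size, 1 MB
    double first = timeMallocFree(bytes);   // first CUDA call in the process: also pays for initialization
    double second = timeMallocFree(bytes);  // same call again, context already exists
    printf("first  cudaMalloc/cudaFree: %.3f ms\n", first);
    printf("second cudaMalloc/cudaFree: %.3f ms\n", second);
    return 0;
}
```

When timing code by hand, a common idiom is to issue a cheap call such as cudaFree(0) before starting any timers, so the one-time setup does not land in the measurement.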