Whenever I profile my code on the GPU, the second memcpy takes more time than the first one. Both memcpys copy the same number of elements, and there are no uncoalesced memory accesses in either case. I am using a GTX 260.
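Since you don't call cudaThreadSynchronize(), the measured memcpy time also includes the kernel execution time.
Kernel calls are asynchronous: the launch returns to the host immediately, and the memcpy that follows has to wait for the kernel to finish before it can start copying.
Insert a cudaThreadSynchronize() between the kernel and the second memcpy and you will see the difference.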
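To make the effect concrete, here is a minimal sketch that times the copy back to the host with and without a synchronize in front of it. The kernel `busyKernel`, the helper `timeCopyBack`, the array size and the host-side `std::chrono` timer are all made up for illustration and are not taken from the original code. It calls cudaDeviceSynchronize(), which is the current name for the deprecated cudaThreadSynchronize(); on an old toolkit you would use the latter.

```cuda
#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical kernel: does enough arithmetic per element that its
// run time is clearly visible next to the copy time.
__global__ void busyKernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v = data[i];
        for (int k = 0; k < 10000; ++k)
            v = v * 1.0001f + 0.5f;
        data[i] = v;
    }
}

// Returns the host-measured time (ms) of the device-to-host memcpy.
// If syncFirst is true, the kernel is waited for before the timer starts,
// so the measurement covers only the copy.
float timeCopyBack(bool syncFirst)
{
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);

    float *h = (float *)malloc(bytes);
    for (int i = 0; i < n; ++i) h[i] = 1.0f;

    float *d = NULL;
    cudaMalloc((void **)&d, bytes);
    cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);   // 1st memcpy

    busyKernel<<<(n + 255) / 256, 256>>>(d, n);        // returns immediately
    if (syncFirst)
        cudaDeviceSynchronize();                       // wait for the kernel

    auto t0 = std::chrono::steady_clock::now();
    cudaMemcpy(h, d, bytes, cudaMemcpyDeviceToHost);   // 2nd memcpy
    auto t1 = std::chrono::steady_clock::now();

    cudaFree(d);
    free(h);
    return std::chrono::duration<float, std::milli>(t1 - t0).count();
}

int main()
{
    cudaFree(0);  // trigger CUDA initialization before any timing

    printf("2nd memcpy without sync: %.3f ms\n", timeCopyBack(false));
    printf("2nd memcpy with sync:    %.3f ms\n", timeCopyBack(true));
    return 0;
}
```

Without the synchronize, the timer around the second memcpy also covers the time the copy spends waiting for `busyKernel`, so the first number should come out larger. The exact figures depend on the GPU and driver.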
Also note that the first CUDA call always takes more time because of CUDA initialization. Usually this is a cudaMalloc, so it is irrelevant to the above scenario. Just FYI.
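If you want to see the initialization cost for yourself, a rough sketch like the one below times two identical cudaMalloc/cudaFree pairs from the host. The helper name `timedMallocMs` and the 1 MB size are arbitrary choices for illustration.

```cuda
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

// Host wall-clock time (ms) of one cudaMalloc/cudaFree pair.
static double timedMallocMs(size_t bytes)
{
    void *p = NULL;
    auto t0 = std::chrono::steady_clock::now();
    cudaMalloc(&p, bytes);
    cudaFree(p);
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}

int main()
{
    // The first call also pays for CUDA initialization.
    double first = timedMallocMs(1 << 20);
    // The second call runs with CUDA already initialized.
    double second = timedMallocMs(1 << 20);

    printf("first  cudaMalloc/cudaFree: %.3f ms\n", first);
    printf("second cudaMalloc/cudaFree: %.3f ms\n", second);
    return 0;
}
```

A common way to keep this one-time cost out of a measurement is to make a throwaway call such as cudaFree(0) at program start, before any timing begins.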