Consider the following code segment:
cudaMemcpy(…); // 1st Memcpy
CudaKernel <<< …, … >>> ( …);
cudaMemcpy( …); // 2nd Memcpy
Whenever I profile my code on the GPU, the 2nd cudaMemcpy takes more time than the 1st. Both copies transfer the same number of elements, and there are no uncoalesced memory accesses in either case. I am using a GTX 260.
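One thing that may be worth ruling out is the timing method itself. Kernel launches are asynchronous with respect to the host, and a blocking cudaMemcpy issued after a kernel waits for that kernel to finish, so a host-side timer around the 2nd copy can also include kernel execution time. The sketch below is one possible way to time each stage separately with CUDA events. The kernel `scaleKernel`, the element count, the launch configuration, and the `CHECK` macro are placeholders invented for illustration, not the code from this post.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper: abort with a message if a CUDA runtime call fails.
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "CUDA error: %s (line %d)\n",             \
                    cudaGetErrorString(err_), __LINE__);              \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Hypothetical stand-in for the kernel in the question.
__global__ void scaleKernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main()
{
    const int n = 1 << 20;                    // assumed element count
    const size_t bytes = n * sizeof(float);

    float *h = (float *)malloc(bytes);
    if (h == NULL) return EXIT_FAILURE;
    for (int i = 0; i < n; ++i) h[i] = 1.0f;

    float *d = NULL;
    CHECK(cudaMalloc((void **)&d, bytes));

    // Four events mark the boundaries of the three stages.
    cudaEvent_t e0, e1, e2, e3;
    CHECK(cudaEventCreate(&e0));
    CHECK(cudaEventCreate(&e1));
    CHECK(cudaEventCreate(&e2));
    CHECK(cudaEventCreate(&e3));

    CHECK(cudaEventRecord(e0));
    CHECK(cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice));   // 1st Memcpy
    CHECK(cudaEventRecord(e1));

    scaleKernel<<<(n + 255) / 256, 256>>>(d, n);              // asynchronous launch
    CHECK(cudaGetLastError());
    // e2 is reached on the GPU only after the kernel has finished,
    // so the e2..e3 interval excludes kernel execution time.
    CHECK(cudaEventRecord(e2));

    CHECK(cudaMemcpy(h, d, bytes, cudaMemcpyDeviceToHost));   // 2nd Memcpy
    CHECK(cudaEventRecord(e3));
    CHECK(cudaEventSynchronize(e3));

    float msCopy1 = 0.0f, msKernel = 0.0f, msCopy2 = 0.0f;
    CHECK(cudaEventElapsedTime(&msCopy1, e0, e1));
    CHECK(cudaEventElapsedTime(&msKernel, e1, e2));
    CHECK(cudaEventElapsedTime(&msCopy2, e2, e3));
    printf("1st memcpy: %.3f ms\nkernel:     %.3f ms\n2nd memcpy: %.3f ms\n",
           msCopy1, msKernel, msCopy2);

    CHECK(cudaEventDestroy(e0));
    CHECK(cudaEventDestroy(e1));
    CHECK(cudaEventDestroy(e2));
    CHECK(cudaEventDestroy(e3));
    CHECK(cudaFree(d));
    free(h);
    return 0;
}
```

If the 2nd copy is still slower when measured this way, the difference would not be explained by the kernel wait, and other factors (for example, transfer direction or pageable versus pinned host memory) might be worth examining.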
Sajid Anwar Khan