I have a loop containing two statements that copy data from device to host. Both statements copy the same amount of data; however, the first copies a single column of a 2D array and the second copies a whole vector:
X = X_dev(:,1)
Y = Y_dev
I am using CUDA Fortran (column-major storage, so the column X_dev(:,1) is contiguous in memory). Both copies are called the same number of times, yet the total “mem transfer size (bytes)” reported by cudaprof differs:
Method | #calls | mem transfer size (bytes) | statement
memcpyDtoH | 55800 | 242688 | X = X_dev(:,1)
memcpyDtoH | 55800 | 80896 | Y = Y_dev
Could someone give me a reasonable explanation for the difference (242688 is exactly 3 × 80896)? One important detail: Y is in page-locked (pinned) host memory, while X is in pageable host memory.
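
A minimal sketch of the setup described above might look like the following. The array sizes, the element type, the iteration count, and the program name are placeholders chosen for illustration and are not taken from the actual code; only the two copy statements and the pinned/pageable distinction come from the description.

```fortran
program dtoh_copy_sketch
  use cudafor
  implicit none

  ! Hypothetical values: NOT the sizes used in the original code
  integer, parameter :: n     = 1024   ! length of each copied vector
  integer, parameter :: ncols = 4      ! number of columns in X_dev
  integer, parameter :: niter = 10     ! number of loop iterations

  real, device, allocatable :: X_dev(:,:), Y_dev(:)
  real, allocatable         :: X(:)    ! pageable host memory
  real, pinned, allocatable :: Y(:)    ! page-locked (pinned) host memory
  integer :: i

  allocate(X_dev(n, ncols), Y_dev(n))
  allocate(X(n), Y(n))

  ! Fill the device arrays so the copies have defined contents
  X_dev = 1.0
  Y_dev = 2.0

  do i = 1, niter
     X = X_dev(:,1)   ! one contiguous column, device -> pageable host
     Y = Y_dev        ! whole vector,          device -> pinned host
  end do

  print *, X(1), Y(1)

  deallocate(X_dev, Y_dev, X, Y)
end program dtoh_copy_sketch
```

Something like this should build with a CUDA Fortran compiler (for example `pgfortran -Mcuda` or `nvfortran -cuda`). Profiling it, then swapping the `pinned` attribute between X and Y, would show whether the difference in reported bytes follows the array section or the host memory type.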