Different total amount of data transfer from Device to Host

I have a loop containing two statements that copy data from device to host. Each statement copies the same amount of data. However, the first copies a single column of a 2D array, and the second copies a whole vector.

X(row_size)
X_dev(row_size, column_size)
X = X_dev(:,1)

Y(row_size)
Y_dev(row_size)
Y = Y_dev

I did this in CUDA Fortran (column-major storage). Both statements produce the same number of calls, yet the total “mem transfer size (bytes)” reported by cudaprof differs: the column copy shows exactly three times as many bytes as the vector copy.
Method | #calls | mem transfer size (bytes)
memcpyDtoH | 55800 | 242688 (for X = X_dev(:,1))
memcpyDtoH | 55800 | 80896 (for Y = Y_dev)

Could someone give me a reasonable explanation? One important detail: Y is in page-locked (pinned) host memory, while X is in pageable host memory.
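
For anyone who wants to reproduce this, the setup can be sketched roughly as follows. This is a minimal sketch, not my actual code: the element type (real), the values of row_size and column_size, the loop count, and the program name are assumptions chosen only for illustration.

```fortran
program dtoh_compare
  use cudafor
  implicit none

  ! Hypothetical sizes and iteration count, for illustration only.
  integer, parameter :: row_size = 1024, column_size = 3
  integer, parameter :: n_iter = 100

  real, allocatable         :: X(:)         ! pageable host memory
  real, allocatable, pinned :: Y(:)         ! page-locked (pinned) host memory
  real, allocatable, device :: X_dev(:,:)   ! 2D device array
  real, allocatable, device :: Y_dev(:)     ! 1D device array
  integer :: i

  allocate(X(row_size), Y(row_size))
  allocate(X_dev(row_size, column_size), Y_dev(row_size))

  ! Fill the device arrays so the copies transfer defined values.
  X_dev = 1.0
  Y_dev = 2.0

  do i = 1, n_iter
     X = X_dev(:,1)   ! first column of the 2D device array -> pageable host vector
     Y = Y_dev        ! whole device vector -> pinned host vector
  end do

  print *, X(1), Y(1)

  deallocate(X, Y, X_dev, Y_dev)
end program dtoh_compare
```

Both assignments move row_size elements per iteration, so I would expect the profiler to report the same byte count for each.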

Thanks,
Tuan
