cudaMemcpyDeviceToHost 200x slower than cudaMemcpyHostToDevice?

I am new to CUDA programming (Quadro FX 5800, 4 GB, Windows 7 64-bit).
I am trying out matrix multiplication. It works well, BUT

when analysing performance with Nsight, I noticed that copying from device to host takes about 200x longer than copying from host to device!
Is this normal?


Nsight report:
cuMemcpyDtoH_v2  1 738 671 µs  for a 2000 x 2000 matrix
cuMemcpyHtoD_v2      9 317 µs  for a 2000 x 2000 matrix

unsigned int M_DIM_LIG=2000;
unsigned int M_DIM_COL=2000;
unsigned int P_DIM_LIG=M_DIM_LIG;
unsigned int P_DIM_COL=M_DIM_COL;

float *matM = new float[M_DIM_LIG * M_DIM_COL];
float *matPGPU = new float[P_DIM_LIG * P_DIM_COL];

cudaMalloc((void **) &devM, M_DIM_COL * M_DIM_LIG * sizeof(float));
cudaMalloc((void **) &devN, N_DIM_COL * N_DIM_LIG * sizeof(float));
cudaMalloc((void **) &devP, P_DIM_COL * P_DIM_LIG * sizeof(float));

cudaMemcpy(devM, matM, M_DIM_COL * M_DIM_LIG * sizeof(float), cudaMemcpyHostToDevice); // 9 317 µs

cudaMemcpy(matPGPU, devP, P_DIM_COL * P_DIM_LIG * sizeof(float), cudaMemcpyDeviceToHost); // 1 738 671 µs

The time reported for the device-to-host copy probably includes the time for the actual computation on the device. Kernel launches are asynchronous, but cudaMemcpy() is synchronous, so it has to wait for all previously launched kernels (in the same stream) to finish before the copy can start.

Insert a cudaStreamSynchronize(0) before the cudaMemcpy() so that only the copy operation itself is measured.
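
To illustrate the point, here is a minimal sketch of how the kernel time and the copy time could be measured separately with CUDA events. The kernel matMulNaive, the struct Timings, the function timeKernelAndCopy and the CUDA_CHECK macro are all hypothetical names made up for this example; the original poster's kernel was not shown, so a naive stand-in is used, and the input matrix is simply filled with ones.

```cuda
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

// Abort with a message if a CUDA runtime call fails (hypothetical helper).
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            std::fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,   \
                         cudaGetErrorString(err_));                   \
            std::exit(EXIT_FAILURE);                                  \
        }                                                             \
    } while (0)

// Stand-in for the real kernel (assumption): naive P = M * M, n x n matrices.
__global__ void matMulNaive(const float *M, float *P, int n)
{
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < n && col < n) {
        float acc = 0.0f;
        for (int k = 0; k < n; ++k)
            acc += M[row * n + k] * M[k * n + col];
        P[row * n + col] = acc;
    }
}

struct Timings {
    float kernelMs;  // from kernel launch until the kernel has finished
    float copyMs;    // device-to-host copy only
    float sample;    // P[0], returned so the result can be sanity-checked
};

Timings timeKernelAndCopy(int n)
{
    size_t count = (size_t)n * n;
    size_t bytes = count * sizeof(float);
    std::vector<float> hostM(count, 1.0f), hostP(count, 0.0f);

    float *devM = NULL, *devP = NULL;
    CUDA_CHECK(cudaMalloc((void **)&devM, bytes));
    CUDA_CHECK(cudaMalloc((void **)&devP, bytes));
    CUDA_CHECK(cudaMemcpy(devM, hostM.data(), bytes, cudaMemcpyHostToDevice));

    cudaEvent_t start, kernelDone, copyDone;
    CUDA_CHECK(cudaEventCreate(&start));
    CUDA_CHECK(cudaEventCreate(&kernelDone));
    CUDA_CHECK(cudaEventCreate(&copyDone));

    dim3 block(16, 16);
    dim3 grid((n + block.x - 1) / block.x, (n + block.y - 1) / block.y);

    CUDA_CHECK(cudaEventRecord(start, 0));
    matMulNaive<<<grid, block>>>(devM, devP, n);   // returns immediately
    CUDA_CHECK(cudaGetLastError());
    CUDA_CHECK(cudaEventRecord(kernelDone, 0));

    // Wait here, so a profiler attributes the kernel time to this call
    // and not to the cudaMemcpy() that follows.
    CUDA_CHECK(cudaStreamSynchronize(0));

    CUDA_CHECK(cudaMemcpy(hostP.data(), devP, bytes, cudaMemcpyDeviceToHost));
    CUDA_CHECK(cudaEventRecord(copyDone, 0));
    CUDA_CHECK(cudaEventSynchronize(copyDone));

    Timings t;
    CUDA_CHECK(cudaEventElapsedTime(&t.kernelMs, start, kernelDone));
    CUDA_CHECK(cudaEventElapsedTime(&t.copyMs, kernelDone, copyDone));
    t.sample = hostP[0];

    CUDA_CHECK(cudaEventDestroy(start));
    CUDA_CHECK(cudaEventDestroy(kernelDone));
    CUDA_CHECK(cudaEventDestroy(copyDone));
    CUDA_CHECK(cudaFree(devM));
    CUDA_CHECK(cudaFree(devP));
    return t;
}
```

Calling timeKernelAndCopy(2000) and printing kernelMs and copyMs should show the long time sitting in the kernel and not in the copy, although the actual numbers will depend on the GPU and on the kernel used.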

As advised, I inserted a cudaStreamSynchronize(0) before the cudaMemcpy().

The results are as expected:
cuMemcpyHtoD_v2          8 074 µs  for a 2000 x 2000 matrix
cuStreamSynchronize  1 730 306 µs  (calculations on device)
cuMemcpyDtoH_v2          8 361 µs  for a 2000 x 2000 matrix