cudaMemcpyDeviceToHost 200 x longer than cudaMemcpyHostToDevice ?

vincente · November 24, 2011, 8:55am

Hi,
I am a newbie in CUDA programming (on a Quadro FX 5800, 4 GB, Win7 64-bit).
I am trying matrix multiplication. It works well, BUT

when analysing performance with Nsight, I noticed that copying from device to host took about 200x longer than copying from host to device!
Is that normal?

Vincent

Nsight report:
cuMemcpyDtoH_v2  1 738 671 µs  for a 2000 x 2000 matrix
cuMemcpyHtoD_v2      9 317 µs  for a 2000 x 2000 matrix

code:
unsigned int M_DIM_LIG=2000;
unsigned int M_DIM_COL=2000;
unsigned int P_DIM_LIG=M_DIM_LIG;
unsigned int P_DIM_COL=M_DIM_COL;
…
float *matM = new float[M_DIM_LIG * M_DIM_COL];
float *matPGPU = new float[P_DIM_LIG * P_DIM_COL];
…
cudaMalloc((void **) &devM, M_DIM_COL * M_DIM_LIG * sizeof(float));
cudaMalloc((void **) &devN, N_DIM_COL * N_DIM_LIG * sizeof(float));
cudaMalloc((void **) &devP, P_DIM_COL * P_DIM_LIG * sizeof(float));
…
cudaMemcpy(devM, matM, M_DIM_COL * M_DIM_LIG * sizeof(float), cudaMemcpyHostToDevice); // 9 317 µs
…
cudaMemcpy(matPGPU, devP, P_DIM_COL * P_DIM_LIG * sizeof(float), cudaMemcpyDeviceToHost); // 1 738 671 µs

tera · November 25, 2011, 3:10am

The time for device to host copy probably includes the time for the actual calculations on the device. As kernels are executed asynchronously, but cudaMemcpy() is synchronous, it has to wait for all previously launched kernels (of the same stream) to finish.

Insert a cudaStreamSynchronize(0) before the cudaMemcpy() to only measure the time for the copy operation itself.
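
For anyone who wants to separate the two costs without a profiler, here is a minimal sketch using CUDA events. The kernel `matMulNaive`, the helper `timeKernelAndCopy`, and the all-ones input matrices are hypothetical stand-ins, since the original kernel is not shown in this thread; the host buffers are assumed to be ordinary pageable memory.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Return false from the enclosing function on any CUDA error.
// (Sketch only: resources are not released on the error path.)
#define CUDA_TRY(call) do { if ((call) != cudaSuccess) return false; } while (0)

// Hypothetical naive kernel: P = M * N for square n x n row-major matrices.
__global__ void matMulNaive(const float *M, const float *N, float *P, int n)
{
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < n && col < n) {
        float acc = 0.0f;
        for (int k = 0; k < n; ++k)
            acc += M[row * n + k] * N[k * n + col];
        P[row * n + col] = acc;
    }
}

// Measures kernel time and device-to-host copy time separately (in ms).
bool timeKernelAndCopy(int n, float *kernelMs, float *copyMs)
{
    size_t count = (size_t)n * n;
    size_t bytes = count * sizeof(float);
    std::vector<float> hostM(count, 1.0f), hostN(count, 1.0f), hostP(count, 0.0f);

    float *devM = nullptr, *devN = nullptr, *devP = nullptr;
    CUDA_TRY(cudaMalloc((void **)&devM, bytes));
    CUDA_TRY(cudaMalloc((void **)&devN, bytes));
    CUDA_TRY(cudaMalloc((void **)&devP, bytes));
    CUDA_TRY(cudaMemcpy(devM, hostM.data(), bytes, cudaMemcpyHostToDevice));
    CUDA_TRY(cudaMemcpy(devN, hostN.data(), bytes, cudaMemcpyHostToDevice));

    cudaEvent_t start, afterKernel, stop;
    CUDA_TRY(cudaEventCreate(&start));
    CUDA_TRY(cudaEventCreate(&afterKernel));
    CUDA_TRY(cudaEventCreate(&stop));

    dim3 block(16, 16);
    dim3 grid((n + block.x - 1) / block.x, (n + block.y - 1) / block.y);

    CUDA_TRY(cudaEventRecord(start, 0));
    matMulNaive<<<grid, block>>>(devM, devN, devP, n);  // returns immediately
    CUDA_TRY(cudaGetLastError());
    CUDA_TRY(cudaEventRecord(afterKernel, 0));

    // Wait here, so the kernel time is not charged to the copy below.
    CUDA_TRY(cudaStreamSynchronize(0));

    CUDA_TRY(cudaMemcpy(hostP.data(), devP, bytes, cudaMemcpyDeviceToHost));
    CUDA_TRY(cudaEventRecord(stop, 0));
    CUDA_TRY(cudaEventSynchronize(stop));

    CUDA_TRY(cudaEventElapsedTime(kernelMs, start, afterKernel));
    CUDA_TRY(cudaEventElapsedTime(copyMs, afterKernel, stop));

    CUDA_TRY(cudaEventDestroy(start));
    CUDA_TRY(cudaEventDestroy(afterKernel));
    CUDA_TRY(cudaEventDestroy(stop));
    CUDA_TRY(cudaFree(devM));
    CUDA_TRY(cudaFree(devN));
    CUDA_TRY(cudaFree(devP));

    // With all-ones inputs every element of P should equal n.
    return hostP[0] == (float)n;
}
```

Calling `timeKernelAndCopy(2000, &kernelMs, &copyMs)` from `main` and printing both values should show the kernel taking most of the time, with the copy back being a small part of it. Without the synchronize, a host-side timer or profiler will charge the wait for the kernel to the `cudaMemcpy()` call.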

vincente · November 25, 2011, 8:20am

As advised, I inserted a cudaStreamSynchronize(0) before the cudaMemcpy().

The results are as expected:
cuMemcpyHtoD_v2 8074 µs for a 2000 x 2000 matrix
cuStreamSynchronize 1730306 µs (calculations on device)
cuMemcpyDtoH_v2 8361 µs for a 2000 x 2000 matrix

THANKS A LOT, tera.

Vincent
