Copy back to host takes much longer than copy to device, why?

Hi, I have written a small kernel: I copy the data to the device, run the kernel, and then copy the results from the device back to the host. My code looks like this (simplified):

// copy to device
cudaMemcpy(d_values, values, size_values, cudaMemcpyHostToDevice);
cudaMemcpy(d_fitness, fitness, size_fitness, cudaMemcpyHostToDevice);

// kernel launch
compute<<<BLOCKNUMBER, THREADSIZE>>>(d_fitness, d_values, fRand(0, 100.0));

// copy back to host
cudaMemcpy(fitness, d_fitness, size_fitness, cudaMemcpyDeviceToHost);
cudaMemcpy(values, d_values, size_values, cudaMemcpyDeviceToHost);

If I measure the time with all copy operations, I get 1.99143119 seconds.
If I measure the time without the copy back operations, I get 0.617714564 seconds.
If I measure the time without the copy to operations but with the copy back operations, I get 1.49729565 seconds.

So the copy back to host operations take much longer than all the other operations combined, but why? I copy the same amount of data back as I copied TO the device. What is the problem?
Thank you very much in advance for your time and help!

Any ideas? Could this perhaps be a synchronization issue?

When you remove the copy back operations, you are only measuring the time of the first copy operations and the kernel launch, not the kernel's execution, because a kernel launch is asynchronous: the call returns to the host immediately while the kernel runs on the device. You need to call cudaDeviceSynchronize() after launching the kernel to measure the execution time correctly. When you do include the copy back operations, they act as an implicit synchronization point, which is why the kernel's execution time gets attributed to the device-to-host copies.
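A minimal sketch of what a corrected timing could look like, reusing the names from the original post (compute, d_fitness, d_values, BLOCKNUMBER, THREADSIZE, fRand) and using CUDA events, which time work on the device itself; error checking omitted for brevity:

```cuda
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

// Record an event before and after the launch; the launch itself
// returns immediately, so without synchronization a host timer
// would only capture the launch overhead.
cudaEventRecord(start);
compute<<<BLOCKNUMBER, THREADSIZE>>>(d_fitness, d_values, fRand(0, 100.0));
cudaEventRecord(stop);

// Block the host until the stop event (and thus the kernel) completes.
cudaEventSynchronize(stop);

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);  // kernel time in milliseconds
printf("kernel: %f ms\n", ms);

cudaEventDestroy(start);
cudaEventDestroy(stop);
```

Alternatively, a plain cudaDeviceSynchronize() right after the launch, before stopping a host-side timer, achieves the same effect for wall-clock measurements.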

The other option is to use the NVIDIA Visual Profiler or nvprof; they will give you the execution time of each cudaMemcpy and each kernel individually.
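For reference, a minimal nvprof invocation (assuming the executable is named ./app) could look like:

```shell
# Prints a summary table with the time spent in each kernel and in
# each cudaMemcpy direction (HtoD and DtoH listed separately).
nvprof ./app
```

This makes it easy to see whether the device-to-host copies are really slower or whether the kernel time is simply being charged to them.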

Thank you very much!! You are so right, thanks!!!