Problem with cudaMemcpy

Hi, I’ve been running some tests with large arrays of floating-point values.
I launched my application and tried to measure how long execution takes at different stages, but one result baffles me. My code contains two kernels, and neither kernel runs into any problem. However, when I try to copy the results of the second kernel to a host variable, cudaMemcpy() seems to take forever. If I substitute an array with a very small number of elements, there is no problem and I get my desired results.
So,
Why does the call to cudaMemcpy behave like this? Are there any limits or problems, or am I missing something?

You might be using host-based timing methods rather than cudaEvent-based timing, and getting confused because kernel launches are asynchronous.

If I have code like this:

kernel<<<…>>>(…);
cudaMemcpy(…);

The kernel launch returns immediately to the host code, before the kernel has completed execution. So if you attempt to time things like this:

gettimeofday(t1,…);
kernel<<<…>>>(…);
gettimeofday(t2,…);
cudaMemcpy(…);
gettimeofday(t3,…);

The t2-t1 time will always be short, regardless of the kernel execution time, because it only measures the overhead of launching the kernel.
The t3-t2 time will end up showing (for the most part) the time required to execute the kernel plus the time to copy the data, because cudaMemcpy blocks until the previous kernel activity is complete and only then executes the copy operation. That is why the copy appears to take forever with large arrays: most of what you are measuring is the kernel’s execution time, not the copy itself.

You could get more sensible results by inserting a cudaDeviceSynchronize() immediately after the kernel call (before the t2 timing step), or else by using the cudaEvent system for timing, as in the sketch below.
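
Here is a minimal sketch of the cudaEvent approach. The kernel, array size, and launch configuration are placeholders standing in for your actual code:

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Placeholder kernel: doubles each element.
__global__ void kernel(float *d, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;
}

int main()
{
    const int n = 1 << 24;                       // placeholder array size
    float *h = (float *)calloc(n, sizeof(float));
    float *d;
    cudaMalloc(&d, n * sizeof(float));
    cudaMemcpy(d, h, n * sizeof(float), cudaMemcpyHostToDevice);

    cudaEvent_t start, afterKernel, afterCopy;
    cudaEventCreate(&start);
    cudaEventCreate(&afterKernel);
    cudaEventCreate(&afterCopy);

    cudaEventRecord(start);
    kernel<<<(n + 255) / 256, 256>>>(d, n);
    cudaEventRecord(afterKernel);                // marks the end of the kernel
    cudaMemcpy(h, d, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaEventRecord(afterCopy);                  // marks the end of the copy
    cudaEventSynchronize(afterCopy);             // wait for all recorded work

    float kernelMs, copyMs;
    cudaEventElapsedTime(&kernelMs, start, afterKernel);
    cudaEventElapsedTime(&copyMs, afterKernel, afterCopy);
    printf("kernel: %.3f ms, copy: %.3f ms\n", kernelMs, copyMs);

    cudaFree(d);
    free(h);
    return 0;
}

Measured this way, both intervals come from the device timeline, so the kernel and the copy are timed separately, and the copy time should scale with the array size rather than absorbing the kernel’s runtime.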