Problem with cudaMemcpy

Hi, I’ve been running some tests with large arrays of floating-point values.
I launched my application and tried to measure how long each stage of execution takes. However, one result baffles me. My code contains two kernels, and neither of them shows any problem. But when I copy the results of the second kernel to a host variable, cudaMemcpy() seems to take forever. If I substitute an array with a very small number of elements, there is no problem and I get the results I expect.
Why does the call to cudaMemcpy behave like this? Are there any limits or known problems, or am I missing something?

You might be using host-based timing methods rather than cudaEvent(s) for timing, and getting confused because kernel launches are asynchronous.

If I have code like this:
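For illustration, here is a minimal sketch of that kind of code. The kernel `scale`, the helper `run_scale`, and the array size are hypothetical names and values chosen for this example, not taken from the question, and error checking is omitted for brevity.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel for illustration: doubles every element.
__global__ void scale(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

// Hypothetical helper: runs the kernel on n ones and returns the first result.
float run_scale(int n)
{
    const size_t bytes = static_cast<size_t>(n) * sizeof(float);
    std::vector<float> h(n, 1.0f);
    float *d = nullptr;
    cudaMalloc(&d, bytes);
    cudaMemcpy(d, h.data(), bytes, cudaMemcpyHostToDevice);

    // The launch only queues the kernel; control returns to the host right away.
    scale<<<(n + 255) / 256, 256>>>(d, n);

    // This copy waits for the kernel above to finish, then transfers the data.
    cudaMemcpy(h.data(), d, bytes, cudaMemcpyDeviceToHost);

    cudaFree(d);
    return h[0];
}

int main()
{
    printf("h[0] = %f\n", run_scale(1 << 24));
    return 0;
}
```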


The kernel launch returns immediately to the host code, before the kernel has completed execution. So if you attempt to time things like this:
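A sketch of that naive timing pattern, assuming std::chrono as the host timer. The kernel `slow_scale`, the helper `time_with_host_clock`, and the iteration count are hypothetical and only there to make the kernel run long enough to notice; error checking is omitted.

```cuda
#include <chrono>
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel: does some busy work per element so it takes a while.
__global__ void slow_scale(float *data, int n, int iters)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float x = data[i];
        for (int k = 0; k < iters; ++k) x = x * 1.0000001f + 1e-7f;
        data[i] = x;
    }
}

struct HostTimings { double launch_ms; double copy_ms; };

// Times the launch and the copy with a host clock and no synchronization.
HostTimings time_with_host_clock(int n, int iters)
{
    using clock = std::chrono::steady_clock;
    using ms = std::chrono::duration<double, std::milli>;
    const size_t bytes = static_cast<size_t>(n) * sizeof(float);
    std::vector<float> h(n, 1.0f);
    float *d = nullptr;
    cudaMalloc(&d, bytes);
    cudaMemcpy(d, h.data(), bytes, cudaMemcpyHostToDevice);

    auto t1 = clock::now();
    slow_scale<<<(n + 255) / 256, 256>>>(d, n, iters);   // asynchronous launch
    auto t2 = clock::now();   // the kernel is most likely still running here
    cudaMemcpy(h.data(), d, bytes, cudaMemcpyDeviceToHost);
    auto t3 = clock::now();   // kernel and copy have both finished by now

    cudaFree(d);
    return {ms(t2 - t1).count(), ms(t3 - t2).count()};
}

int main()
{
    HostTimings t = time_with_host_clock(1 << 24, 2000);
    printf("t2-t1 (launch only):   %.3f ms\n", t.launch_ms);
    printf("t3-t2 (kernel + copy): %.3f ms\n", t.copy_ms);
    return 0;
}
```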


The t2-t1 time will always be short, regardless of the kernel execution time, because it only measures the overhead of launching the kernel.
The t3-t2 time will then show (for the most part) the kernel execution time plus the time to copy the data, because cudaMemcpy blocks until the preceding kernel activity is complete and only then performs the copy.

You could get more sensible results by inserting a cudaDeviceSynchronize() immediately after the kernel call (before the t2 timing step), or by using the cudaEvent API for timing instead.
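Both suggestions can be combined in one sketch. As before, the kernel `slow_scale`, the helper `time_with_sync_and_events`, and the sizes are hypothetical, and error checking is omitted. With the synchronize in place, t2-t1 should cover the kernel and t3-t2 only the copy, while the event pair gives the kernel time as measured on the device.

```cuda
#include <chrono>
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel: does some busy work per element so it takes a while.
__global__ void slow_scale(float *data, int n, int iters)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float x = data[i];
        for (int k = 0; k < iters; ++k) x = x * 1.0000001f + 1e-7f;
        data[i] = x;
    }
}

struct Timings { double kernel_host_ms; double copy_host_ms; float kernel_event_ms; };

Timings time_with_sync_and_events(int n, int iters)
{
    using clock = std::chrono::steady_clock;
    using ms = std::chrono::duration<double, std::milli>;
    const size_t bytes = static_cast<size_t>(n) * sizeof(float);
    std::vector<float> h(n, 1.0f);
    float *d = nullptr;
    cudaMalloc(&d, bytes);
    cudaMemcpy(d, h.data(), bytes, cudaMemcpyHostToDevice);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    auto t1 = clock::now();
    cudaEventRecord(start);                      // marker queued before the kernel
    slow_scale<<<(n + 255) / 256, 256>>>(d, n, iters);
    cudaEventRecord(stop);                       // marker queued after the kernel
    cudaDeviceSynchronize();                     // host waits here until the kernel is done
    auto t2 = clock::now();
    cudaMemcpy(h.data(), d, bytes, cudaMemcpyDeviceToHost);
    auto t3 = clock::now();

    // Both events have completed, since the device was synchronized above.
    float kernel_event_ms = 0.0f;
    cudaEventElapsedTime(&kernel_event_ms, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d);
    return {ms(t2 - t1).count(), ms(t3 - t2).count(), kernel_event_ms};
}

int main()
{
    Timings t = time_with_sync_and_events(1 << 24, 2000);
    printf("t2-t1 (kernel, host clock): %.3f ms\n", t.kernel_host_ms);
    printf("t3-t2 (copy only):          %.3f ms\n", t.copy_host_ms);
    printf("kernel (cudaEvent):         %.3f ms\n", t.kernel_event_ms);
    return 0;
}
```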