cudaMemcpy during kernel execution: asynchronous kernel launch

Hi all,

I tried to copy data from host to device immediately after launching a kernel on the device, to test the asynchronous kernel launch functionality.

Here is the code:



// cudaThreadSynchronize();



I did one run with and one without “cudaThreadSynchronize()”. I hoped the total running time of my program would decrease when I commented “cudaThreadSynchronize()” out, but the run time did not change.

I assume that “cudaMemcpy()” does not overlap with asynchronous kernel launches. Am I right? I would only save time if I computed other things on the host immediately after an asynchronous kernel launch, right?

Thanks in advance for your answers.

Best regards,


There won’t be any change in time: cudaMemcpy synchronizes with the GPU, so it does not start copying until the kernel has finished. The call to cudaThreadSynchronize is therefore redundant.
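To make this concrete, here is a minimal sketch of the measurement, not the original poster's code. The names busyKernel and timeLaunchThenCopy are hypothetical, the buffer size is arbitrary, and it assumes the default (legacy) stream and a toolkit that provides cudaDeviceSynchronize (the newer name for cudaThreadSynchronize). The blocking cudaMemcpy has to wait for the kernel either way, so the two timings should come out roughly the same.

```cuda
#include <chrono>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical stand-in for the poster's kernel: some arithmetic per element.
__global__ void busyKernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v = data[i];
        for (int k = 0; k < 10000; ++k)
            v = v * 1.0001f + 0.5f;
        data[i] = v;
    }
}

// Launches the kernel, optionally synchronizes explicitly, then performs a
// blocking host-to-device cudaMemcpy. Returns the elapsed host wall-clock
// time in milliseconds, or -1.0 if a CUDA call fails.
double timeLaunchThenCopy(bool explicitSync)
{
    const int n = 1 << 20;                 // arbitrary size for illustration
    const size_t bytes = n * sizeof(float);
    std::vector<float> host(n, 1.0f);

    float *dWork = nullptr;
    float *dDst = nullptr;
    if (cudaMalloc(&dWork, bytes) != cudaSuccess) return -1.0;
    if (cudaMalloc(&dDst, bytes) != cudaSuccess) { cudaFree(dWork); return -1.0; }
    cudaMemset(dWork, 0, bytes);
    cudaDeviceSynchronize();               // start timing from an idle GPU

    auto t0 = std::chrono::steady_clock::now();

    // The launch itself returns to the host right away.
    busyKernel<<<(n + 255) / 256, 256>>>(dWork, n);
    cudaError_t err = cudaGetLastError();

    if (explicitSync)
        cudaDeviceSynchronize();           // the call under test

    // Blocking copy in the default stream: it waits for the kernel first.
    if (err == cudaSuccess)
        err = cudaMemcpy(dDst, host.data(), bytes, cudaMemcpyHostToDevice);

    auto t1 = std::chrono::steady_clock::now();

    cudaFree(dWork);
    cudaFree(dDst);
    if (err != cudaSuccess) return -1.0;
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}
```

Calling timeLaunchThenCopy(true) and timeLaunchThenCopy(false) from main and printing both values reproduces the experiment described above.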

As mentioned in other posts, the current hardware doesn’t support cudaMemcpy and kernel execution at the same time.
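Whether a given device can overlap copies with kernel execution can be queried at run time. The following is a small sketch, assuming a toolkit recent enough to provide cudaDeviceGetAttribute and cudaDevAttrAsyncEngineCount; the helper name copyEngineCount is hypothetical. A result of 0 means the device cannot copy while a kernel is running.

```cuda
#include <cuda_runtime.h>

// Returns the number of asynchronous copy engines on the given device:
// 0 means copies cannot overlap kernel execution, 1 or more means they can.
// Returns -1 if the query fails (for example, when no device is present).
int copyEngineCount(int device)
{
    int count = 0;
    cudaError_t err =
        cudaDeviceGetAttribute(&count, cudaDevAttrAsyncEngineCount, device);
    return (err == cudaSuccess) ? count : -1;
}
```

Even on a device that reports a nonzero count, a plain cudaMemcpy still blocks as described above, so this check only tells you what the hardware is capable of.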

Memory transactions, GL/DX resource registration, and a few other CUDA calls are blocking (Programming Guide, section