cudaMemcpy during kernel execution asynchronous kernel launch

I tried to copy data from host to device immediately after launching a kernel on the device, to test asynchronous kernel launches.

// cudaThreadSynchronize();



I did one run with and one without “cudaThreadSynchronize()”. I hoped that commenting out “cudaThreadSynchronize()” would reduce the total run time of my program, but the run time did not change.

I assume that “cudaMemcpy()” does not overlap with asynchronous kernel launches. Am I right? I would only save time if I computed other things on the host immediately after an asynchronous kernel launch, right?

There won’t be any change in time: cudaMemcpy synchronizes with the GPU, so the copy does not run until the previously launched kernel has finished. The call to cudaThreadSynchronize is therefore redundant.
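A minimal sketch of the behaviour described above, not the original poster's code. The kernel `busyKernel`, the helper `runDemo`, and the buffer sizes are made up for illustration. It times the kernel launch and the following cudaMemcpy separately: the launch should return to the host almost immediately, while the copy is expected to return only after the kernel is done. New code would normally use cudaDeviceSynchronize, since cudaThreadSynchronize is deprecated.

```cuda
#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical kernel: burns time so the launch/copy ordering becomes visible.
__global__ void busyKernel(float *data, int n, int iters)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v = data[i];
        for (int k = 0; k < iters; ++k)
            v = v * 1.0000001f + 1.0f;
        data[i] = v;
    }
}

// Milliseconds elapsed on the host since t0.
static double msSince(std::chrono::steady_clock::time_point t0)
{
    auto dt = std::chrono::steady_clock::now() - t0;
    return std::chrono::duration<double, std::milli>(dt).count();
}

// Returns 0 if every CUDA call succeeded, 1 otherwise.
int runDemo()
{
    const int n = 1 << 20;                 // assumed problem size
    const size_t bytes = n * sizeof(float);
    float *hSrc = (float *)calloc(n, sizeof(float));
    float *dWork = nullptr, *dDst = nullptr;
    if (!hSrc) return 1;
    if (cudaMalloc(&dWork, bytes) != cudaSuccess ||
        cudaMalloc(&dDst, bytes) != cudaSuccess ||
        cudaMemset(dWork, 0, bytes) != cudaSuccess ||
        cudaDeviceSynchronize() != cudaSuccess) {
        cudaFree(dWork); cudaFree(dDst); free(hSrc);
        return 1;
    }

    auto t0 = std::chrono::steady_clock::now();
    busyKernel<<<(n + 255) / 256, 256>>>(dWork, n, 10000);
    double launchMs = msSince(t0);         // control returns before the kernel finishes

    // Independent host-side work placed here would overlap with the kernel.

    // Issued in the default stream, so it is ordered after busyKernel.
    cudaError_t err = cudaMemcpy(dDst, hSrc, bytes, cudaMemcpyHostToDevice);
    double totalMs = msSince(t0);          // includes waiting for the kernel

    printf("launch returned after %.3f ms, copy returned after %.3f ms\n",
           launchMs, totalMs);

    if (err == cudaSuccess) err = cudaGetLastError();
    cudaFree(dWork); cudaFree(dDst); free(hSrc);
    return err == cudaSuccess ? 0 : 1;
}

int main()
{
    return runDemo();
}
```

Adding an explicit synchronize call between the launch and the cudaMemcpy should leave the second number essentially unchanged, which matches the observation in the question.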

As mentioned in other posts, the current hardware doesn’t support cudaMemcpy and kernel execution at the same time.
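Whether a particular device can overlap copies with kernel execution can be checked at run time. The sketch below is an illustration only; the helper name `copyEngines` is made up, and it reads the `asyncEngineCount` field of `cudaDeviceProp`, which is 0 when the device cannot overlap a copy with a kernel. Even on devices that report a nonzero value, overlap also needs cudaMemcpyAsync, page-locked host memory, and a non-default stream; a plain cudaMemcpy stays blocking.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: number of copy engines on `device`, or -1 on error.
// 0 means copies cannot run concurrently with kernels on that device.
int copyEngines(int device)
{
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, device) != cudaSuccess)
        return -1;
    return prop.asyncEngineCount;
}

int main()
{
    int engines = copyEngines(0);          // device 0 is an assumption
    if (engines < 0) {
        printf("could not query device 0\n");
        return 1;
    }
    printf("device 0 reports %d async copy engine(s)\n", engines);
    return 0;
}
```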

Memory transactions, GL/DX resource registration, and a few other CUDA calls are blocking (Programming Guide, section