I tried to copy data from host to device immediately after launching a kernel on the device, to test the asynchronous kernel launch functionality.
Here is the code:
...
myKernel<<<...>>>(...);
// cudaThreadSynchronize();
cudaMemcpy(....);
...
I did one run with “cudaThreadSynchronize()” and one without it. I expected the total running time of my program to decrease with the call commented out, but the run time did not change.
I assume that “cudaMemcpy()” does not overlap with an asynchronously launched kernel. Am I right? I would only save time if I computed other things on the host immediately after an asynchronous kernel launch, right?
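To make the second question concrete, here is a minimal sketch of what I mean by computing on the host right after the launch. It is not my real code: the kernel body, the array size, and the hostWork() function are placeholders I made up for illustration, and I use cudaDeviceSynchronize(), the newer name for cudaThreadSynchronize().

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Placeholder kernel (hypothetical): keeps the device busy for a while.
__global__ void myKernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float x = data[i];
        for (int k = 0; k < 1000; ++k)
            x = x * 0.999f + 0.001f;
        data[i] = x;
    }
}

// Placeholder host computation (hypothetical); it does not touch kernel data.
double hostWork(const std::vector<float> &v)
{
    double sum = 0.0;
    for (float x : v)
        sum += x;
    return sum;
}

int main()
{
    const int n = 1 << 20;
    std::vector<float> hostIn(n, 1.0f), hostOther(n, 2.0f);

    float *devA = nullptr, *devB = nullptr;
    cudaMalloc(&devA, n * sizeof(float));
    cudaMalloc(&devB, n * sizeof(float));
    cudaMemcpy(devA, hostIn.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    // The launch returns control to the host without waiting for the kernel.
    myKernel<<<(n + 255) / 256, 256>>>(devA, n);

    // Host work placed here can run while the kernel is still executing.
    double sum = hostWork(hostOther);

    // Blocking copy, issued in the default stream after the kernel.
    cudaMemcpy(devB, hostOther.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    // Wait for all device work and report any error.
    cudaError_t err = cudaDeviceSynchronize();
    printf("host sum = %.1f, status = %s\n", sum, cudaGetErrorString(err));

    cudaFree(devA);
    cudaFree(devB);
    return 0;
}
```

If my assumption is correct, only the hostWork() call between the launch and the copy would overlap with the kernel, and adding or removing the explicit synchronization before “cudaMemcpy()” would make no measurable difference.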
Thanks in advance for your answers.