A question about kernel execution


My question is quite simple, but I don’t have the answer. The NVIDIA CUDA Programming Guide says that kernel launches are asynchronous, meaning that a launch returns control to the host immediately. I want to know whether there is any guarantee that one kernel does not overlap with the execution of another, or whether I need to call cudaThreadSynchronize() between them to synchronize execution.


I have another question on the same topic. Typically, after a kernel has run, we need to copy data back from the GPU to the CPU. Is there really no need to call cudaThreadSynchronize() before calling cudaMemcpy()?
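
To make both questions concrete, here is a minimal sketch of the pattern I have in mind. The kernel names (stepA, stepB), the array size and the launch configuration are made up for illustration and are not from any real project; both kernels are launched without an explicit stream. The comments mark the two places where I am unsure whether a synchronization call is required.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernels, for illustration only.
__global__ void stepA(float *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] = 1.0f;
}

__global__ void stepB(float *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] += 1.0f;   // depends on what stepA wrote
}

int main() {
    const int n = 1024;
    float h[n];
    float *d = NULL;
    cudaMalloc((void **)&d, n * sizeof(float));

    stepA<<<n / 256, 256>>>(d, n);   // the launch returns to the host immediately
    // Question 1: is cudaThreadSynchronize() needed here, before the next launch?
    stepB<<<n / 256, 256>>>(d, n);
    // Question 2: is cudaThreadSynchronize() needed here, before the copy?
    cudaMemcpy(h, d, n * sizeof(float), cudaMemcpyDeviceToHost);

    printf("h[0] = %f\n", h[0]);
    cudaFree(d);
    return 0;
}
```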

Thanks in advance.