I came across the section of the programming guide that says that when you call a __global__ function, control returns to the CPU process before the device has finished. Calling cudaThreadSynchronize() (runtime API) or cuCtxSynchronize() (driver API) waits for execution to finish.
How is this possible? How can my program continue if the results aren't ready yet? Or is execution of all threads guaranteed to have finished, with only some shutdown process still running?
Execution continues after the kernel launch so that you can do other work on the CPU while you're waiting. As soon as you try to read the data back to the CPU, that call has to wait until the kernel has completed.
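Here is a minimal sketch of that behaviour using the runtime API. The kernel add_one and the helper launch_and_read_back are made-up names for illustration, and the example assumes everything runs in the default stream with ordinary (pageable) host memory, where a plain cudaMemcpy back to the host blocks until the preceding kernel is done. Error checking is kept to a minimum.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel for illustration: adds 1.0f to every element.
__global__ void add_one(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

// Hypothetical helper: launches the kernel, does unrelated CPU work,
// then reads the result back. Returns true if every element is 1.0f.
bool launch_and_read_back(int n)
{
    std::vector<float> host(n, 0.0f);
    size_t bytes = static_cast<size_t>(n) * sizeof(float);

    float *dev = nullptr;
    if (cudaMalloc(&dev, bytes) != cudaSuccess) return false;
    cudaMemcpy(dev, host.data(), bytes, cudaMemcpyHostToDevice);

    int threads = 256;
    int blocks = (n + threads - 1) / threads;

    // The launch only queues the kernel; control returns to the CPU
    // right away, possibly before the kernel has even started.
    add_one<<<blocks, threads>>>(dev, n);

    // Unrelated CPU work can overlap with the kernel here.
    long long cpu_sum = 0;
    for (int i = 0; i < n; ++i) cpu_sum += i;
    std::printf("CPU work done while the GPU was busy: %lld\n", cpu_sum);

    // This blocking copy is issued in the same (default) stream as the
    // kernel, so it does not start until the kernel has completed.
    cudaError_t err =
        cudaMemcpy(host.data(), dev, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dev);
    if (err != cudaSuccess) return false;

    for (float v : host)
        if (v != 1.0f) return false;
    return true;
}

int main()
{
    bool ok = launch_and_read_back(1 << 20);
    std::printf("%s\n", ok ? "results correct" : "results wrong or CUDA error");
    return ok ? 0 : 1;
}
```

An explicit synchronize call is only needed when you want to wait without copying anything, for example to time the kernel. In newer toolkits cudaDeviceSynchronize() replaces the deprecated cudaThreadSynchronize().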