I came across the section of the programming guide that says that when a __global__ function is called, control returns to the CPU process before the device has finished executing it. Calling cudaThreadSynchronize() (runtime API) or cuCtxSynchronize() (driver API) waits for execution to finish.
How is this possible? How can my program continue if the results aren't ready yet? Or is execution guaranteed to have finished on all threads, with only some shutdown process still running?
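
To make the question concrete, here is a minimal sketch of the situation as I understand it. The kernel name fill, the helper run_fill_last and the sizes are made up for illustration, and error checking is left out. I use cudaDeviceSynchronize() here, which is the newer runtime API name for what cudaThreadSynchronize() does.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: each thread writes its own global index into out.
__global__ void fill(int *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = i;
}

// Hypothetical helper: launches the kernel and returns the last element.
int run_fill_last(int n) {
    int *d_out = nullptr;
    cudaMalloc(&d_out, n * sizeof(int));

    // The launch only queues the kernel; control comes back to the host
    // right away, possibly while the device is still working.
    fill<<<(n + 255) / 256, 256>>>(d_out, n);
    printf("launch returned to the host\n");

    // Block the host until all previously queued device work has finished.
    cudaDeviceSynchronize();

    // Only now copy a result back and look at it.
    int last = -1;
    cudaMemcpy(&last, d_out + (n - 1), sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    return last;
}

int main() {
    printf("last element = %d\n", run_fill_last(1 << 20));
    return 0;
}
```

What I am asking about is the gap between the launch and the synchronize call: the host code keeps running there, and it is not clear to me what state the device results are in at that point.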