A general question on CUDA sync after kernel call

Hello,
I have a general question about CUDA synchronization. Take the sample code 0_Introduction/clock as an example.
Since the kernel “timedReduction” is launched asynchronously, how does the device-to-host copy on the third line know that the kernel has completed and that it is safe to copy the result from the device?
Thank you!

cudaMemcpy(dinput, input, sizeof(float) * NUM_THREADS * 2, cudaMemcpyHostToDevice);
timedReduction<<<NUM_BLOCKS, NUM_THREADS, sizeof(float) * 2 * NUM_THREADS>>>( dinput, doutput, dtimer);
cudaMemcpy(timer, dtimer, sizeof(clock_t) * NUM_BLOCKS * 2, cudaMemcpyDeviceToHost);

cudaMemcpy is a synchronous API function: it blocks the host until the copy has finished, and the copy does not start until the GPU has completed the work previously submitted to the same (default) stream, which includes the kernel. Under the hood, the CUDA driver places each kernel launch into a queue and receives a notification from the GPU as each kernel completes.
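
To make the ordering concrete, here is a minimal sketch, not taken from the clock sample. The kernel fillDoubled and the helper copyAfterKernelSeesResults are hypothetical names made up for illustration. It assumes a CUDA-capable device and that everything runs in the default stream, as in the snippet from the question.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel (not part of the clock sample): out[i] = 2 * i.
__global__ void fillDoubled(int *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = 2 * i;
}

// Returns true if the host sees the kernel's results right after cudaMemcpy,
// without any explicit synchronization call in between.
bool copyAfterKernelSeesResults() {
    const int n = 1024;
    int host[n];
    int *dev = nullptr;
    if (cudaMalloc(&dev, n * sizeof(int)) != cudaSuccess) return false;

    // The launch is asynchronous: control returns to the host immediately.
    fillDoubled<<<(n + 255) / 256, 256>>>(dev, n);

    // Same (default) stream as the kernel, so the copy is ordered after it,
    // and cudaMemcpy blocks the host until the copy itself has finished.
    cudaError_t err = cudaMemcpy(host, dev, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dev);
    if (err != cudaSuccess) return false;

    for (int i = 0; i < n; ++i) {
        if (host[i] != 2 * i) return false;
    }
    return true;
}

int main() {
    printf("%s\n", copyAfterKernelSeesResults() ? "results visible" : "failed");
    return 0;
}
```

No cudaDeviceSynchronize is needed between the launch and the copy in this sketch, because both are issued to the same stream and the copy is a blocking call.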

To get good performance from CUDA-accelerated applications, avoid synchronous API calls where you can; prefer cudaMemcpyAsync, and synchronize explicitly (for example with cudaStreamSynchronize) before the host reads the results.
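
A rough sketch of the asynchronous variant follows. It assumes a CUDA-capable device, pinned host memory allocated with cudaMallocHost, and a single user-created stream. The kernel addOne and the helper asyncRoundTrip are hypothetical names made up for illustration, not code from the clock sample.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel (not part of the clock sample): adds 1 to each element.
__global__ void addOne(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

// Copies data to the GPU, runs the kernel, and copies it back, all queued
// asynchronously in one stream. Returns true if the results are correct.
bool asyncRoundTrip() {
    const int n = 1 << 16;
    const size_t bytes = n * sizeof(float);
    float *host = nullptr;
    float *dev = nullptr;
    cudaStream_t stream;

    // Pinned (page-locked) host memory lets the copies overlap with host work.
    if (cudaMallocHost(&host, bytes) != cudaSuccess) return false;
    if (cudaMalloc(&dev, bytes) != cudaSuccess) { cudaFreeHost(host); return false; }
    if (cudaStreamCreate(&stream) != cudaSuccess) {
        cudaFree(dev); cudaFreeHost(host); return false;
    }
    for (int i = 0; i < n; ++i) host[i] = static_cast<float>(i);

    // All three operations are queued in the same stream, so they execute
    // in issue order on the GPU while the host continues immediately.
    cudaMemcpyAsync(dev, host, bytes, cudaMemcpyHostToDevice, stream);
    addOne<<<(n + 255) / 256, 256, 0, stream>>>(dev, n);
    cudaMemcpyAsync(host, dev, bytes, cudaMemcpyDeviceToHost, stream);

    // ... the host is free to do unrelated work here ...

    // The host must wait for the stream before reading the results.
    bool ok = (cudaStreamSynchronize(stream) == cudaSuccess);
    for (int i = 0; ok && i < n; ++i) {
        if (host[i] != static_cast<float>(i) + 1.0f) ok = false;
    }

    cudaStreamDestroy(stream);
    cudaFree(dev);
    cudaFreeHost(host);
    return ok;
}

int main() {
    printf("%s\n", asyncRoundTrip() ? "async round trip ok" : "failed");
    return 0;
}
```

The difference from the synchronous version is where the wait happens: the host blocks only at cudaStreamSynchronize, so any work placed before that call can overlap with the transfers and the kernel.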

Thank you for your answer!