How do CPU threads know that a GPU kernel has finished?

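A CPU thread doesn’t “know” when a kernel has finished. A kernel launch is asynchronous: it returns control to the CPU thread right away, usually before the kernel has completed. To find out, the thread has to ask the CUDA runtime, using a runtime API function such as cudaDeviceSynchronize, cudaStreamSynchronize, cudaEventSynchronize, or a blocking cudaMemcpy. Any of these calls may force the CPU thread to wait at the call until the relevant GPU work (the whole device, one stream, or one recorded event) has finished. If you want to check without waiting, cudaStreamQuery and cudaEventQuery return immediately and report whether the work is done yet.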
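Here is a minimal sketch of both approaches, polling and blocking. The kernel `busyKernel`, the spin count, and the variable names are made up for illustration; only the `cuda*` runtime calls are real API. Treat it as one possible way to do it, not the only one.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: spins for roughly `cycles` GPU clock ticks, then sets a flag.
__global__ void busyKernel(int *flag, long long cycles) {
    long long start = clock64();
    while (clock64() - start < cycles) { /* keep the GPU busy */ }
    *flag = 1;
}

int main() {
    int *dFlag = nullptr;
    int hFlag = 0;
    cudaStream_t stream;
    cudaEvent_t done;

    if (cudaMalloc(&dFlag, sizeof(int)) != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed\n");
        return 1;
    }
    cudaMemset(dFlag, 0, sizeof(int));
    cudaStreamCreate(&stream);
    cudaEventCreate(&done);

    // The launch is asynchronous: control returns to the CPU thread immediately.
    busyKernel<<<1, 1, 0, stream>>>(dFlag, 100000000LL);  // spin count is arbitrary
    cudaEventRecord(done, stream);  // event completes once the kernel before it completes

    // Option 1: poll without blocking. cudaEventQuery returns cudaErrorNotReady
    // while the work ahead of the event is still running.
    unsigned long polls = 0;
    cudaError_t status;
    while ((status = cudaEventQuery(done)) == cudaErrorNotReady) {
        ++polls;  // the CPU thread is free to do other work here
    }
    if (status != cudaSuccess) {
        fprintf(stderr, "kernel failed: %s\n", cudaGetErrorString(status));
        return 1;
    }

    // Option 2: block until everything queued in the stream has finished.
    // This returns right away here, because the polling loop already saw completion.
    cudaStreamSynchronize(stream);

    // A blocking cudaMemcpy also waits for the copy to complete before returning.
    cudaMemcpy(&hFlag, dFlag, sizeof(int), cudaMemcpyDeviceToHost);
    printf("polled %lu times, flag = %d\n", polls, hFlag);

    cudaEventDestroy(done);
    cudaStreamDestroy(stream);
    cudaFree(dFlag);
    return hFlag == 1 ? 0 : 1;
}
```

In practice you would normally pick one of the two. Polling is useful when the CPU thread has other work to overlap with the kernel; if it has nothing else to do, a plain cudaStreamSynchronize or cudaDeviceSynchronize is simpler.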