Does a kernel launch wait for all threads to finish (terminate) before returning to the host? Or is it possible that the host code after the launch resumes running while there are still threads active in the kernel?
For example, if some threads get themselves into an infinite loop within the kernel (due to race conditions), is it possible that the host code resumes before those threads terminate?
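It does make sense. And the answer is yes: kernel launches are asynchronous. When you fire off a kernel, the launch call returns to the host immediately, and you can chug away with your CPU or let it sleep or whatever while the GPU cooks away. So host code can keep running even while some threads are stuck in an infinite loop; the host only gets stuck itself once it reaches a call that waits on the GPU.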
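To see the asynchronous launch in action, here is a minimal sketch. The kernel `busy_kernel` and the helper `launch_then_wait` are hypothetical names made up for illustration; the spin length is an arbitrary assumption. `cudaDeviceSynchronize()` is used as the simplest way to make the host wait.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: spins for roughly `cycles` GPU clock ticks,
// then sets a flag so the host can check that it really finished.
__global__ void busy_kernel(long long cycles, int *done)
{
    long long start = clock64();
    while (clock64() - start < cycles) { /* burn time on the GPU */ }
    *done = 1;
}

// Launches the kernel, shows that the host continues right away,
// then blocks until the GPU is idle. Returns 0 on success.
int launch_then_wait(int *result)
{
    int *d_done = nullptr;
    if (cudaMalloc(&d_done, sizeof(int)) != cudaSuccess) return 1;
    cudaMemset(d_done, 0, sizeof(int));

    busy_kernel<<<1, 1>>>(100000000LL, d_done);
    // Control is already back on the host; the kernel is most likely
    // still running at this point.
    printf("host: launch returned, doing other work\n");

    // Block the host until all previously issued GPU work has completed.
    cudaError_t err = cudaDeviceSynchronize();
    if (err == cudaSuccess)
        err = cudaMemcpy(result, d_done, sizeof(int), cudaMemcpyDeviceToHost);

    cudaFree(d_done);
    return err == cudaSuccess ? 0 : 1;
}
```

Call `launch_then_wait(&flag)` from your own `main`. If the kernel never terminated, the `cudaDeviceSynchronize()` call is where the host would hang.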
So the question is how the CPU can learn when the kernel has finished. The usual answer is to use the streams feature: record an event in a stream right after the kernel launch, then query that event from the CPU to see whether the GPU has finished.
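A rough sketch of the event approach, assuming a trivial kernel (`fill_kernel`) and a helper (`launch_and_poll`) that are invented for this example. The polling loop is where real CPU work would go.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel: writes `value` into every element of `out`.
__global__ void fill_kernel(int *out, int n, int value)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = value;
}

// Launches the kernel in a stream, records an event behind it, and polls
// the event from the CPU. Returns the number of polls made while the GPU
// was still busy, or -1 on error.
int launch_and_poll(int n)
{
    int *d_out = nullptr;
    cudaStream_t stream;
    cudaEvent_t finished;

    if (cudaMalloc(&d_out, n * sizeof(int)) != cudaSuccess) return -1;
    cudaStreamCreate(&stream);
    cudaEventCreate(&finished);

    fill_kernel<<<(n + 255) / 256, 256, 0, stream>>>(d_out, n, 7);
    // The event completes only after everything queued before it in
    // this stream (here: the kernel) has completed.
    cudaEventRecord(finished, stream);

    int polls = 0;
    cudaError_t status;
    // cudaEventQuery does not block: it returns cudaErrorNotReady while
    // the GPU is still working and cudaSuccess once the event is reached.
    while ((status = cudaEventQuery(finished)) == cudaErrorNotReady) {
        ++polls;  // do useful CPU work here instead of just counting
    }

    cudaEventDestroy(finished);
    cudaStreamDestroy(stream);
    cudaFree(d_out);
    return status == cudaSuccess ? polls : -1;
}
```

If you would rather sleep than poll, `cudaEventSynchronize(finished)` blocks the host until the event has been reached.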
Many CUDA functions will implicitly block for you, such as device memory copies (cudaMemcpy), though there are asynchronous versions of those as well (cudaMemcpyAsync).
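Here is a small sketch contrasting the two kinds of copy. The names `add_one` and `copy_back_both_ways` are hypothetical, and the example assumes the default (legacy) stream behavior. Note that the asynchronous copy uses pinned host memory and still needs an explicit wait before the host reads the result.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel: increments every element by one.
__global__ void add_one(int *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1;
}

// Returns 0 if both copies saw the finished kernel results.
int copy_back_both_ways(int n)
{
    size_t bytes = n * sizeof(int);
    int *d_data = nullptr, *h_pinned = nullptr;
    int *h_plain = new int[n];
    cudaStream_t stream;

    if (cudaMalloc(&d_data, bytes) != cudaSuccess) { delete[] h_plain; return 1; }
    cudaMallocHost(&h_pinned, bytes);   // pinned memory for the async copy
    cudaStreamCreate(&stream);
    cudaMemset(d_data, 0, bytes);

    // 1) Blocking copy: cudaMemcpy is queued behind the kernel and does
    //    not return until the data has arrived on the host.
    add_one<<<(n + 255) / 256, 256>>>(d_data, n);
    cudaMemcpy(h_plain, d_data, bytes, cudaMemcpyDeviceToHost);

    // 2) Asynchronous copy: both calls return immediately, so the host
    //    must wait on the stream before touching h_pinned.
    add_one<<<(n + 255) / 256, 256, 0, stream>>>(d_data, n);
    cudaMemcpyAsync(h_pinned, d_data, bytes, cudaMemcpyDeviceToHost, stream);
    cudaStreamSynchronize(stream);

    // After one kernel the values are 1, after the second they are 2.
    bool ok = h_plain[0] == 1 && h_plain[n - 1] == 1 &&
              h_pinned[0] == 2 && h_pinned[n - 1] == 2;

    cudaStreamDestroy(stream);
    cudaFreeHost(h_pinned);
    cudaFree(d_data);
    delete[] h_plain;
    return ok ? 0 : 1;
}
```

So in the simple case you often get the synchronization for free just by copying the results back with the blocking copy.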
Look in the programming documentation to learn about streams, events, and async mem copies.