I am curious about the process status when a process calls CUDA code. Suppose a process consists of 3 parts: part 1: CPU code; part 2: GPU code; part 3: CPU code. While the process is in the CUDA code, what is the status of the CPU part of the process? Can the process be preempted, or does it hold the CPU while waiting for the CUDA call to return?
Also, suppose I have two CUDA processes and use round-robin scheduling between them. What happens if a process's time slice expires while it is executing code on the GPU? Does the process release the CPU, or does it still hold the CPU? And how does the GPU notify the host that the GPU work is done? By interrupt? Thanks.
CUDA kernel launches are asynchronous. This means that the CPU thread initiating the kernel launch makes a call into a library which starts the GPU processing. This library routine returns control to the CPU thread before the kernel has actually begun executing. The CPU thread can continue processing your code at that point (any code you have written after the point of the kernel launch), while the GPU kernel is executing.
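This asynchrony is easy to observe directly. A minimal sketch (kernel name, sizes, and the artificial work loop are all illustrative, not from the original post):

```cuda
#include <cstdio>

// A kernel that takes a noticeable amount of time to finish.
__global__ void slow_kernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        for (int k = 0; k < 10000; ++k)
            data[i] = data[i] * 1.0000001f;  // busywork so the kernel runs a while
}

int main()
{
    const int n = 1 << 20;
    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));

    slow_kernel<<<(n + 255) / 256, 256>>>(d_data, n);
    // The launch above returns control immediately; this printf executes
    // on the CPU while the kernel is (typically) still running on the GPU.
    printf("kernel launched, CPU thread continues\n");

    cudaDeviceSynchronize();  // CPU thread now waits until the GPU work is done
    printf("kernel finished\n");

    cudaFree(d_data);
    return 0;
}
```

The first `printf` will usually appear well before the kernel completes, which is the asynchronous behavior described above.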
Two or more CPU processes can share a GPU in default compute mode, through a mechanism known as context-switching. A description of a GPU context is given in the programming guide:
It is, roughly speaking, the GPU state associated with a CPU process that is using the GPU. Two separate processes will usually have two separate contexts, if they are using the same GPU.
The detailed behavior of context switching is not specified anywhere that I know of, but a general rule is that while one (or more) kernel(s) from a particular process is executing, no kernels from any other process may execute. When the kernel(s) from that process finish/terminate, the GPU may, at its unspecified discretion, choose to process additional work from the same process/context (e.g. more kernel launches, perhaps), or it may choose to context-switch and service work requests from other processes.
Again, I know of no concise, unified specification for GPU context-switching that answers detailed questions such as how and under what circumstances a context-switch will occur.
Normally, when using the CUDA runtime API, a GPU context is destroyed when the CPU process owning it terminates. Context destruction should result in automatic release of any resources (e.g. GPU memory allocations) still owned by that context.
Thanks for your reply. Can I understand it as the GPU working like a keyboard: if a kernel finishes while the host CPU process is waiting (not holding the CPU), the GPU raises an interrupt to wake the host CPU process up?
That sort of low-level description of how the hardware interacts with its driver is not documented anywhere that I know of.
From a programmer’s perspective, it should be sufficient for most cases I can imagine, simply to acknowledge that the GPU and driver have communication paths between them, and somehow the driver keeps track of the GPU state, and knows when to issue new work.
The only time a host CPU process would be waiting on the GPU is if it encountered a synchronization point, such as a call to cudaDeviceSynchronize() or cudaMemcpy(), to pick two possible examples. Somehow, the CPU thread “waits” on the GPU/GPU driver at these points, and somehow the driver allows the host thread to continue when ready. Since these points involve calls into the CUDA runtime library, I would think a sufficient mental model is that the relevant library routine does not return until the condition is satisfied.
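As a concrete illustration of such a synchronization point, a device-to-host `cudaMemcpy` on the default stream will not return until prior work on that stream has finished and the copy is complete, so the destination buffer is valid immediately after the call. A minimal sketch (kernel and buffer names are made up for the example):

```cuda
#include <cstdio>

__global__ void fill(int *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = i;
}

int main()
{
    const int n = 256;
    int h_out[n];
    int *d_out;
    cudaMalloc(&d_out, n * sizeof(int));

    fill<<<1, n>>>(d_out, n);  // asynchronous: returns immediately

    // This call is a synchronization point: the CPU thread does not get
    // past it until the kernel has finished and the copy has completed.
    cudaMemcpy(h_out, d_out, n * sizeof(int), cudaMemcpyDeviceToHost);

    printf("h_out[100] = %d\n", h_out[100]);  // safe to read here

    cudaFree(d_out);
    return 0;
}
```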
The programmer has some control over the library and machine behavior at these thread-blocking points, i.e. whether the waiting CPU thread spin-waits or yields, via CUDA runtime API calls which modify this CPU thread blocking behavior:
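For example, `cudaSetDeviceFlags` lets you choose the waiting strategy before the context is created (i.e. before the first runtime call that touches the device). A sketch:

```cuda
#include <cstdio>

int main()
{
    // Must be called before the context is created for the device.
    // Options include:
    //   cudaDeviceScheduleSpin         - CPU thread spin-waits (low latency, burns CPU)
    //   cudaDeviceScheduleYield        - CPU thread yields while waiting
    //   cudaDeviceScheduleBlockingSync - CPU thread blocks on a synchronization primitive
    //   cudaDeviceScheduleAuto         - heuristic default
    cudaError_t err = cudaSetDeviceFlags(cudaDeviceScheduleBlockingSync);
    if (err != cudaSuccess) {
        printf("cudaSetDeviceFlags failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // Subsequent blocking calls such as cudaDeviceSynchronize() will now
    // block the CPU thread rather than spin-waiting, freeing the CPU core
    // for other processes while the GPU works.
    cudaDeviceSynchronize();
    return 0;
}
```

With `cudaDeviceScheduleBlockingSync`, a CPU thread waiting at a synchronization point releases the CPU to the OS scheduler, which connects back to the original question about whether a waiting process holds the CPU.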