I am curious about the process status when a process calls CUDA code. Suppose a process consists of 3 parts: part 1: CPU code; part 2: GPU code; part 3: CPU code. While the process is in the CUDA code, what is the status of the CPU part of the process? Can the process be preempted, or does it hold the CPU while waiting for the CUDA call to return?
Also, suppose I have two CUDA processes and use round-robin scheduling between them. What happens if a process's time slice expires while it is executing code on the GPU? Does the process release the CPU, or does it still hold the CPU? And how does the GPU notify the host that the GPU work is done? By interrupt? Thanks.
CUDA kernel launches are asynchronous. This means that the CPU thread initiating the kernel launch makes a call into a library which starts the GPU processing. This library routine returns control to the CPU thread before the kernel has actually begun executing. The CPU thread can continue processing your code at that point (any code you have written after the point of the kernel launch), while the GPU kernel is executing.
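This asynchrony is easy to observe directly. A minimal sketch (kernel name, sizes, and the artificial work loop are all illustrative, not from the original post):

```cuda
#include <cstdio>

// A kernel that takes a noticeable amount of time to finish.
__global__ void slow_kernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        for (int k = 0; k < 10000; ++k)
            data[i] = data[i] * 1.0000001f;  // busywork so the kernel runs a while
}

int main()
{
    const int n = 1 << 20;
    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));

    slow_kernel<<<(n + 255) / 256, 256>>>(d_data, n);
    // The launch above returns control immediately; this printf executes
    // on the CPU while the kernel is (typically) still running on the GPU.
    printf("kernel launched, CPU thread continues\n");

    cudaDeviceSynchronize();  // CPU thread now waits until the GPU work is done
    printf("kernel finished\n");

    cudaFree(d_data);
    return 0;
}
```

The first `printf` will usually appear well before the kernel completes, which is the asynchronous behavior described above.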
Two or more CPU processes can share a GPU in default compute mode, through a mechanism known as context-switching. A description of a GPU context is given in the programming guide:
It is, roughly speaking, the GPU state associated with a CPU process that is using the GPU. Two separate processes will usually have two separate contexts, if they are using the same GPU.
The detailed behavior of context switching is not specified anywhere that I know of, but a general rule is that while one (or more) kernel(s) from a particular process is executing, no kernels from any other process may execute. When the kernel(s) from that process finish/terminate, the GPU may, at its unspecified discretion, choose to process additional work from the same process/context (e.g. more kernel launches, perhaps), or it may choose to context-switch and service work requests from other processes.
Again, I know of no concise, unified specification for GPU context-switching that answers detailed questions such as how and under what circumstances a context-switch will occur.
Normally, when using the CUDA runtime API, a GPU context is destroyed when the CPU process owning it terminates. Context destruction should result in automatic release of any resources (e.g. GPU memory allocations) still owned by that context.
Thanks for your reply. Can I understand it as the GPU working like a keyboard: if a kernel finishes while the host CPU process is waiting (not holding the CPU), the GPU raises an interrupt to wake the host CPU process up?
That sort of low-level description of how the hardware interacts with its driver is not documented anywhere that I know of.
From a programmer’s perspective, it should be sufficient for most cases I can imagine, simply to acknowledge that the GPU and driver have communication paths between them, and somehow the driver keeps track of the GPU state, and knows when to issue new work.
The only time a host CPU process would be waiting on the GPU is if it encountered a synchronization point, such as a call to cudaDeviceSynchronize() or cudaMemcpy(), to pick two possible examples. Somehow, the CPU thread “waits” on the GPU/GPU driver at these points, and somehow the driver allows the host thread to continue when ready. Since these points involve calls into the CUDA runtime library, I would think a sufficient mental model is that the relevant library routine does not return until the condition is satisfied.
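As a concrete illustration of such a synchronization point, a device-to-host `cudaMemcpy` on the default stream will not return until prior work on that stream has finished and the copy is complete, so the destination buffer is valid immediately after the call. A minimal sketch (kernel and buffer names are made up for the example):

```cuda
#include <cstdio>

__global__ void fill(int *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = i;
}

int main()
{
    const int n = 256;
    int h_out[n];
    int *d_out;
    cudaMalloc(&d_out, n * sizeof(int));

    fill<<<1, n>>>(d_out, n);  // asynchronous: returns immediately

    // This call is a synchronization point: the CPU thread does not get
    // past it until the kernel has finished and the copy has completed.
    cudaMemcpy(h_out, d_out, n * sizeof(int), cudaMemcpyDeviceToHost);

    printf("h_out[100] = %d\n", h_out[100]);  // safe to read here

    cudaFree(d_out);
    return 0;
}
```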
The programmer has some control over the library and machine behavior at these thread-blocking points, i.e. whether the waiting CPU thread spin-waits or yields, via CUDA runtime API calls which modify this CPU thread blocking behavior:
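For example, `cudaSetDeviceFlags` lets you choose the waiting strategy before the context is created (i.e. before the first runtime call that touches the device). A sketch:

```cuda
#include <cstdio>

int main()
{
    // Must be called before the context is created for the device.
    // Options include:
    //   cudaDeviceScheduleSpin         - CPU thread spin-waits (low latency, burns CPU)
    //   cudaDeviceScheduleYield        - CPU thread yields while waiting
    //   cudaDeviceScheduleBlockingSync - CPU thread blocks on a synchronization primitive
    //   cudaDeviceScheduleAuto         - heuristic default
    cudaError_t err = cudaSetDeviceFlags(cudaDeviceScheduleBlockingSync);
    if (err != cudaSuccess) {
        printf("cudaSetDeviceFlags failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // Subsequent blocking calls such as cudaDeviceSynchronize() will now
    // block the CPU thread rather than spin-waiting, freeing the CPU core
    // for other processes while the GPU works.
    cudaDeviceSynchronize();
    return 0;
}
```

With `cudaDeviceScheduleBlockingSync`, a CPU thread waiting at a synchronization point releases the CPU to the OS scheduler, which connects back to the original question about whether a waiting process holds the CPU.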