This might sound like a very basic question, but it has bothered me for a while. When I run a CUDA program on my NVIDIA GPU laptop, computer graphics rendering seems just fine while the CUDA program is running. But don't they both need to use the GPU? As far as I know, GPUs do not run an operating system and switch between processes, etc. So I can't figure out how the GPU is able to run my CUDA program without pausing the screen graphics rendering.
The GPU does context switching using various strategies; some are opportunistic, some are preemptive.
In the preemptive case, for example, the GPU will switch in the graphics context, do graphics work for a short period of time, then switch out the graphics context (writing its data and state out to memory) and switch in the compute context.
Conceptually there are similarities to the way a CPU core might switch from one thread of execution to another.
Thanks for the enlightenment! If multiple processes on the CPU launch their own GPU kernels, does the GPU execute these kernels sequentially (only starting the next one when the current one is finished), or does it context-switch between the kernels? Context switch in the sense that the whole execution context is stored in global memory, all caches are invalidated, etc.
Typically, using the runtime API, each CPU process will have its own context. The kernel execution in one context will not overlap the kernel execution in another context, unless you use MPS. There are many questions that cover this particular topic, if you care to search for them.
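To make the MPS route concrete, here is a minimal sketch of starting and stopping the MPS control daemon so that multiple processes can share the GPU; exact paths, permissions, and device selection vary by system, and `my_cuda_app` is a placeholder for your own binaries:

```shell
# Restrict MPS to one device (illustrative; adjust to your system)
export CUDA_VISIBLE_DEVICES=0

# Start the MPS control daemon in the background
nvidia-cuda-mps-control -d

# Run the CUDA processes that should share the GPU;
# their kernels may now execute concurrently rather than serialized
./my_cuda_app &
./my_cuda_app &
wait

# Shut the daemon down when done
echo quit | nvidia-cuda-mps-control
```

Without the daemon running, each process gets its own context and the GPU time-slices between them instead of overlapping their kernels.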
Thanks again for the help! I have searched the question on both the forum and Stack Overflow and did find many related discussions. Is it possible to confirm the following understanding?
All kernels will be serialized on the GPU, no matter whether they are launched by the same or different CPU processes. So normally there is no context switching between multiple kernels. However, with MPS, it's possible to make kernels run concurrently on the GPU, while the context switching details are not specified.
For future readers, some related discussions I found:
Kernels in the same CPU process (i.e. the same context) have the ability to run overlapped, not serialized. This is a fairly involved topic that many folks ask about (why don't I witness concurrent kernels?). There is a CUDA sample code that demonstrates some of the needed conditions (e.g. stream usage) for kernels to overlap or run concurrently in the same context.
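A minimal sketch of those conditions, in the spirit of the concurrentKernels sample but not taken from it: the busy-wait kernel, the stream count, and the launch sizes below are illustrative assumptions. The key points are launching into distinct non-default streams and keeping each grid small enough that the device has free resources to overlap them.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Busy-wait kernel: spins for roughly `cycles` clock ticks so that
// overlap between streams is long enough to observe in a profiler.
__global__ void spin_kernel(long long cycles)
{
    long long start = clock64();
    while (clock64() - start < cycles) { /* keep the SM busy */ }
}

int main()
{
    const int nStreams = 4;
    cudaStream_t streams[nStreams];
    for (int i = 0; i < nStreams; ++i)
        cudaStreamCreate(&streams[i]);

    // One small launch per non-default stream: kernels issued to the
    // same (default) stream would serialize; distinct streams plus
    // spare device resources are what permit concurrency.
    for (int i = 0; i < nStreams; ++i)
        spin_kernel<<<1, 64, 0, streams[i]>>>(1000000LL);

    cudaDeviceSynchronize();
    for (int i = 0; i < nStreams; ++i)
        cudaStreamDestroy(streams[i]);
    printf("done\n");
    return 0;
}
```

Viewing a run of this under a profiler such as Nsight Systems is the usual way to confirm whether the kernels actually overlapped on your device.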
Ah, thanks a lot for the further clarification!