I’m not able to assess your knowledge level, and if we start from first principles this could be a fairly lengthy topic. What I said was that the applications can run at the same time, but the kernels will serialize.
A CUDA application consists of host code and device code. Roughly speaking, the device code can be thought of as a set of kernel launches that run asynchronously with respect to the host code. So let’s imagine an application that runs host code for about 1 second, launches a kernel that runs for 1 second, does another second of host processing, launches a second kernel that runs for 1 second, does 1 more second of host processing, and then exits. For simplicity, assume the host waits for each kernel to finish before moving on.
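Please forgive the ASCII art. The application timeline might look like this (4 characters per second; A is the application as a whole, K is kernel execution):
    AAAAAAAAAAAAAAAAAAAA  (host)
        KKKK    KKKK      (kernel)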
Now suppose we launched 2 of these applications, in two different processes (e.g. two different command prompts/terminals), on a machine with a single GPU. Let’s also suppose that MPS (the CUDA Multi-Process Service) is not in use. In CUDA, each application/process that uses a GPU creates its own context. A context can be thought of much like a process address space on the CPU/host system: each host process has its own separate device context associated with it.
Coming back to our timeline, the processes might look like this:
    P1: AAAAAAAAAAAAAAAAAAAA              <--|
            KKKK    KKKK          <--|       |  the applications run at the "same time"
              the kernels serialize  |       |
    P2: AAAAAAAAAAAAAAAAAAAAAAAA     |    <--|
                KKKK    KKKK      <--|
Those two applications were started at the same time. My claim is that from the user’s perspective (and from the standpoint of monitoring utilities such as top) they appear to be running at the same time. Under the hood, however, the kernels do not run at the same time; they serialize. This serialization has some effect on the overall timeline: in the case above, each of P2’s kernels has to wait for one of P1’s kernels to finish, so P2 runs for 1 second longer than P1.
So what does MPS do? In a nutshell, MPS acts as an intermediary/proxy between user applications and the GPU resources. MPS effectively “funnels” the activity of all user applications into a single GPU context. This may have little or no effect on the double timeline above, because many, probably most, CUDA kernels are written in such a way that they fully occupy the GPU anyway. If a kernel fully occupies the GPU, there really isn’t any opportunity for kernel concurrency, even with MPS. MPS only makes an obvious difference when the resource utilization of the kernels in question is small enough that concurrency is possible. In that scenario, MPS enables kernel concurrency that would not normally be possible between independent user processes.
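If it helps to see the arithmetic behind the two cases, here is a toy model in Python. It is only a sketch and not how the GPU scheduler actually works: the names `APP`, `simulate` and `gpu_slots` are made up for illustration, each process is assumed to wait for its kernel before continuing, and kernels are served in the order they are requested. With one slot the kernels serialize, as in the timeline above. With two slots, two kernels can overlap, which stands in for the case where the kernels are small enough for MPS to run them concurrently.

```python
import heapq

# One run of the application described above: (phase kind, duration in seconds).
APP = [("host", 1), ("kernel", 1), ("host", 1), ("kernel", 1), ("host", 1)]


def simulate(num_procs, phases, gpu_slots=1):
    """Return the finish time of each process, all started at t = 0.

    gpu_slots is how many kernels the toy GPU can execute at once
    (an assumption of this model): 1 means kernels serialize,
    2 means two small kernels can run concurrently.
    """
    slots = [0] * gpu_slots  # min-heap of times at which each slot becomes free
    # Pending work as (ready time, process index, phase index).
    events = [(0, p, 0) for p in range(num_procs)]
    heapq.heapify(events)
    finish = [0] * num_procs
    while events:
        ready, proc, idx = heapq.heappop(events)
        kind, duration = phases[idx]
        if kind == "kernel":
            # Wait for the earliest free slot if the GPU is busy.
            start = max(ready, heapq.heappop(slots))
            end = start + duration
            heapq.heappush(slots, end)
        else:
            end = ready + duration
        finish[proc] = end
        if idx + 1 < len(phases):
            heapq.heappush(events, (end, proc, idx + 1))
    return finish


print(simulate(2, APP, gpu_slots=1))  # [5, 6]: kernels serialize, P2 takes 1 s longer
print(simulate(2, APP, gpu_slots=2))  # [5, 5]: kernels overlap, both finish together
```

The second result only applies when both kernels genuinely fit on the GPU at once. For kernels that fill the GPU, the first result is the one to expect, with or without MPS.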