I have a question about GPU MPS resource sharing.
When the total MPS quota of multiple processes is less than or equal to 100, different processes tend to use independent SMs, which is easy to understand. However, when the total MPS quota exceeds 100, meaning oversubscription occurs, what happens on the GPU hardware?
I noticed that the MPS documentation states: “Setting the limit does not reserve dedicated resources for any MPS client context. It simply limits how much resources can be used by a client context. Kernels launched from different MPS client contexts may execute on the same SM, depending on load-balancing.”
But I’m wondering, when oversubscription occurs, how are resources shared on the SMs that need to handle more than one process? Do they time-share the SM, or are warps from different processes mixed and scheduled without distinction?
The general principle of MPS is that work from separate processes is treated as if it emanated from a single process. This is a mental model, not an exact description of behavior applicable to every possible related question (for example, isolation behavior may not be adequately described by that mental model).
Given that mental model, work from two separate kernels issued from two separate processes under MPS can occupy the same SM at the same time; there is no time-sharing or time-slicing. The behavior is roughly the same as concurrent kernel execution within a single process.
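To put rough numbers on the oversubscription case, here is a small arithmetic sketch. It assumes, as a simplification and not as documented MPS behavior, that a percentage limit acts as a proportional cap on how many SMs a client can use, rounded up. The SM count of 108 and the helper names `sm_cap` and `min_overlap` are made up for illustration.

```python
def sm_cap(total_sms, percent):
    """Assumed model: a client may use at most ceil(total_sms * percent / 100) SMs."""
    return min(total_sms, -(-total_sms * percent // 100))  # integer ceiling

def min_overlap(total_sms, percent_a, percent_b):
    """Fewest SMs two clients must share if both run at their full cap."""
    excess = sm_cap(total_sms, percent_a) + sm_cap(total_sms, percent_b) - total_sms
    return max(0, excess)

# Two clients at 60% each on a hypothetical 108-SM GPU:
print(min_overlap(108, 60, 60))  # prints 22
# Two clients at 50% each can, in principle, stay disjoint:
print(min_overlap(108, 50, 50))  # prints 0
```

This is only a lower bound under the stated assumption. Since the limit does not reserve SMs and placement depends on load-balancing, the actual overlap can be larger, and on any shared SM the blocks of both clients are resident together rather than time-sliced.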
I have a follow-up question. Does the size of a single kernel within a process affect the sharing between processes? For example, suppose an SM is processing blocks from Process A and B. If the blocks from Process A are very large and contain too many warps, could this prevent the warps from Process B from effectively sharing the SM’s resources in a spatial manner?
Yes, it could, and that follows directly from the mental model of concurrent kernels in a single process.
In a concurrent kernel scenario (single process) or in an MPS scenario (multi-process), if kernel A consumes enough resources, it may fully occupy an SM or even all the SMs of a GPU, thus preventing kernel B from being scheduled (i.e. deposited for execution) on that SM, or correspondingly, anywhere on the GPU, at that moment.
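A toy model of one SM may make this easier to see. It is a sketch only: it tracks resident warps and ignores registers, shared memory, and block-slot limits, and it is not a description of the real hardware block scheduler. The limit of 64 warps per SM is an assumed, illustrative value, and `place_blocks` is a hypothetical helper.

```python
MAX_WARPS_PER_SM = 64  # assumed illustrative limit; varies by GPU architecture
WARP_SIZE = 32

def warps_per_block(threads_per_block):
    """Number of warps one block needs (rounded up)."""
    return -(-threads_per_block // WARP_SIZE)

def place_blocks(free_warps, threads_per_block, blocks_wanted):
    """Place as many whole blocks as fit; return (blocks_placed, free_warps_left)."""
    need = warps_per_block(threads_per_block)
    placed = min(blocks_wanted, free_warps // need)
    return placed, free_warps - placed * need

# Case 1: kernel A uses 1024-thread blocks (32 warps each) and arrives first.
placed_a, free = place_blocks(MAX_WARPS_PER_SM, 1024, blocks_wanted=4)
placed_b, free = place_blocks(free, 256, blocks_wanted=4)
print(placed_a, placed_b)  # prints 2 0

# Case 2: kernel A uses 256-thread blocks (8 warps each) and wants only 4 blocks.
placed_a, free = place_blocks(MAX_WARPS_PER_SM, 256, blocks_wanted=4)
placed_b, free = place_blocks(free, 256, blocks_wanted=4)
print(placed_a, placed_b)  # prints 4 4
```

In the first case two blocks of kernel A take every warp slot, so no block of kernel B can become resident on that SM until a block of A retires. In the second case both kernels share the SM. The same reasoning applies whether A and B come from one process or from two MPS clients.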
MPS is not a guarantee of co-scheduling or concurrency. Yes, with proper resource restrictions for the various client processes, MPS can more-or-less guarantee co-scheduling or concurrency, but you have basically pulled that scenario out of consideration with your question structure, in my view.
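For reference, one way to apply such per-client restrictions is the `CUDA_MPS_ACTIVE_THREAD_PERCENTAGE` environment variable, set in each client's environment before it starts. The sketch below only builds the environments and checks whether the limits oversubscribe; the client commands `./client_a` and `./client_b` and the helper names are hypothetical, and nothing is launched.

```python
import os

def client_env(percent, base_env=None):
    """Return an environment dict with the MPS active thread percentage set."""
    if not 0 < percent <= 100:
        raise ValueError("percent must be in (0, 100]")
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_MPS_ACTIVE_THREAD_PERCENTAGE"] = str(percent)
    return env

def is_oversubscribed(percents):
    """True when the limits sum to more than 100, i.e. the oversubscription case."""
    return sum(percents) > 100

# Hypothetical clients and limits.
limits = {"./client_a": 50, "./client_b": 50}
envs = {cmd: client_env(pct, base_env={}) for cmd, pct in limits.items()}
print(is_oversubscribed(limits.values()))  # prints False

# Each client could then be started with, for example:
#   subprocess.Popen([cmd], env=envs[cmd])
```

Keeping the sum at or below 100 is what makes co-scheduling likely; once the sum exceeds 100, the situation described above applies and a large kernel from one client can crowd out another.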