MPS resource management

Hi guys,

I have some questions about MPS resource management.

First of all, how does MPS allocate GPU resources to multiple clients?
Is this done by assigning specific SMs to each kernel, without overlap?

If so, my second question is: how does MPS assign specific SMs to each kernel?
I found some information about a TMD mask, which indicates which SMs should be involved in a kernel.
Does MPS use this TMD masking?


In the general case (without specifying per-client percentages) the general mental model should be the same as if the requests were issued from the same process, both for memory usage and compute utilization. SMs are not partitioned in the general case.

If you specify per-client resource partitioning percentages, then yes, the SMs are allocated to specific clients for use by any kernels launched by those clients. In the typical allocation scheme, specifying e.g. 20% for a client means that the client’s kernels cannot use more than 20% of the SMs on the GPU, so this is a more careful definition than “SMs are allocated to specific clients”. Note the statement from the linked doc:

Setting the limit does not reserve dedicated resources for any MPS client context. It simply limits how much resources can be used by a client context. Kernels launched from different MPS client contexts may execute on the same SM, depending on load-balancing.
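
To make the per-client limit concrete, here is a minimal sketch of how a client could be started with a cap. It assumes the limit is supplied through the CUDA_MPS_ACTIVE_THREAD_PERCENTAGE environment variable in the client's environment before it creates its CUDA context. The helper names (mps_client_env, rough_sm_cap, launch_client) are hypothetical and only for illustration, and the SM estimate is a rough guess: the driver rounds the limit to a hardware-supported granularity, so the real figure may differ.

```python
import math
import os
import subprocess


def mps_client_env(percentage, base_env=None):
    """Return a copy of the environment with the per-client MPS limit set.

    The variable only takes effect if it is present when the client
    process creates its CUDA context.
    """
    if not 0 < percentage <= 100:
        raise ValueError("percentage must be in (0, 100]")
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_MPS_ACTIVE_THREAD_PERCENTAGE"] = str(percentage)
    return env


def rough_sm_cap(total_sms, percentage):
    """Rough upper-bound estimate of usable SMs (assumption: simple ceiling).

    This is NOT the driver's actual rounding rule, which is not modeled here.
    """
    return math.ceil(total_sms * percentage / 100)


def launch_client(cmd, percentage):
    """Start a client command (hypothetical) under the given cap."""
    return subprocess.run(cmd, env=mps_client_env(percentage), check=True)
```

For example, launch_client(["./my_cuda_app"], 20) would start a (hypothetical) binary limited to roughly 20% of the SMs, and rough_sm_cap(108, 20) gives 22 as a ballpark on a 108-SM GPU. Per the quote above, this is a cap, not a reservation: other clients' kernels may still run on the same SMs.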

As for whether MPS uses TMD masking: I don’t know, and I’m fairly confident that information is neither published nor specified by NVIDIA.