Question about CUDA MPS

Dear all,
I am reading the MPS documentation released by NVIDIA and it reports that:
Each MPS client process has a fully isolated address space, and therefore each client context allocates independent context storage and scheduling resources. The doc also states that those resources scale with the number of threads available to the client.
However, by default each MPS client has all available threads usable (set_active_thread_percentage 100).
My question is:
How do the client processes work with the default MPS server, given that every process has all the resources (at least the threads, and consequently the hardware resources like cores and SMs) available to it? How can they run concurrently?

Thank you in advance.

They run concurrently in the same way that kernels launched from a single process run concurrently. That is really the fundamental concept and mechanism behind MPS.

All resources are shared.

A kernel has an execution model that allows it to potentially fill the entire machine. If other kernels are executing already, then only the remaining resources can be used to support the launch/execution of a new kernel. If those resources are not sufficient to begin execution of the kernel, then the kernel will wait in a queue until necessary resources free up.
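For illustration only, here is a rough single-process analogue (the kernel, grid sizes, and work loop are made up, not taken from the MPS doc). Two kernels launched into separate streams overlap only if the first one leaves execution resources free; under MPS, kernels arriving from separate client processes are subject to the same resource accounting.

#include <cstdio>
#include <cuda_runtime.h>

// Illustrative kernel that keeps the SMs busy for a while.
__global__ void busyKernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v = data[i];
        for (int k = 0; k < 10000; ++k)
            v = v * 1.0000001f + 0.0000001f;
        data[i] = v;
    }
}

int main()
{
    const int n = 1 << 20;
    float *a = nullptr, *b = nullptr;
    cudaMalloc(&a, n * sizeof(float));
    cudaMalloc(&b, n * sizeof(float));
    cudaMemset(a, 0, n * sizeof(float));
    cudaMemset(b, 0, n * sizeof(float));

    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    // A 64-block grid is far from filling a modern GPU, so the second
    // launch can begin on the remaining resources; if the first launch
    // had occupied the whole machine, the second would wait in a queue.
    busyKernel<<<64, 256, 0, s1>>>(a, n);
    busyKernel<<<64, 256, 0, s2>>>(b, n);

    cudaDeviceSynchronize();
    printf("status: %s\n", cudaGetErrorString(cudaGetLastError()));

    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
    cudaFree(a);
    cudaFree(b);
    return 0;
}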

Memory allocations are a special case. Memory allocations occur just as if they emanated from a single process. Therefore, ignoring unified memory oversubscription, if the sum total of requests from cudaMalloc operations from all processes exceeds the available GPU memory, the memory allocation request that exceeds that amount will fail. Again, the concept here is that the resources are shared.
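As a rough sketch of that shared pool (the chunk size and loop bound are arbitrary choices on my part), each MPS client sees the consequences simply through cudaMalloc return codes: whichever request, from whichever client, exceeds what is left on the device is the one that fails.

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    const size_t chunk = 256ull << 20;   // 256 MiB per request
    void *ptrs[64] = {nullptr};

    for (int i = 0; i < 64; ++i) {
        cudaError_t err = cudaMalloc(&ptrs[i], chunk);
        if (err != cudaSuccess) {
            // The device memory, shared by all MPS clients, is exhausted
            // at this point; the request that exceeds the remaining
            // amount simply fails (ignoring unified memory oversubscription).
            printf("allocation %d failed: %s\n", i, cudaGetErrorString(err));
            break;
        }
        printf("allocation %d (256 MiB) succeeded\n", i);
    }

    for (int i = 0; i < 64; ++i)
        if (ptrs[i]) cudaFree(ptrs[i]);
    return 0;
}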

Thank you for your quick reply!

So in the default mode of MPS, let’s say I will run 2 different processes with kernels.
That means that every kernel will use the same resources it would use if it were running alone. The reason they can indeed run concurrently is that the full set of resources one kernel needs is less than the full resources of the GPU, and therefore there are free resources left for the other kernel too.
If I use set_active_thread_percentage 50, the kernel will just use half of the resources it would use if it were running alone, and therefore I reduce the chance of filling up the resources of the whole GPU.
Am I correct in this? I would like to know if I have understood it correctly.
I ask because I was expecting that when I run the 2 processes, the GPU resources would be split in half, independently of what resources each process would use on its own.

Additionally, I observed that the only resource control exposed by MPS control is the percentage of active threads. Is this the only parameter that affects the GPU resources (apart from the global memory)?

I would not phrase it that way.

The low level details of block scheduling are largely unpublished. I wouldn’t be able to precisely describe the behavior in a general case of “two processes launching kernels”. However, in my view, it is certainly possible that the observed execution behavior (e.g. the block scheduling order, and the SMs they are deposited on, as well as the observed throughput) could vary quite a bit between the case where kernels are running concurrently and their behavior “if it was running alone”. For example, suppose I have kernel 1 from process A and kernel 2 from process B. Kernel 1 includes 1000 blocks. Kernel 2 includes 1000 blocks. If I launch them at the same time, it’s possible that Kernel 2 could occupy all but one of the SMs on a GPU, and the blocks of Kernel 1 could be scheduled strictly on a single SM. And there are an infinite number of other possible scenarios, given the level of description in the general case you gave.

I will probably refrain from answering any further questions in this vein. I would encourage you to recast your thinking into the idea that the GPU is a throughput machine. You give it some work to do, and it will proceed through that work at the fastest possible rate. In my opinion, there is not enough published information in CUDA documentation to go substantially beyond that with a high degree of precision.

Yes, you can restrict how many threads are available for use by a particular client, or server, in this fashion.

Control over the active thread percentage (per client, or per server) is the only tool I am aware of at this time to manage access between clients.
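For illustration, here is a minimal sketch of what a single client could do to confirm the limit it was started with. I am assuming the per-client value is delivered through the CUDA_MPS_ACTIVE_THREAD_PERCENTAGE environment variable and that set_active_thread_percentage is issued to the control daemon; check the exact names and usage against the doc linked below.

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

int main()
{
    // Per-client limit, typically set in the environment before the
    // client process starts (assumption: see the MPS doc for details).
    const char *pct = getenv("CUDA_MPS_ACTIVE_THREAD_PERCENTAGE");
    printf("active thread percentage for this client: %s\n",
           pct ? pct : "not set (default 100)");

    // The SM count is printed only for reference; the percentage limits
    // how much of the execution resources this client's kernels may use.
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    printf("SMs on device 0: %d\n", prop.multiProcessorCount);
    return 0;
}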

All configuration settings available to you are documented in the MPS doc.

https://docs.nvidia.com/deploy/pdf/CUDA_Multi_Process_Service_Overview.pdf

Ok, I would like to ask you for one final clarification.
When you refer to the clients, you mean the different processes that will run concurrently, right?
I have come across this term many times in the MPS doc (the one you posted too), but I did not find its meaning clear.
So by clients do we mean the processes we will run concurrently, or the users that will run the MPS daemon?

Yes, the doc is clear that there will only be one user running the MPS daemon at a time. If another user tries to start the MPS daemon, the one already running will be killed and a new one will be spawned.

The different processes that run concurrently by connecting to the MPS server are the clients. This pdf might help: http://on-demand.gputechconf.com/gtc/2015/presentation/S5584-Priyanka-Sah.pdf