I am reading the MPS documentation released by NVIDIA and it reports that:
Each MPS client process has a fully isolated address space, and therefore each client context allocates independent context storage and scheduling resources. In addition, it reports that those resources scale with the number of threads available to the client.
However, by default each MPS client has all available threads usable (set_active_thread_percentage 100).
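For reference, this setting is managed through the MPS control daemon's command interface. A minimal session might look like the following; the pipe/log directories and device ID are illustrative choices, not required values:

```shell
# Start the MPS control daemon (illustrative directories)
export CUDA_VISIBLE_DEVICES=0
export CUDA_MPS_PIPE_DIRECTORY=/tmp/nvidia-mps
export CUDA_MPS_LOG_DIRECTORY=/tmp/nvidia-mps-log
nvidia-cuda-mps-control -d

# Query and change the default active thread percentage,
# which applies to servers started after the change
echo "get_default_active_thread_percentage" | nvidia-cuda-mps-control
echo "set_default_active_thread_percentage 100" | nvidia-cuda-mps-control
```

There is also a per-server variant, `set_active_thread_percentage <server PID> <percentage>`, which adjusts an already-running server.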
And my question is:
How do the client processes work with the default MPS server, given that every process has all the resources (at least the threads, and consequently hardware resources such as cores and SMs) to itself? How can they run concurrently?
They run concurrently in the same way that kernels launched from a single process run concurrently. That is really the fundamental concept and mechanism that is called MPS.
All resources are shared.
A kernel has an execution model that allows it to potentially fill the entire machine. If other kernels are executing already, then only the remaining resources can be used to support the launch/execution of a new kernel. If those resources are not sufficient to begin execution of the kernel, then the kernel will wait in a queue until necessary resources free up.
Memory allocations are a special case. Memory allocations occur just as if they emanated from a single process. Therefore, ignoring unified memory oversubscription, if the sum total of requests from cudaMalloc operations from all processes exceeds the available GPU memory, the memory allocation request that exceeds that amount will fail. Again, the concept here is that the resources are shared.
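Because an allocation that pushes the combined total of all clients past available device memory will fail, each client should check the status returned by cudaMalloc. A minimal sketch, assuming a CUDA-capable device is present (the 1 GiB size is arbitrary):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    void *buf = nullptr;
    size_t bytes = 1ull << 30;  // 1 GiB, arbitrary example size

    // Under MPS, this request competes with allocations from all
    // other client processes sharing the GPU.
    cudaError_t err = cudaMalloc(&buf, bytes);
    if (err != cudaSuccess) {
        // e.g. cudaErrorMemoryAllocation when the combined requests
        // from all MPS clients exceed available device memory
        fprintf(stderr, "cudaMalloc failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    cudaFree(buf);
    return 0;
}
```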
So in the default mode of MPS, let’s say I will run 2 different processes with kernels.
That means that every kernel will use the resources it would use if it were running alone. The fact that they can indeed run concurrently means that the full resources one kernel needs are less than the full resources of the GPU, and therefore there are free resources for the other kernel too.
If I use set_active_thread_percentage 50, the kernel will use only half of the resources it would use if it were running alone, and therefore I reduce the chance of filling up the resources of the whole GPU.
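As a point of comparison, the same kind of limit can also be applied to an individual client through an environment variable set before that client process starts; the value 50 here mirrors the example above, and the binary name is hypothetical:

```shell
# Restrict this particular client to roughly half of the
# available threads before launching it
export CUDA_MPS_ACTIVE_THREAD_PERCENTAGE=50
./my_cuda_app   # hypothetical client binary
```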
Am I correct in this? I would like to know if I have understood it correctly.
Because I was expecting that when I run the 2 processes, the GPU resources would be shared in half, independently of what resources a single process would use.
Additionally, I observed that the only control over resources from MPS control is the percentage of active threads. Is this the only parameter that affects the GPU resources (apart from global memory)?
The low level details of block scheduling are largely unpublished. I wouldn’t be able to precisely describe the behavior in a general case of “two processes launching kernels”. However, in my view, it is certainly possible that the observed execution behavior (e.g. the block scheduling order, and the SMs they are deposited on, as well as the observed throughput) could vary quite a bit between the case where kernels are running concurrently and their behavior “if it was running alone”. For example, suppose I have kernel 1 from process A and kernel 2 from process B. Kernel 1 includes 1000 blocks. Kernel 2 includes 1000 blocks. If I launch them at the same time, it’s possible that Kernel 2 could occupy all but one of the SMs on a GPU, and the blocks of Kernel 1 could be scheduled strictly on a single SM. And there are an infinite number of other possible scenarios, given the level of description in the general case you gave.
I will probably refrain from answering any further questions in this vein. I would encourage you to recast your thinking into the idea that the GPU is a throughput machine. You give it some work to do, and it will proceed through that work at the fastest possible rate. In my opinion, there is not enough published information in CUDA documentation to go substantially beyond that with a high degree of precision.
Yes, you can restrict how many threads are available for use by a particular client, or server, in this fashion.
Control over the active thread percentage (per client, or per server) is the only tool I am aware of at this time to manage access between clients.
All configuration settings available to you are documented in the MPS doc.
Ok, I would like to ask you a final clarification.
When you refer to the clients, you mean the different processes that will run concurrently, right?
I have seen this term many times in the MPS doc (the doc you posted too), but I did not find its meaning clear.
So by clients do we mean the processes we will run concurrently, or the users that will run the MPS daemon?
Yes, the doc is clear in that there will only be one user running the MPS daemon at a time. If another user tries to start the MPS daemon, the already running one is killed and a new one is spawned.
I’m currently experimenting with the CUDA MPS server.
I have 2 questions:
Can I launch the CUDA MPS server without having root privileges?
Why would I need to login as root to be able to start the CUDA MPS server?
If I have 2 jobs sharing the same GPU, with job1 using the MPS server and job2 not using it, I want to give job2 a higher priority over job1 in running its kernels on the GPU. For instance, if one process from job1 tries to access the GPU at the same time as a process from job2, I want the process from job2 to be given priority to access the GPU first. Is there a way to handle these types of cases?
Also, at the same link, the expected setup is to set exclusive-process mode. This also requires root privileges.
I don’t know why. If you’d like to see a change to any aspect of CUDA behavior, please file a bug, the instructions are linked in a sticky post at the top of this sub-forum.
I don’t think that is going to work. Proper usage of MPS requires the GPU be in exclusive process mode. Refer to the same link I previously linked. Exclusive process mode means only one user process can access the GPU (that would be the MPS server) and other processes/jobs have no access to it.
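For context, the recommended setup puts the GPU into exclusive-process compute mode before starting the daemon, which is part of why root is involved; a typical sequence looks like this (device 0 is illustrative):

```shell
# Requires root: set the compute mode so that only one process
# (intended to be the MPS server) can create a context on the GPU
sudo nvidia-smi -i 0 -c EXCLUSIVE_PROCESS

# Start the MPS control daemon; client processes then funnel
# their work through the MPS server
nvidia-cuda-mps-control -d
```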
MPS does provide some mechanisms to manage resource sharing among processes. Please read the above linked MPS doc in its entirety for further information.
If the CUDA MPS server requires exclusive access to the GPU, then I assume using CUDA streams (one stream per thread) appears more appealing in this case, because different CUDA streams from job1 and different CUDA streams from job2 could presumably co-exist and use the same GPU. Is that correct?
All CUDA code is using streams. Period. One CUDA code is not differentiated from another CUDA code in that one is using streams and one isn’t.
There’s not enough information to make a determination which would be better.
Could job 1 and job 2 share the same GPU if only one was using CUDA MPS? No I don’t think so.
Could job 1 and job 2 share the same GPU if both are using MPS and both are originating from the same user? Yes, I think so.
Notice I didn’t say anything about streams in either of those questions/answers.
MPS is substantially documented. There are numerous questions and answers already about it on various forums. I can’t tell you which of two things would be better based on a few sentences description. But if you actually implement things and try out things with MPS, I think you will learn a lot, and will be able to draw your own conclusions. Benchmarking is far better than advice on the internet, for several reasons. That’s just my opinion, of course, like nearly everything else I type.
I have another question regarding CUDA MPS. If there is no resource sharing between MPS client processes, i.e., MPS assigns a separate address space to each process in device memory, what happens if we create, say, 48 contexts (as is allowable on Volta) and we run out of device memory? Do I have to do my own memory management to prevent device OOM?
I knew all that of course. Thank you for the clarification anyway. I think my initial source of confusion was due to my misunderstanding of the environment variable CUDA_MPS_ACTIVE_THREAD_PERCENTAGE. I guess it is all clear now. Thank you. Ali
I have implemented CUDA MPS in our product and have run a number of benchmarks. I have also implemented equivalent code using CUDA streams with threads. My conclusion was that CUDA MPS was relatively faster but required much more GPU device memory than its CUDA streams counterpart. Based on my reading of the NVIDIA docs, there should be approximately 300 MB for each MPS context, but I found that the memory requirements were far more than just that. My theory is that CUDA MPS does not share GPU buffers at runtime and thus the memory requirements are much higher.
I’d appreciate it very much if anyone could explain to me why CUDA MPS requires so much more memory compared to CUDA streams.
There isn’t anything shared between separate MPS processes (unless you build the “sharing” in yourself, such as CUDA IPC). So if process A does a cudaMalloc of 1GB, and process B does a cudaMalloc of 1GB, the GPU memory utilization just for those 2 ops will be 2GB.
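If sharing is genuinely needed, CUDA IPC lets one process export an allocation and another process map it, instead of each process allocating its own copy. A minimal sketch of the exporting side, assuming a CUDA-capable device (how the handle reaches the other process, e.g. via a pipe or file, is left out):

```cuda
#include <cuda_runtime.h>

int main() {
    // Process A: allocate once and export a handle, rather than
    // having process B allocate a second copy of the same data
    void *buf = nullptr;
    cudaMalloc(&buf, 1 << 20);  // 1 MiB, arbitrary example size

    cudaIpcMemHandle_t handle;
    cudaIpcGetMemHandle(&handle, buf);
    // ... send `handle` to process B via any host IPC mechanism ...

    // Process B would then call:
    //   void *mapped;
    //   cudaIpcOpenMemHandle(&mapped, handle,
    //                        cudaIpcMemLazyEnablePeerAccess);
    // after which both processes reference the same device buffer.
    cudaFree(buf);
    return 0;
}
```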
Note that this is entirely consistent with how operating systems handle threads and processes: Resource ownership is per-process. Threads share the resources of the process they belong to.
Simply sharing resources between processes would be a hangar-door sized security hole in the operating system. An OS may offer mechanisms that allow a few specific instances of resource sharing between processes in a limited and secure manner. For example, in Linux a parent process can share a socket with a child process it forked.