Question about GPU sharing with the Multi-Process Service (MPS)

Hello, I am wondering how the GPU is shared when multiple processes launch kernels at the same time while the GPU is using the Multi-Process Service (MPS). Is it time sharing or spatial sharing?

For example, suppose one program needs 25% of the GPU's resources and runs for 5 minutes of GPU processing, while another needs 30% of the GPU's resources and runs for 7 minutes. If I run these two programs at the same time on the same GPU with MPS, what will the GPU resource utilization look like, and how long will it take until both programs finish?

MPS essentially converts the kernel launches to a scenario as if they were launched from the same process.

So your questions are answered by considering the concurrent kernels case.

If your kernels can run concurrently in a single-process case, they should be able to run concurrently in the MPS case.

Hi, txbob. I am still confused about that case. What if the kernels come from entirely different processes? Can they still run concurrently in the MPS case?

Also, I am confused about the meaning of “concurrently”. Is it time sharing or spatial sharing of the GPU? Can the kernels really run on the GPU at the same time? Or does the GPU execute them with a round-robin strategy that merely makes it look as if they run together?

Yes, that is the whole point of MPS. However, in order for the kernels to run concurrently, they must have relatively low resource utilization (e.g. blocks, threads, registers, shared mem, etc.) so that they can be co-resident.
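The co-residency requirement can be illustrated with a minimal single-process sketch. The kernel names, launch dimensions, and iteration counts below are arbitrary placeholders, not from the thread; the point is only that both kernels are deliberately small (one block, 32 threads, no shared memory), so the GPU has spare resources to hold both at once, and each is launched into its own non-default stream:

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Two deliberately lightweight kernels: one block each, few registers,
// no shared memory, so both can be resident on the GPU simultaneously.
__global__ void spinA(float *out, int iters) {
    float v = 0.0f;
    for (int i = 0; i < iters; ++i) v += sinf((float)i);
    out[0] = v;
}

__global__ void spinB(float *out, int iters) {
    float v = 0.0f;
    for (int i = 0; i < iters; ++i) v += cosf((float)i);
    out[0] = v;
}

int main() {
    float *a, *b;
    cudaMalloc(&a, sizeof(float));
    cudaMalloc(&b, sizeof(float));

    // Separate non-default streams are required for concurrency within
    // a single process; MPS extends the same effect across processes.
    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    spinA<<<1, 32, 0, s1>>>(a, 1 << 20);
    spinB<<<1, 32, 0, s2>>>(b, 1 << 20);
    cudaDeviceSynchronize();  // a profiler timeline should show the two
                              // kernels' execution ranges overlapping

    printf("done\n");
    cudaFree(a);
    cudaFree(b);
    return 0;
}
```

If either kernel instead requested enough blocks/registers/shared memory to fill the device by itself, the second launch would have to wait, even with streams or MPS in play.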

In the non-MPS case, but in default compute mode, the GPU will practice time sharing, executing the requested kernels from different processes to completion, in round-robin fashion. First kernel A from process A will execute to completion, then kernel B from process B will execute to completion, and so on. I don’t usually apply the words “concurrent kernels” in this case, but at a much larger scope we could say that the two applications/processes are running “concurrently” e.g. from the point of view of a utility like top.

In the MPS case, the kernels can be witnessed to run concurrently. That is (again, subject to the limits/requirements indicated above) the kernels will be using execution resources from the GPU at the same instant, or in the same clock cycle. Subject to the same limitations, the CUDA concurrentKernels sample code behavior (which executes from within a single process) could equivalently happen from separate/multiple processes, if MPS were in effect. (I’m not however suggesting that sample code is already set up to demonstrate this. I’m using it as a conceptual example of desired behavior.)
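For reference, a typical way to set this up looks like the following sketch. The commands are the standard MPS control-daemon commands from NVIDIA's MPS documentation; `./app_a` and `./app_b` are hypothetical client programs standing in for your two processes:

```shell
# (optional, recommended by the MPS docs) put the GPU in
# EXCLUSIVE_PROCESS mode so only the MPS server can create a context
nvidia-smi -i 0 -c EXCLUSIVE_PROCESS

# Start the MPS control daemon (pre-Volta MPS serves one user at a time)
export CUDA_VISIBLE_DEVICES=0
nvidia-cuda-mps-control -d

# Launch the client processes; their kernel launches are funneled
# through the MPS server's single GPU context
./app_a &
./app_b &
wait

# Shut the daemon down when finished
echo quit | nvidia-cuda-mps-control
```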

You can confirm much of this with careful profiling and execution experiments. Here is an example of one such “experiment”:

Hi, txbob, thanks for your detailed reply. I wonder whether MPS is open source? I want to read its source code but cannot find it anywhere.

In fact, I have run that experiment too.

But the result I got is different from the one shown there. It still shows the following even in the MPS case:

kernel duration: 6.409399s
kernel duration: 12.078304s

Hi, txbob, I noticed that the programming guide says that a kernel from one CUDA context cannot execute concurrently with a kernel from another CUDA context. Kernels from entirely different processes should also belong to different CUDA contexts, so can they still be executed concurrently?

MPS is not open source.

The experiment works for me. Why it’s not working for you, I can’t say. You’ve given essentially zero information about what your test case was.

Yes, it is the case that (ordinarily) kernels from separate processes (in separate contexts) cannot run concurrently. The “magic” of MPS is that it acts as a funnel, accepting work (CUDA requests) from separate processes, and running them in the same context on the device. This allows kernels from separate processes to run concurrently, because they are in fact running in the same context on the device. That context is owned and managed by the MPS daemon.
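One way to see this for yourself is with the profiler. The sketch below uses `nvprof`'s documented `--profile-all-processes` mode (the approach the MPS docs describe for profiling MPS clients); `./app_a` and `./app_b` are hypothetical stand-ins for your two client programs:

```shell
# Terminal 1: collect a timeline from every CUDA process on the machine
# (%p in the output name expands to each client's PID)
nvprof --profile-all-processes -o mps_timeline.%p.nvprof

# Terminal 2: run the MPS-backed clients
./app_a &
./app_b &
wait
```

Importing the resulting `.nvprof` files into the Visual Profiler should show whether the two processes' kernel execution ranges overlap in time (concurrent) or are strictly serialized (round-robin).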

I’ve given about as much information as I have on the subject. If you have further objections or simply don’t believe me, I won’t be able to respond further. This information that I am giving here is available from other sources if you want to do research. For example I suggest you start by carefully reading the MPS documentation:

Especially sections 1.3 and 2.1

As one example, note this:

"MPS allows kernel and memcopy operations from different processes to overlap on the GPU"

Overlap means concurrent. There would be no point in making such a statement otherwise, because non-overlapping kernel execution can be achieved without MPS, i.e. round-robin servicing.

Hi, txbob, thanks for your reply which helps me a lot.

I have a follow-up question.

I have an application where we use cuvid/CUDA to decode video (using the libavcodec master branch) and then transfer the video to GL for rendering. In the end we want to transfer it back to CUDA to encode with cuvid, i.e. provide a decode-render-encode pipeline for high-performance server-side video processing.

I realized that while the current pipeline only utilizes about 20% of the GPU (as reported by nvidia-smi), it does not parallelize: running two instances of the pipeline makes each run at half speed. It seems to be mostly the GL rendering pipeline.

I was hoping for MPS to solve this, but GL interop seems to be broken there; I get “operation not supported” when issuing cuGraphicsGLRegisterImage.

Is there any way around this? Is there any hope that GL draw calls will benefit from MPS? Will this change with Pascal/Volta? (I think we are running Maxwell now.)