Hello, I am wondering how the GPU is shared when multiple processes launch kernels at the same time while the GPU is using the Multi-Process Service. Is it time sharing or spatial sharing?
For example, suppose one program needs 25% of the GPU's resources and runs for 5 minutes of GPU processing, and another program needs 30% of the GPU's resources and runs for 7 minutes of GPU processing. If I run these two programs at the same time on the same GPU with the Multi-Process Service, what will GPU resource utilization look like, and how long will it take until both programs finish?
Hi, txbob. I am still confused about this case. What if the kernels come from entirely different processes? Can they run concurrently in the MPS case?
Also, I am confused about the meaning of “concurrently”. Is it time sharing or spatial sharing of the GPU? Can the kernels run on the GPU at the same instant, or does the GPU simply execute them in a round-robin fashion that merely makes them appear to run together?
Yes, that is the whole point of MPS. However, in order for the kernels to run concurrently, they must have relatively low resource utilization (e.g. blocks, threads, registers, shared memory, etc.) so that they can be co-resident.
In the non-MPS case, but in default compute mode, the GPU will practice time sharing, executing the requested kernels from different processes to completion, in round-robin fashion. First kernel A from process A will execute to completion, then kernel B from process B will execute to completion, and so on. I don’t usually apply the words “concurrent kernels” in this case, but at a much larger scope we could say that the two applications/processes are running “concurrently” e.g. from the point of view of a utility like top.
In the MPS case, the kernels can be witnessed to run concurrently. That is (again, subject to the limits/requirements indicated above) the kernels will be using execution resources from the GPU at the same instant, or in the same clock cycle. Subject to the same limitations, the cuda concurrentKernels sample code behavior (which executes from within a single process) could equivalently happen from separate/multiple processes, if MPS were in effect. (I’m not however suggesting that sample code is already set up to demonstrate this. I’m using it as a conceptual example of desired behavior.)
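To make the concurrency idea concrete, here is a minimal sketch in the spirit of the concurrentKernels sample: two kernels on separate streams within one process. The kernel and stream names here are my own invention, not from the sample; under MPS the same kind of overlap can occur for kernels launched from separate processes, provided each kernel's resource footprint is small enough for co-residency.

```cuda
// Sketch: two low-occupancy kernels overlapping on separate streams.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void spin(long long cycles) {
    // Busy-wait so each kernel occupies the GPU long enough to observe overlap.
    long long start = clock64();
    while (clock64() - start < cycles) { }
}

int main() {
    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    // One small block each: low resource usage, so both kernels can be
    // resident on the GPU at the same instant.
    spin<<<1, 1, 0, s1>>>(1000000000LL);
    spin<<<1, 1, 0, s2>>>(1000000000LL);

    cudaDeviceSynchronize();
    printf("done\n");
    return 0;
}
```

A profiler timeline of this program should show the two kernels executing in overlapping time ranges rather than back to back.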
You can confirm much of this with careful profiling and execution experiments. Here is an example of one such “experiment”:
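One way to structure such an experiment (a sketch under my own assumptions, not necessarily the original test case): build a single-kernel program that runs for a few seconds, launch two instances of it simultaneously, and compare wall-clock times with MPS off versus on.

```cuda
// spin.cu - each process launches one long-running, low-occupancy kernel.
#include <cuda_runtime.h>

__global__ void spin(long long cycles) {
    long long start = clock64();
    while (clock64() - start < cycles) { }
}

int main() {
    // Roughly a few seconds on most GPUs; adjust for your clock rate.
    spin<<<1, 1>>>(5000000000LL);
    cudaDeviceSynchronize();
    return 0;
}

// Build:   nvcc -o spin spin.cu
// MPS off: time ./spin & time ./spin
//          -> the kernels serialize; one process takes roughly twice as long.
// MPS on:  start the daemon first (nvidia-cuda-mps-control -d), then repeat;
//          -> both finish in roughly the single-instance time, since the
//             kernels overlap on the GPU.
```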
The experiment works for me. Why it’s not working for you, I can’t say. You’ve given essentially zero information about what your test case was.
Yes, it is the case that (ordinarily) kernels from separate processes (in separate contexts) cannot run concurrently. The “magic” of MPS is that it acts as a funnel, accepting work (CUDA requests) from separate processes, and running them in the same context on the device. This allows kernels from separate processes to run concurrently, because they are in fact running in the same context on the device. That context is owned and managed by the MPS daemon.
I’ve given about as much information as I have on the subject. If you have further objections or simply don’t believe me, I won’t be able to respond further. This information that I am giving here is available from other sources if you want to do research. For example I suggest you start by carefully reading the MPS documentation:
I have an application where we use cuvid/CUDA to decode video (using the libavcodec master branch) and then transfer the video to GL for rendering. In the end we want to transfer it back to CUDA to encode with cuvid, i.e. provide a decode-render-encode pipeline for high-performance server-side video processing.
I realized that while the current pipeline only utilizes about 20% of the GPU (as reported by nvidia-smi), it does not parallelize: running two instances of the pipeline makes each run at half speed. The bottleneck seems to be mostly the GL rendering part of the pipelines.
I was hoping MPS would solve this, but the interop seems to be broken there; I get “operation not supported” when calling cuGraphicsGLRegisterImage.
Is there any way around this? Is there any hope that GL draw calls will benefit from MPS? Will this change with Pascal/Volta? (I think we are running Maxwell now.)