Fine-grained kernel scheduling with MPS

Hi,
I am working on using MPS to run multiple TensorFlow inference jobs on the same GPU. For my use case, when the GPU is shared by more than one process, I sometimes need to prioritize the execution of one process's kernels over the other's. Is this supported?

To explain the problem in more detail, consider an example in which we have two processes, p1 and p2 (each with just one kernel execution stream) sharing a GPU.

Scenario: There are one or more kernels in the ready queue for both p1 and p2.
Default MPS behavior (my understanding):
If there are enough resources, kernels from both p1 and p2 execute concurrently.

Desired behavior:
Ability to decide, based on priority, whether to:

  1. Execute p1's kernel first, then p2's.
  2. Execute p2's kernel first, then p1's.
  3. In case there are enough resources, execute kernels from both p1 and p2 concurrently.

If this kind of customized scheduling is not supported, it would be great if someone could point out what code changes would be needed to make it work.

Thanks!

MPS doesn’t support priority assignments to kernels or processes. MPS does allow a limited amount of resource reservation; for that, I suggest reading the MPS doc or this.


Thanks for the response!

Is MPS open source? Can I modify MPS to add this functionality?
If it is not open source, is there a way to add a wrapper on top to enable this functionality?

No.

No.

Nothing comes to mind. Others may have ideas.


Does stream priority work across multiple processes? CUDA Runtime API :: CUDA Toolkit Documentation

No. That priority is only applied within a process.
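For reference, here is a minimal sketch of how stream priorities are assigned within a single process, using the standard runtime API (`cudaDeviceGetStreamPriorityRange` and `cudaStreamCreateWithPriority`). The kernel and buffer names are just placeholders for illustration; the priority only influences scheduling among streams created by this same process.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Trivial kernel used to occupy a stream.
__global__ void busyKernel(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * 2.0f + 1.0f;
}

int main() {
    // Query the valid priority range for this device; numerically
    // lower values mean higher priority.
    int leastPriority, greatestPriority;
    cudaDeviceGetStreamPriorityRange(&leastPriority, &greatestPriority);

    cudaStream_t hiStream, loStream;
    cudaStreamCreateWithPriority(&hiStream, cudaStreamNonBlocking, greatestPriority);
    cudaStreamCreateWithPriority(&loStream, cudaStreamNonBlocking, leastPriority);

    const int n = 1 << 20;
    float *a, *b;
    cudaMalloc(&a, n * sizeof(float));
    cudaMalloc(&b, n * sizeof(float));

    // Work in hiStream may be scheduled ahead of work in loStream,
    // but only relative to other streams owned by this process.
    busyKernel<<<(n + 255) / 256, 256, 0, loStream>>>(a, n);
    busyKernel<<<(n + 255) / 256, 256, 0, hiStream>>>(b, n);

    cudaDeviceSynchronize();
    cudaFree(a);
    cudaFree(b);
    cudaStreamDestroy(hiStream);
    cudaStreamDestroy(loStream);
    printf("priority range: %d (lowest) .. %d (highest)\n",
           leastPriority, greatestPriority);
    return 0;
}
```

There is no analogous API for assigning a priority to another process's streams, which is why this mechanism doesn't help in the two-process scenario above.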


I have another follow-up question. Suppose a process p1 is running on a 4-GPU machine and is using all 4 GPUs, but I want to limit its thread usage (i.e., CUDA_MPS_ACTIVE_THREAD_PERCENTAGE) on just one of the GPUs instead of all of them. Is this supported?
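For context, this is how I currently set the limit (the `./p1` and `./p2` binaries are placeholders for my inference processes); as far as I can tell, the environment variable applies to the whole client process rather than to one specific GPU:

```shell
# Start the MPS control daemon (assumes exclusive access to the GPUs)
nvidia-cuda-mps-control -d

# Limit p1 to ~50% of the available SMs. This appears to be
# process-wide; I have not found a way to scope it to a single GPU.
CUDA_MPS_ACTIVE_THREAD_PERCENTAGE=50 ./p1 &

# A process launched without the variable runs unrestricted.
./p2 &
```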

In my recent test, I had two processes, with Process 1 featuring two separate threads using different CUDA streams, one with a high priority and the other with a low priority. Process 2, on the other hand, had just one thread with a low-priority CUDA stream. Both processes were started concurrently with MPS enabled. After analyzing data from nsys, I observed that the high-priority CUDA stream not only had an effect on the low-priority CUDA stream within the same process, but it also impacted the CUDA stream performance in Process 2.

Therefore, my answer to the query, “Does stream priority work across multiple processes?” is YES. I would appreciate it if you could verify my findings with your internal team. Thank you.

I would expect any activity in process 1 to affect activity in process 2 in an MPS setting. AFAIK the interprocess behavior is unspecified (apart from what you can find in the MPS doc), and I’m not going to post material non-public information here. That is not how it works, nor is it my function or purpose on these forums.

You can always request an update to the CUDA documentation by filing a bug. Instructions are linked in a sticky post at the top of this sub-forum.