Concurrent execution with MPS

Is there any way to analyze the concurrency of kernels on a GPU device?
When I look at the kernels’ timestamps, it seems that the device first runs KERNEL1 and then KERNEL2. So that means KERNEL1 takes all SMs and then KERNEL2 takes all SMs.
I would like to know whether, with the MPS feature turned on, there is a way to check which kernels are offloaded to the device at a given time. I mean, in the final report, I see that the launch time of KERNEL1 and KERNEL2 is 1.32243, for example.

I haven’t found an answer for that. Any thoughts?

Nsight Systems has the ability to do MPI (and MPS) profiling, and AFAIK when doing so, you should be able to observe concurrency, if it is happening. The timelines across processes should be harmonized, i.e. shown on a common time axis.

The “launch” of a kernel, as observed at the API level, does not necessarily indicate the beginning of the execution of that kernel, from a device perspective. So if it were me, I would use a profiler like Nsight Systems to gain insight here, from the timeline view.
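To make the distinction concrete, here is a minimal sketch of how one might check for overlap once device-side start and end times of each kernel are available (for example, read off the profiler timeline). The `(name, start, end)` tuple layout and the function name `overlapping_pairs` are assumptions made for illustration; this is not an Nsight Systems export format, and the numbers are made up. API-level launch times should not be fed into it, for the reason given above.

```python
from typing import List, Tuple

# Hypothetical record: (kernel name, device start time, device end time), in seconds.
KernelSpan = Tuple[str, float, float]


def overlapping_pairs(spans: List[KernelSpan]) -> List[Tuple[str, str, float]]:
    """Return (name_a, name_b, overlap_seconds) for each pair of kernels
    whose device execution intervals intersect."""
    ordered = sorted(spans, key=lambda s: s[1])  # sort by start time
    result = []
    for i, (name_a, start_a, end_a) in enumerate(ordered):
        for name_b, start_b, end_b in ordered[i + 1:]:
            if start_b >= end_a:
                # Sorted by start time, so no later kernel can overlap kernel a.
                break
            # start_b >= start_a here, so the overlap begins at start_b.
            result.append((name_a, name_b, min(end_a, end_b) - start_b))
    return result


# Made-up timings: back-to-back execution versus partial overlap.
serial = [("KERNEL1", 0.0, 1.0), ("KERNEL2", 1.0, 2.0)]
concurrent = [("KERNEL1", 0.0, 2.0), ("KERNEL2", 1.0, 3.0)]
print(overlapping_pairs(serial))      # []
print(overlapping_pairs(concurrent))  # [('KERNEL1', 'KERNEL2', 1.0)]
```

An empty result for two kernels would be consistent with the serialized behaviour described in the question; a non-empty result would suggest they really did share the device for some interval.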

Based on the documentation, MPS deals with MPI processes. Is it possible to control the number of SMs prior to launching the kernel?

I assume you mean limiting the residency of your kernel blocks to a certain number of SMs. No, there are no mechanisms that CUDA provides to do that. However, MPS provides a client/resource partitioning capability. Please read the MPS doc.

There are some questions as I read the manual.
1- It seems that the variable to control the number of clients is CUDA_MPS_ACTIVE_THREAD_PERCENTAGE. So, if I have two MPI processes, I have to set that variable to 50. Is that correct?

2- I read your answers here and here. It seems that MPS works with multiple processes offloaded to the GPU, e.g. two processes, each with one kernel. The question is, what about one process with two kernels? For example, a machine learning program has one Python process with multiple kernels running on the GPU. Is MPS beneficial in this case?

  1. That isn’t the only way to do it. That is one possible approach. Please read the entire doc carefully. If you just want to read about resource provisioning, please read section and all sections referenced from it. For example, if you wanted to give one client 70% and the other client 30%, you could use the MPS control utility to set the per-client limit at 70%, and then use the CUDA_MPS_ACTIVE_THREAD_PERCENTAGE env var to restrict one of your clients to 30%. Or you could just give them both access to 70% (which would guarantee at least 30% for either one). It depends on what you want to achieve.

  2. I don’t think MPS provides any particular benefits to a single process. There is no way to do fine-grained resource provisioning within a single process that I know of at this time (ignoring A100 MIG mode), but you have some crude control over it with CUDA stream priorities.
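For point 1, the arithmetic and the per-client environment variable can be sketched roughly as follows. This is only an illustration under assumptions: the helper names `mps_client_env` and `worst_case_share` are hypothetical, and the share calculation simply restates the reasoning above (whatever the other clients cannot claim is left for you); it is not a model of how the MPS scheduler actually places work.

```python
import os
from typing import Dict, Iterable, Optional


def mps_client_env(percentage: int,
                   base: Optional[Dict[str, str]] = None) -> Dict[str, str]:
    """Return a copy of the environment with CUDA_MPS_ACTIVE_THREAD_PERCENTAGE
    set, intended for launching a single MPS client process."""
    if not 0 < percentage <= 100:
        raise ValueError("percentage must be in the range (0, 100]")
    env = dict(os.environ if base is None else base)
    env["CUDA_MPS_ACTIVE_THREAD_PERCENTAGE"] = str(percentage)
    return env


def worst_case_share(own_limit: int, other_limits: Iterable[int]) -> int:
    """Share (in percent) left to a client if every other client uses its
    full limit. Assumption: the limits are the only constraint considered."""
    left_over = max(0, 100 - sum(other_limits))
    return min(own_limit, left_over)


# Both clients capped at 70%: either one is still left with at least 30%.
print(worst_case_share(70, [70]))  # 30
# One client at 70%, the other restricted to 30%.
print(worst_case_share(70, [30]))  # 70
print(worst_case_share(30, [70]))  # 30
```

The dictionary returned by `mps_client_env` could be passed as the `env=` argument of `subprocess.Popen` when starting each client, so that the variable is already in place when the process starts using CUDA.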