Concurrent execution with MPS

mahmood.nt · November 11, 2020, 4:53pm

Hi
Is there any way to analyze the concurrency of kernels on a GPU device?
When I look at the kernels’ timestamps, it seems that the device first runs KERNEL1 and then KERNEL2. So that means KERNEL1 takes all SMs and then KERNEL2 takes all SMs.
I would like to know whether, with the MPS feature turned on, there is a way to check which kernels are running on the device at a given time. I mean, in the final report, I see that the launch times of KERNEL1 and KERNEL2 are both 1.32243, for example.

I haven’t found an answer for that. Any thoughts?

Robert_Crovella · November 11, 2020, 5:01pm

Nsight Systems has the ability to do MPI (and MPS) profiling, and AFAIK when doing so, you should be able to observe concurrency, if it is happening. The timelines across processes should be harmonized.

The “launch” of a kernel, as observed at the API level, does not necessarily indicate the beginning of the execution of that kernel from a device perspective. So if it were me, I would use a profiler like Nsight Systems and look at the timeline view to gain insight here.
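Once device-side start and end times are available for each kernel, checking for concurrency comes down to testing whether the execution intervals overlap. A minimal sketch of that check, assuming the times have already been read off a profiler timeline; the function name and the numbers below are made up for illustration and are not real profiler output:

```python
from itertools import combinations


def overlapping_kernels(intervals):
    """Return (name_a, name_b, overlap) for each pair of kernels whose
    device-side execution intervals intersect.

    `intervals` is a list of (name, start, end) tuples in seconds.
    """
    overlaps = []
    for (name_a, start_a, end_a), (name_b, start_b, end_b) in combinations(intervals, 2):
        # Two intervals overlap when the later start precedes the earlier end.
        overlap = min(end_a, end_b) - max(start_a, start_b)
        if overlap > 0:
            overlaps.append((name_a, name_b, overlap))
    return overlaps


# Hypothetical timings: same API launch time, different device-side behavior.
serialized = [("KERNEL1", 1.32243, 1.40000), ("KERNEL2", 1.40000, 1.47000)]
concurrent = [("KERNEL1", 1.32243, 1.40000), ("KERNEL2", 1.35000, 1.42000)]

print(overlapping_kernels(serialized))  # back-to-back kernels: no pairs reported
print(overlapping_kernels(concurrent))  # one overlapping pair reported
```

Identical launch times at the API level are compatible with either case, which is why the device-side intervals are the ones to compare.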

mahmood.nt · November 11, 2020, 5:07pm

Based on the documentation, MPS deals with MPI processes. Is it possible to control the number of SMs a kernel uses prior to launching it?

Robert_Crovella · November 11, 2020, 5:37pm

I assume you mean limiting the residency of your kernel blocks to a certain number of SMs. No, CUDA provides no mechanism to do that. However, MPS provides a client/resource partitioning capability. Please read the MPS doc.

mahmood.nt · November 11, 2020, 8:13pm

I have some questions after reading the manual.
1- It seems that the variable that controls the share of the GPU each client gets is CUDA_MPS_ACTIVE_THREAD_PERCENTAGE. So, if I have two MPI processes, I have to set that variable to 50. Is that correct?

2- I read your answers here and here. It seems that MPS works with multiple processes offloaded to the GPU, e.g. two processes, each with one kernel. The question is, what about one process with two kernels? For example, a machine learning program has one Python process with multiple kernels running on the GPU. Is MPS beneficial in this case?

Robert_Crovella · November 11, 2020, 8:41pm

That isn’t the only way to do it. That is one possible approach. Please read the entire doc carefully. If you just want to read about resource provisioning, please read section 2.3.5.2 and all sections referenced from it. For example, if you wanted to give one client 70% and the other client 30%, you could use the MPS control utility to set the per-client limit at 70%, and then use the CUDA_MPS_ACTIVE_THREAD_PERCENTAGE env var to restrict one of your clients to 30%. Or you could just give them both access to 70% (which would guarantee at least 30% for either one). It depends on what you want to achieve.
I don’t think MPS provides any particular benefits to a single process. There is no way to do fine-grained resource provisioning within a single process that I know of at this time (ignoring A100 MIG mode), but you have some crude control over it with CUDA stream priorities.
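
The 70%/30% split described above can be sketched from a launcher script. This is only an illustration, assuming the 70% per-client limit has already been set through the MPS control utility and that the variable is in each client's environment before it starts using the GPU. The helper names `client_env` and `guaranteed_share` and the client binary names are hypothetical; only CUDA_MPS_ACTIVE_THREAD_PERCENTAGE comes from the MPS doc.

```python
import os

MPS_ENV_VAR = "CUDA_MPS_ACTIVE_THREAD_PERCENTAGE"


def client_env(percentage, base_env=None):
    """Return a copy of the environment with the MPS thread percentage set
    for one client process (hypothetical helper, not part of any NVIDIA API)."""
    if not 0 < percentage <= 100:
        raise ValueError("percentage must be in the range (0, 100]")
    env = dict(os.environ if base_env is None else base_env)
    env[MPS_ENV_VAR] = str(percentage)
    return env


def guaranteed_share(other_client_limit):
    """With two clients, whatever the other client cannot take stays available:
    a 70% cap on the other client leaves at least 30% for this one."""
    return 100 - other_client_limit


# Client B is restricted to 30%; client A simply inherits the 70% limit.
env_b = client_env(30)
print(env_b[MPS_ENV_VAR])
print(guaranteed_share(70))

# Launching the clients would then look like this (binary names are made up):
#   import subprocess
#   subprocess.Popen(["./client_a"])             # uses the 70% default limit
#   subprocess.Popen(["./client_b"], env=env_b)  # restricted to 30%
```

The same `guaranteed_share` arithmetic covers the second option in the post: capping both clients at 70% leaves each of them at least 30%.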
