Hi
Is there any way to analyze the concurrency of kernels on a GPU device?
When I look at the kernels' timestamps, it seems that the device first runs KERNEL1 and then KERNEL2, which suggests KERNEL1 occupies all SMs and then KERNEL2 occupies all SMs.
I would like to know: with the MPS feature turned on, is there a way to check which kernels are running on the device at a given time? In the final report I only see launch times; for example, the launch times of KERNEL1 and KERNEL2 are both 1.32243.
I haven't found an answer to that. Any thoughts?
Nsight Systems has the ability to do MPI (and MPS) profiling, and AFAIK when doing so you should be able to observe concurrency, if it is happening. The timelines across processes should be harmonized.
The "launch" of a kernel, as observed at the API level, does not necessarily indicate the beginning of that kernel's execution from the device's perspective. So if it were me, I would use a profiler like Nsight Systems to gain insight here, from the timeline view.
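For illustration, here is a minimal sketch (the kernel bodies, names, and sizes are made up) of the pattern the profiler would visualize: two kernels launched into separate non-default streams of one process. Running it under Nsight Systems (e.g. `nsys profile ./app`) and comparing the kernels' rows on the CUDA timeline shows whether the executions actually overlap, which the launch timestamps alone cannot tell you.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void kernel1(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= 2.0f;
}

__global__ void kernel2(float *y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] += 1.0f;
}

int main() {
    const int n = 1 << 20;
    float *x, *y;
    cudaMalloc(&x, n * sizeof(float));
    cudaMalloc(&y, n * sizeof(float));

    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    // Launching into different non-default streams makes overlap *possible*;
    // whether it actually happens depends on free SM resources at runtime.
    kernel1<<<(n + 255) / 256, 256, 0, s1>>>(x, n);
    kernel2<<<(n + 255) / 256, 256, 0, s2>>>(y, n);

    cudaDeviceSynchronize();
    // Inspect with: nsys profile ./app

    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
    cudaFree(x);
    cudaFree(y);
    return 0;
}
```

Note that if the first kernel's grid is large enough to fill the device, the second kernel will serialize behind it regardless of streams, which matches the back-to-back behavior described above.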
Based on the documentation, MPS deals with MPI processes. Is it possible to control the number of SMs prior to launching a kernel?
I assume you mean limiting the residency of your kernel's blocks to a certain number of SMs. No, CUDA provides no mechanism to do that. However, MPS provides a client/resource partitioning capability. Please read the MPS doc.
I have some questions after reading the manual.
1- It seems that the variable controlling each client's share of the GPU is CUDA_MPS_ACTIVE_THREAD_PERCENTAGE. So, if I have two MPI processes, do I have to set that variable to 50? Is that correct? (See the sketch after this list.)
2- I read your answers here and here. It seems that MPS works with multiple processes offloading work to the GPU, e.g. two processes, each with one kernel. The question is: what about one process with two kernels? For example, a machine learning program has one Python process with multiple kernels running on the GPU. Is MPS beneficial in this case?
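For what it's worth, here is a minimal sketch of the setup in question 1, assuming an MPS control daemon is already running; the launcher commands in the comments and the binary name `./app` are illustrative, not prescribed. If I read the MPS doc correctly, under Volta+ MPS a client limited by CUDA_MPS_ACTIVE_THread_PERCENTAGE sees a correspondingly reduced multiprocessor count, which this program queries:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Illustrative only. Run under MPS with, e.g.:
//   nvidia-cuda-mps-control -d                     # start the MPS daemon
//   CUDA_MPS_ACTIVE_THREAD_PERCENTAGE=50 ./app     # limit this client to ~50%
int main() {
    int dev = 0;
    int smCount = 0;
    cudaGetDevice(&dev);
    // Under Volta+ MPS with an active-thread-percentage limit, the SM count
    // reported to this client should reflect its partition (per the MPS doc).
    cudaDeviceGetAttribute(&smCount, cudaDevAttrMultiProcessorCount, dev);
    printf("SMs visible to this client: %d\n", smCount);
    return 0;
}
```

Running it once with and once without the environment variable set would confirm whether the partition is actually in effect for each MPI rank.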