I want to try experiments with CUDA MPS (Multi-Process Service). I want to launch many CUDA kernels on the GPU simultaneously (concurrently), such that each kernel's end-to-end running time when run concurrently with the others is almost the same as when it runs alone.
For example, I have a kernel, KernA, and I would like to launch many instances of it on the GPU simultaneously. I want to manually control (even hard-coding it into the .cu program is fine) the SM utilization of each instance. If I launch two instances of KernA simultaneously, I want each instance on its own to have an SM utilization of 50%. [The input size does not matter to me; all I want is a kernel launch where I can tune the SM utilization of each instance.] That way, when the two are run simultaneously on the GPU using CUDA MPS, 50% from each would roughly add up to 100% of the GPU, and hence running each instance alone would take about the same time as running the 2 kernels together.
Similarly, if I want to launch 3 kernels simultaneously, I want each instance to have an SM utilization of around 33%, so that the 3 kernels together make up a total of 100% of the GPU.
Also, if I want to launch 4 kernels simultaneously, I want each instance to have an SM utilization of around 25%, so that the 4 kernels together make up a total of 100%.
…
Going by this line, if I want to launch n kernels simultaneously, I want each instance to have an SM utilization of around (100/n)%, such that the n kernels together make up a total of 100%.
The first approach I tried:
I used the CUDA runtime API `cudaOccupancyMaxActiveBlocksPerMultiprocessor` and tried out the program example given here.
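Roughly, the occupancy query I am using looks like this (a minimal sketch along the lines of the CUDA occupancy-calculation example; the concrete block-size value is just the input I vary):

```
#include <cstdio>
#include <cuda_runtime.h>

__global__ void MyKernel(float* a, float* b, float* c, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) c[idx] = a[idx] + b[idx];
}

int main()
{
    int blockSize = 32;   // the input I vary: 32, 16, 8, ...
    int numBlocks = 0;    // max active blocks per SM for this kernel / block size

    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&numBlocks, MyKernel, blockSize, 0);

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    // Occupancy in terms of warps: active warps per SM / max warps per SM.
    double occupancy = (double)(numBlocks * blockSize / prop.warpSize) /
                       (prop.maxThreadsPerMultiProcessor / prop.warpSize);

    printf("Max active blocks per SM: %d, occupancy: %.1f%%\n",
           numBlocks, occupancy * 100.0);
    return 0;
}
```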
But the problems I run into with this approach are as follows (I am using an RTX 2080 GPU):

- I take the block size as input. For block size = 32, I get 50% utilization per SM.
- If I halve the block size (16, 8, 4, 2, 1, …), the utilization per SM halves as well, i.e., 25%, 12.5%, ...
- The warp size of the GPU I am using is 32. If I use a block size smaller than the warp size, then as far as I know the GPU pads the warp with dummy threads to make up a total of 32 threads, so that these 32 threads can execute in lock-step. Because of this padding, I guess I cannot launch more than two kernels concurrently (say, with block size = 16 and 4 concurrent kernels) and expect them to get nicely packed into the GPU, as in the example I mentioned above.
- The other parameter I can tweak is the number of blocks. The API mentioned above gives the maximum number of active blocks per SM. So if, for a kernel, the API sets the `numBlocks` variable to, say, 16 and reports `Occupancy: 50%`, we cannot just set `numBlocks = 8` and launch the kernel (i.e., `kernA<<<8, 32>>>()`) and expect 25% utilization of the entire GPU, because my RTX 2080 has 46 streaming multiprocessors (SMs). Why? Since the API computes `numBlocks` for a single SM, the expected utilization over all the SMs would only be about 25/46 %.
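What I think I would need instead is to scale the per-SM figure by the number of SMs and give each of the n concurrent instances a 1/n share of the resulting grid, along these lines (a sketch; the per-SM block count of 16 is taken from the hypothetical numbers above, and whether concurrently launched instances actually pack this way under MPS is exactly what I am unsure about):

```
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    // Per-SM figure reported by the occupancy API (e.g. 16 blocks of 32 threads
    // -> 50% occupancy, as in the example above); treated as an input here.
    int blocksPerSM = 16;
    int blockSize  = 32;
    int nInstances = 4;   // number of kernel instances I plan to run concurrently

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // RTX 2080: multiProcessorCount == 46

    int fullGrid        = blocksPerSM * prop.multiProcessorCount;  // fills every SM
    int gridPerInstance = fullGrid / nInstances;                   // ~100/n % of the GPU

    printf("SMs=%d, full grid=%d blocks, grid per instance=%d blocks "
           "(launch <<<%d, %d>>> per instance)\n",
           prop.multiProcessorCount, fullGrid, gridPerInstance,
           gridPerInstance, blockSize);
    return 0;
}
```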
The second approach I tried: keeping `blockSize` fixed, I change the value of n, and hence adjust `numBlocks = n / blockSize`, manually by trial and error, while checking the instantaneous change in GPU utilization through the `nvidia-smi` command. But I do not find this approach very rigorous. And for smaller kernels, the execution time is so short that the kernel finishes without showing any change in the GPU % utilization in the `watch -n 0.1 nvidia-smi` output. (A sketch of the host code I use for this is shown after the kernel below.)
I am using this simple kernel for testing:
```
__global__ void MyKernel(float* a, float* b, float* c, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)
    {
        c[idx] = a[idx] + b[idx];
    }
}
```
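Host-side, I drive this kernel for the second approach roughly as follows (a sketch; the input size n and the fixed blockSize are placeholder values that I adjust by trial and error, and I round numBlocks up so that all n elements are covered):

```
#include <cstdio>
#include <cuda_runtime.h>

__global__ void MyKernel(float* a, float* b, float* c, int n);  // as defined above

int main()
{
    int n = 1 << 24;     // input size: the knob I vary by trial and error
    int blockSize = 32;  // kept fixed
    int numBlocks = (n + blockSize - 1) / blockSize;   // n / blockSize, rounded up

    float *a, *b, *c;
    cudaMalloc(&a, n * sizeof(float));
    cudaMalloc(&b, n * sizeof(float));
    cudaMalloc(&c, n * sizeof(float));

    MyKernel<<<numBlocks, blockSize>>>(a, b, c, n);
    cudaDeviceSynchronize();   // block until the kernel has finished

    printf("Launched %d blocks of %d threads (n = %d)\n", numBlocks, blockSize, n);

    cudaFree(a);
    cudaFree(b);
    cudaFree(c);
    return 0;
}
```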
Also, is there any specific way to confirm that my kernel, while executing, is using x% of the GPU, other than the `watch -n 0.1 nvidia-smi` command? Any profiling tool, for example `ncu`, and how would I check that with it?