I want to run some experiments with CUDA MPS (Multi-Process Service). I want to launch many CUDA kernels on the GPU simultaneously (concurrently), such that each kernel has almost the same end-to-end running time as it would if it were executed alone, without the other kernels running concurrently.
For example, suppose I have a kernel, KernA, and I would like to launch many instances of KernA on the GPU simultaneously. I want to manually control the SM utilization of each instance (even hard-coding it into the .cu program is fine). If I launch two instances of KernA simultaneously, I want each instance on its own to have an SM utilization of 50%. [The input size does not matter to me; all I want is a kernel launch where I can tune the SM utilization of each instance.] That way, when the two instances run simultaneously on the GPU under CUDA MPS, their two 50% shares roughly add up to 100% of the GPU, and hence running the two kernels together should take about the same time as running each instance alone.
Similarly, if I want to launch 3 kernels simultaneously, I want each instance to have an SM utilization of around 33% so that 3 of these kernels would make up a total of 100% of the GPU.
Also, if I want to launch 4 kernels simultaneously, I want each instance to have an SM utilization of around 25% so that 4 of these kernels would make up a total of 100%.
…
Following this pattern, if I want to launch n kernels simultaneously, I want each instance to have an SM utilization of around (100/n)%, so that the n kernels together make up a total of 100%.
The first approach I tried:
I tried using the CUDA runtime API cudaOccupancyMaxActiveBlocksPerMultiprocessor and tried out the program example given here.
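For reference, the occupancy check I am running is essentially the following (a minimal sketch adapted from that example, with error checking omitted; MyKernel is the simple kernel shown at the bottom of this post):

#include <cstdio>
#include <cuda_runtime.h>

__global__ void MyKernel(float* a, float* b, float* c, int n);   // defined at the bottom of the post

int main()
{
    int blockSize = 32;   // I vary this: 32, 16, 8, ...
    int numBlocks = 0;    // max active blocks per SM, filled in by the API
    int device = 0;
    cudaDeviceProp prop;

    cudaGetDevice(&device);
    cudaGetDeviceProperties(&prop, device);

    // How many blocks of MyKernel (with this block size and no dynamic shared
    // memory) can be resident on one SM at the same time?
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&numBlocks, MyKernel, blockSize, 0);

    // Per-SM occupancy = active warps / maximum warps per SM.
    int activeWarps = numBlocks * blockSize / prop.warpSize;
    int maxWarps = prop.maxThreadsPerMultiProcessor / prop.warpSize;
    printf("Max active blocks per SM: %d\n", numBlocks);
    printf("Occupancy: %.1f%%\n", 100.0 * (double)activeWarps / maxWarps);

    return 0;
}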
But the problems I run into with this approach are as follows (I am using an RTX 2080 GPU):
- I take the block size as input. For block size = 32, I get 50% utilization per SM.
- If I halve the block size to 16, 8, 4, 2, 1, ..., the utilization per SM halves as well, i.e., 25%, 12.5%, ...
- The warp size of my GPU is 32. If I use a block size smaller than the warp size, then, as far as I know, the GPU pads the block with dummy threads to make up a full warp of 32 threads so that they can execute in lock-step. Because of this padding, I guess I cannot launch more than two kernels concurrently (say, with block size = 16 and 4 concurrent kernels) and expect them to pack nicely into the GPU as in the example above.
- The other parameter I can tweak is the number of blocks. The API mentioned above gives the maximum number of active blocks per SM. So if, for a kernel, the API sets the numBlocks variable to, say, 16 and reports Occupancy: 50%, we cannot simply set numBlocks = 8 and launch the kernel (i.e., kernA<<<8, 32>>>()) and expect 25% utilization of the entire GPU, because my RTX 2080 has 46 streaming multiprocessors (SMs). Why? Since the API computes numBlocks for a single SM, those 8 blocks only amount to one SM's worth of work, so the expected utilization over all the SMs would be roughly 25/46 % (see the sketch after this list).
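To make the per-SM versus whole-GPU distinction concrete, this is the scaling I have in mind (just a sketch; blocksForFraction and targetFraction are hypothetical names of mine, and MyKernel is assumed to be in scope):

// Sketch: scale the per-SM result of the occupancy API up to a target share
// of the whole GPU. targetFraction would be 1.0 / n for n concurrent instances.
int blocksForFraction(int blockSize, double targetFraction)
{
    int device = 0;
    cudaDeviceProp prop;
    cudaGetDevice(&device);
    cudaGetDeviceProperties(&prop, device);

    int numBlocksPerSM = 0;   // max resident blocks on ONE SM for MyKernel at this block size
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&numBlocksPerSM, MyKernel, blockSize, 0);

    // Blocks needed to fill every SM (46 on my RTX 2080) at that occupancy.
    int blocksFullGPU = numBlocksPerSM * prop.multiProcessorCount;

    // Blocks aimed at a fraction of the whole GPU. By contrast, a launch like
    // kernA<<<8, 32>>>() provides only 8 blocks in total, a tiny fraction of
    // blocksFullGPU, which is the point of the 25/46 % arithmetic above.
    return (int)(targetFraction * blocksFullGPU);
}

Whether launching that many blocks really confines each instance to its share of the SMs is part of what I am unsure about.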
The second approach I tried is to keep blockSize fixed and vary n, so that numBlocks = n / blockSize changes with it, adjusting it manually by hit-and-trial and checking the instantaneous change in GPU utilization through the nvidia-smi command (a host-side sketch of this is shown after the kernel below). But I do not find this approach very rigorous. Moreover, for smaller kernels the execution time is so short that the kernel finishes before any change shows up in the GPU % utilization reported by watch -n 0.1 nvidia-smi.
I am using this simple kernel for testing:
__global__ void MyKernel(float* a, float* b, float* c, int n)
{
    // Element-wise vector addition: one thread per element.
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)
    {
        c[idx] = a[idx] + b[idx];
    }
}
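For completeness, the host-side driver I use for the hit-and-trial in the second approach looks roughly like this (a sketch; the value of n is arbitrary and error checking is stripped):

#include <cuda_runtime.h>

int main()
{
    const int blockSize = 32;   // kept fixed in the second approach
    const int n = 1 << 24;      // picked by hit-and-trial while watching nvidia-smi

    float *a, *b, *c;
    cudaMalloc(&a, n * sizeof(float));
    cudaMalloc(&b, n * sizeof(float));
    cudaMalloc(&c, n * sizeof(float));

    int numBlocks = (n + blockSize - 1) / blockSize;   // numBlocks = n / blockSize, rounded up

    // Launch and wait; while this runs I watch the GPU utilization with
    // "watch -n 0.1 nvidia-smi", which is too coarse for short-lived kernels.
    MyKernel<<<numBlocks, blockSize>>>(a, b, c, n);
    cudaDeviceSynchronize();

    cudaFree(a);
    cudaFree(b);
    cudaFree(c);
    return 0;
}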
Also, is there a specific way to confirm that my kernel, while executing, is using x% of the GPU, other than watching watch -n 0.1 nvidia-smi? For example, can a profiling tool such as ncu do this, and if so, how do I check it?