How to tune the SM utilization (across the entire GPU) of a CUDA kernel?

I want to run experiments with CUDA MPS (Multi-Process Service). I want to launch many CUDA kernels on the GPU simultaneously (concurrently), such that each kernel has almost the same end-to-end running time whether it runs alone or concurrently with the other kernels.

For example, suppose I have a kernel, KernA, and I would like to launch many instances of KernA on the GPU simultaneously. I want to manually control the SM utilization of each instance (even hard-coding it into the .cu program is fine). If I launch two instances of KernA simultaneously, I want each instance alone to have an SM utilization of 50%. [The input size does not matter to me; all I want is a kernel launch whose SM utilization I can tune.] That way, when the two instances run simultaneously on the GPU under CUDA MPS, their 50% each would roughly add up to 100% of the GPU, and running both kernels concurrently would take about as long as running each instance alone.

Similarly, if I launch 3 kernels simultaneously, I want each instance to have an SM utilization of around 33%, so that the 3 kernels together make up 100% of the GPU.

And if I launch 4 kernels simultaneously, I want each instance to have an SM utilization of around 25%, so that the 4 kernels together make up 100%.

Generalizing: if I launch n kernels simultaneously, I want each instance to have an SM utilization of around (100/n)%, so that the n kernels together make up 100%.
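
To make the goal concrete, here is a minimal sketch of the sizing logic I have in mind. Dividing the SMs evenly like this is purely my assumption (nothing guarantees the block scheduler distributes blocks this way), and KernA stands in for the real kernel:

#include <cuda_runtime.h>

__global__ void KernA(float* a, float* b, float* c, int n)
{
    // placeholder body; the real kernel I use is shown further below
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) c[idx] = a[idx] + b[idx];
}

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    int numInstances   = 2;                                          // concurrent kernel instances
    int smsPerInstance = prop.multiProcessorCount / numInstances;    // e.g. 23 of the 46 SMs on my RTX 2080

    // Assumption: with one resident block per SM, launching smsPerInstance
    // blocks should occupy roughly 1/numInstances of the GPU.
    // KernA<<<smsPerInstance, 32>>>(a, b, c, elems);   // a, b, c, elems allocated elsewhere
    return 0;
}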


The first approach I tried:

I tried using the CUDA runtime API cudaOccupancyMaxActiveBlocksPerMultiprocessor, following the program example given here.
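
For reference, the core of that example looks roughly like this (adapted from the programming-guide sample; MyKernel is the same kernel as in the listing further below):

#include <cstdio>
#include <cuda_runtime.h>

__global__ void MyKernel(float* a, float* b, float* c, int n)
{
    // same kernel as in the listing further below
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) c[idx] = a[idx] + b[idx];
}

int main()
{
    int numBlocks;        // maximum active blocks per SM, reported by the API
    int blockSize = 32;   // threads per block

    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&numBlocks, MyKernel, blockSize, 0);

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    // Occupancy = resident warps per SM / maximum warps per SM
    int activeWarps = numBlocks * blockSize / prop.warpSize;
    int maxWarps    = prop.maxThreadsPerMultiProcessor / prop.warpSize;
    printf("Occupancy: %f%%\n", (double)activeWarps / maxWarps * 100);
    return 0;
}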

The problems I ran into with this approach are as follows (I am using an RTX 2080 GPU):

  • I take the block size as an input. For block size = 32, I get 50% utilization per SM.
  • If I halve the block size to 16, 8, 4, 2, 1, …, the per-SM utilization halves as well: 25%, 12.5%, …
  • The warp size of the GPU I am using is 32. If I use a block size smaller than the warp size, then as far as I know the GPU pads the block with dummy threads to make up a full warp of 32 threads, since a warp executes in lock-step. Because of this padding, I suspect I cannot launch more than two kernels concurrently (say with block size = 16 and 4 concurrent kernels) and expect them to pack into the GPU as nicely as in the example above.
  • The other parameter I can tweak is the number of blocks. The API above gives the maximum number of active blocks per SM. So if, for a kernel, the API sets the numBlocks variable to, say, 16 and reports Occupancy: 50%, we cannot just set numBlocks = 8 and launch the kernel (i.e., kernA<<<8, 32>>>()) and expect 25% utilization of the entire GPU, because my RTX 2080 has 46 streaming multiprocessors (SMs). Since the API computes numBlocks for a single SM, the expected utilization over all the SMs would be only (25/46)%; see the sketch after this list.
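
Here is a sketch of what I mean in the last bullet. The idea that the per-SM block count has to be replicated across every SM to target a whole-GPU fraction is my assumption:

#include <cuda_runtime.h>

__global__ void MyKernel(float* a, float* b, float* c, int n)
{
    // same kernel as in the listing further below
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) c[idx] = a[idx] + b[idx];
}

int main()
{
    int numBlocksPerSM = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&numBlocksPerSM, MyKernel, 32, 0);
    // Suppose this reports numBlocksPerSM == 16 (Occupancy: 50%).

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);  // prop.multiProcessorCount == 46 on my RTX 2080

    // Launching only 8 blocks total touches at most 8 of the 46 SMs,
    // which is nowhere near 25% of the whole GPU:
    // MyKernel<<<8, 32>>>(a, b, c, n);

    // Assumption: to target ~25% of the whole GPU, the halved per-SM block
    // count would have to be replicated across every SM:
    int totalBlocks = (numBlocksPerSM / 2) * prop.multiProcessorCount;
    (void)totalBlocks;  // MyKernel<<<totalBlocks, 32>>>(a, b, c, n);
    return 0;
}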

The second approach I tried: keeping blockSize fixed, I change the value of n, and hence numBlocks = n / blockSize, by trial and error, and check the instantaneous change in GPU utilization with the nvidia-smi command. But I do not find this approach very rigorous. Also, for smaller kernels, the execution time is so short that the kernel finishes without showing any change in GPU utilization under watch -n 0.1 nvidia-smi; see the stretched kernel variant after the listing below.


I am using this simple kernel for testing:

__global__ void MyKernel(float* a, float* b, float* c, int n)
{
    // Global thread index across the whole grid
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)
    {
        // Element-wise vector addition
        c[idx] = a[idx] + b[idx];
    }
}
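
To keep the kernel alive long enough to register in nvidia-smi, a stretched variant along the following lines is what I have been considering; the extra loop and its trip count are arbitrary and exist purely to burn time:

__global__ void MyKernelLong(float* a, float* b, float* c, int n, int iters)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)
    {
        float acc = a[idx] + b[idx];
        // Dummy dependent work so the kernel runs for a measurable time;
        // the compiler cannot remove it because acc feeds the final store.
        for (int i = 0; i < iters; ++i)
            acc = acc * 1.000001f + b[idx];
        c[idx] = acc;
    }
}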


Also, is there any specific way to confirm that my kernel, while executing, is using x% of the GPU, apart from the watch -n 0.1 nvidia-smi command? Any profiling tool, for example ncu, and how would I check that with it?
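
For what it is worth, the closest thing I have found in the Nsight Compute documentation is the achieved-occupancy metric; is an invocation like the one below (my guess, with ./my_app standing in for my test binary) the right way to check this?

ncu --metrics sm__warps_active.avg.pct_of_peak_sustained_active ./my_app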

@Robert_Crovella I am the one who asked the same question on StackOverflow. :(

Yes, it’s not for your benefit that I added that. Speaking for myself, and possibly for others: before I spend time trying to answer a question that has been posted in multiple places, I would like to know what has already been said in those places. Also, for the benefit of others who may read your question a year from now, it can be helpful to know that it is posted elsewhere.

It’s not for your benefit. I fully expect that you know the places that you have posted it.


Posting the same question in multiple places is known as cross-posting. And since time immemorial (OK, the 1980s :-) it has been considered good form for askers to indicate where else they have posted the same question.

This avoids redundant effort, as there is no point in crafting an answer to a question that has already been satisfactorily answered elsewhere. As an answerer, it can be quite frustrating to find that out after the fact.
