Optimizing "small" kernels on GPUs in OpenACC

Hi,

In my code there is an unavoidable section that computes thousands of iterations on a small piece of (2D) data. Using OpenACC to accelerate the code yields very poor results (slower on a V100 than even a 6-core CPU) due to kernel launch overhead and/or low GPU utilization.
All the data is already resident on the GPU so it is not a transfer problem.
This is not a new issue; even my codes back in 2010 performed poorly when run on very small problems.

Since these small-data computations are required, are there any hints/suggestions you may have for speeding up the small kernels?

I already tried combining multiple kernels when possible, using async when possible, and collapsing all 2D loops.

I also tried running the small kernels on the CPU, but that is inefficient since I am running with 1 MPI rank per GPU. I was thinking of going hybrid, i.e. finding a way to switch between OpenACC CPU multicore and the GPU, and launching the job with the proper number of ranks and threads, with affinity set so that the multicore part is efficient, but that sounds like a mess to deal with…
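Roughly what I have in mind is something like the sketch below, assuming the binary is built for both targets (e.g. with nvfortran -acc=gpu,multicore, or the older -ta=tesla,multicore); the messy part is then getting the ranks, threads, and affinity right on top of it:

use openacc
...
! pick the target per region at runtime ("small_problem" is a made-up flag)
! (keeping the data coherent between the host and device versions is a
!  separate headache)
if (small_problem) then
   call acc_set_device_type(acc_device_host)    ! multicore CPU version
else
   call acc_set_device_type(acc_device_nvidia)  ! GPU version
end if
!$acc parallel loop collapse(2)
do j=1,n
   do i=1,n
      a(i,j) = a(i,j) + 1.0
   end do
end do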

Do you know of any future CUDA library, NVIDIA driver, or new hardware that plans to address the “small kernel” issue?
Are there any special OpenACC clause options that would speed up a small kernel, such as setting certain values for vector length, gangs, etc.?
Any compiler flags (that would not mess up the large kernels)?

Thanks for any help you can give on this,

- Ron

Hi Ron,

Any compiler flags (that would not mess up the large kernels)?

No. Since this is more of an algorithmic issue, the compiler isn't going to help you much.

Are there any special OpenACC clause options that would speed up a small kernel, such as setting certain values for vector length, gangs, etc.?

If there are multiple of these small kernels launched consecutively, then async is the best bet since you can hide the launch latency. Though if the launch latency is longer than the kernel time, then this won't help much. (CUDA Graphs may help here; more on this below.)
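For example, putting back-to-back kernels on the same async queue lets the next launch be issued while the previous kernel is still running (a minimal sketch with made-up loop bodies):

!$acc parallel loop collapse(2) async(1)
do j=1,n
   do i=1,n
      a(i,j) = a(i,j) + 1.0
   end do
end do
!$acc parallel loop collapse(2) async(1)
do j=1,n
   do i=1,n
      b(i,j) = 2.0*b(i,j)
   end do
end do
!$acc wait(1)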

Another possibility is to batch them. While I haven’t tried this myself, I’m thinking something like:

!$acc parallel loop gang
do i=1,number_of_kernels
    if (i.eq.1) then   !! or possibly a case statement
        !$acc loop vector
        ... kernel 1 ...
    else if (i.eq.2) then
        !$acc loop vector
        ... kernel 2 ...
    ... etc.
    end if
end do

This way each gang will execute an individual kernel (as a vector loop), reducing the launch overhead and better utilizing the GPU. Though this requires that there are no dependencies between the kernels.
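To make that concrete, here's a self-contained version of the idea, with two made-up, independent 2D updates standing in for your real kernels:

!$acc parallel loop gang
do k=1,2
   if (k.eq.1) then
      !$acc loop vector collapse(2)
      do j=1,n
         do i=1,n
            a(i,j) = a(i,j) + 1.0
         end do
      end do
   else
      !$acc loop vector collapse(2)
      do j=1,n
         do i=1,n
            b(i,j) = 2.0*b(i,j)
         end do
      end do
   end if
end do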

If there are dependencies, then you might be able to switch to using CUDA Fortran for this portion and then use cooperative groups to achieve global synchronization. I posted an example code for this as part of my reply here: "program hangs when copying between host/device".
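The rough shape in CUDA Fortran would be something like the following (a sketch with made-up bodies, not the exact code from that post; it needs a device and driver with cooperative launch support, and grid_global kernels are launched with the '*' grid syntax so the runtime sizes the grid for cooperative launch):

attributes(grid_global) subroutine fused_steps(a, b, n)
  use cooperative_groups
  implicit none
  integer, value :: n
  real :: a(n*n), b(n*n)
  type(grid_group) :: gg
  integer :: idx
  gg = this_grid()
  ! step 1: grid-stride loop over the flattened 2D data
  do idx = (blockIdx%x-1)*blockDim%x + threadIdx%x, n*n, gridDim%x*blockDim%x
     a(idx) = a(idx) + 1.0
  end do
  ! grid-wide barrier replaces the kernel boundary between dependent steps
  call syncthreads(gg)
  ! step 2: consumes step 1's results
  do idx = (blockIdx%x-1)*blockDim%x + threadIdx%x, n*n, gridDim%x*blockDim%x
     b(idx) = 2.0*a(idx)
  end do
end subroutine

! launched as:  call fused_steps<<<*,256>>>(a_d, b_d, n)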

Do you know of any future CUDA library, NVIDIA driver, or new hardware that plans to address the “small kernel” issue?

CUDA Graphs may be useful here, assuming you're repeatedly launching the same small kernels over and over. You'd need to switch to using CUDA Fortran, though. I have an example in the following post.
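The basic stream-capture pattern looks roughly like this (a sketch only; small_kernel1/small_kernel2 stand in for your kernels, grid/block are assumed to be set elsewhere, and the exact cudaGraphInstantiate interface can vary between NVHPC versions, so check the cudafor module for yours):

use cudafor
type(cudaGraph) :: graph
type(cudaGraphExec) :: gexec
integer(kind=cuda_stream_kind) :: stream
integer :: istat, step

istat = cudaStreamCreate(stream)

! capture the sequence of small launches once...
istat = cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal)
call small_kernel1<<<grid, block, 0, stream>>>(a_d, n)
call small_kernel2<<<grid, block, 0, stream>>>(b_d, n)
istat = cudaStreamEndCapture(stream, graph)
istat = cudaGraphInstantiate(gexec, graph, 0)

! ...then replay the whole sequence with a single launch per step
do step = 1, nsteps
   istat = cudaGraphLaunch(gexec, stream)
end do
istat = cudaStreamSynchronize(stream)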

-Mat