In my code there is an unavoidable section that computes thousands of iterations on a small piece of data (2D). Using OpenACC to accelerate this section yields very poor results (slower on a V100 than even a 6-core CPU) due to kernel launch overhead and/or low GPU utilization.
All the data is already resident on the GPU so it is not a transfer problem.
This is not a new issue; even my codes from around 2010 performed poorly when used with very small problems.
Since these small-data computations are required, are there any hints/suggestions you may have for speeding up the small kernels?
I already tried combining multiple kernels when possible, using async when possible, and collapsing all 2D loops.
I also tried running the small kernels on the CPU, but that is inefficient since I run one MPI rank per GPU. I was considering going hybrid, i.e., finding a way to switch between OpenACC CPU multicore and the GPU, launching the job with the proper number of ranks and threads, and setting affinity so that the multicore part is efficient, but that sounds like a mess to deal with…
Do you know of any future CUDA library, NVIDIA driver, or new hardware that plans to address the “small kernel” issue?
Are there any special OpenACC clause options that would speed up a small kernel, such as setting certain values for vector length, number of gangs, etc.?
Are there any compiler flags (that would not hurt the large kernels)?
Thanks for any help you can give on this,