How OpenACC improves register utilization

wanghr3231 · October 14, 2018, 1:19pm

I profiled the program, and the profiler's analysis is as follows:

GPU utilization is limited by register usage
The kernel uses 238 registers for each thread (30464 registers for each block). This register usage is likely preventing the kernel from fully utilizing the GPU. Device “GTX750ti” provides up to 65536 registers for each block. Because the kernel uses 30464 registers for each block, each SM is limited to simultaneously executing 2 blocks (8 warps). Chart “varying register count” below shows how changing register usage will change the number of blocks that can execute on each SM.
Optimization: Use the -maxrregcount flag or the __launch_bounds__ qualifier to decrease the number of registers used by each thread. This will increase the number of blocks that can execute on each SM. On devices with compute capability 5.2, turning global cache off can increase the occupancy limited by register usage.

I want to know how to use OpenACC to fix the performance loss caused by this register usage. My block limit is currently 2; the maximum could be 32, which the profiler highlights in red. So how can I use OpenACC to control the number of registers each warp uses?
Note: the parallel part (a for loop) contains many intermediate matrix variables.

Registers/thread = 238, max = 65536

MatColgrove · October 15, 2018, 2:35pm

Hi wanghr323,

Just like CUDA, with OpenACC the register allocation is done by the back-end ptxas tool. You can have ptxas limit the number of registers by passing in the flag “-ta=tesla:maxregcount:n”, where “n” is the number of registers. However, the local memory needed by the kernel doesn’t go away. Rather, by limiting the number of registers, the local memory spills, first to L1/L2 cache and then to main memory. Spilling to cache is fine, but performance will be severely hurt if it spills to main memory.

The main methods to reduce register usage are to limit the number of local scalars in use in your code and/or split the kernel into multiple kernels.

-Mat