How can OpenACC improve register utilization?
I profiled my program, and the profiler's analysis reported the following:
GPU utilization is limited by register usage
The kernel uses 238 registers for each thread (30464 registers for each block). This register usage is likely preventing the kernel from fully utilizing the GPU. Device “GTX 750 Ti” provides up to 65536 registers for each block. Because the kernel uses 30464 registers for each block, each SM is limited to simultaneously executing 2 blocks (8 warps). Chart “Varying Register Count” below shows how changing register usage will change the number of blocks that can execute on each SM.
Optimization: Use the -maxrregcount flag or the __launch_bounds__ qualifier to decrease the number of registers used by each thread. This will increase the number of blocks that can execute on each SM. On devices with compute capability 5.2, turning global cache off can increase the occupancy limited by register usage.
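For reference, the block limit the profiler reports can be reproduced with simple arithmetic. The sketch below is a simplification under stated assumptions: the function name is hypothetical, the block size of 128 threads is inferred from 30464 / 238, and the calculation ignores register allocation granularity, shared memory, and the hardware caps on blocks and warps per SM.

```c
#include <assert.h>

/* Hypothetical helper: number of blocks per SM that the register file
 * alone allows. Simplified: ignores register allocation granularity,
 * shared memory use, and the hardware limits on blocks/warps per SM. */
int blocks_per_sm_by_registers(int regs_available,
                               int regs_per_thread,
                               int threads_per_block)
{
    assert(regs_per_thread > 0 && threads_per_block > 0);
    /* Registers one block needs, e.g. 238 * 128 = 30464. */
    int regs_per_block = regs_per_thread * threads_per_block;
    /* Integer division rounds down: only whole blocks can be resident. */
    return regs_available / regs_per_block;
}
```

With the profiler's numbers, blocks_per_sm_by_registers(65536, 238, 128) gives 2, which matches the reported limit of 2 blocks (8 warps of 32 threads). Capping the kernel at 64 registers per thread would raise the same estimate to 8 blocks.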
I want to know how to use OpenACC to fix the performance loss caused by this register limit. My block limit is currently 2; the maximum can be 32, which the profiler highlights in red. How can I use OpenACC to control the number of registers each thread (and therefore each warp) uses?
Note: the parallel region (the for loop) uses many intermediate matrices as temporary variables.
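To make the question concrete, here is a minimal sketch of the kind of loop involved. The kernel, its name, and the array size are hypothetical stand-ins, not my real code. As far as I know, OpenACC has no clause that sets a register count directly. With the PGI / NVIDIA HPC compilers the cap is instead a compile option (-ta=tesla:maxregcount:64 with pgcc, or -gpu=maxregcount:64 with nvc; the value 64 is only an example), and a cap that is too low makes the compiler spill registers to local memory. The directive itself can still set the block size with vector_length, and keeping the intermediate arrays small and local to the loop body is assumed here to help the compiler keep register pressure down.

```c
#define TMP_LEN 4  /* hypothetical size of the per-iteration scratch array */

/* Hypothetical kernel: out[i] = in[i] * (1 + 2 + ... + TMP_LEN).
 * Compiled without OpenACC support, the pragmas are ignored and the
 * loop simply runs sequentially on the CPU. */
void scale_sum(const float *restrict in, float *restrict out, int n)
{
    /* vector_length(128) requests 128 threads per block, the block size
     * implied by the profiler output (30464 / 238). */
    #pragma acc parallel loop vector_length(128) copyin(in[0:n]) copyout(out[0:n])
    for (int i = 0; i < n; ++i) {
        /* Declared inside the loop body, so each iteration (thread)
         * gets its own private copy of the scratch array. */
        float tmp[TMP_LEN];

        #pragma acc loop seq
        for (int k = 0; k < TMP_LEN; ++k)
            tmp[k] = in[i] * (float)(k + 1);

        float acc = 0.0f;
        #pragma acc loop seq
        for (int k = 0; k < TMP_LEN; ++k)
            acc += tmp[k];

        out[i] = acc;
    }
}
```

Whether the scratch array ends up in registers or in local memory is the compiler's decision; building with -Minfo=accel (PGI / NVIDIA HPC) prints how each loop was scheduled, and the profiler shows the resulting register count.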