OpenACC registers per thread

Can pgcc report the number of registers per thread when compiling OpenACC programs for NVIDIA GPUs?

In some older forum posts, I saw that the output from PGI_ACC_NOTIFY included this info. For example:

CC 1.0 : 9 registers; 64 shared, 0 constant, 0 local memory bytes
CC 2.0 : 14 registers; 0 shared, 80 constant, 0 local memory bytes

but my version of pgcc (v14.9) doesn’t give this info.

Hi Ron,

Add the sub-option “ptxinfo” to your compilation. This will display the registers-per-thread count as well as shared memory usage.

-ta=tesla:ptxinfo

You can also set the maximum number of registers per thread via the flag “-ta=tesla:maxregcount:n”, where “n” is the number of registers.

  • Mat

And, as an aside, explore that option! I’ve found some cases where I could double my performance by choosing a number different than what the compiler used. (Note, I also had cases where it didn’t matter one whit!)
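One way to explore it is to build the same source several times with different limits and time each binary. Below is a minimal sketch that only generates the compile commands. The file name "app.c", the output naming scheme, and the list of register counts are placeholders I made up; the flags are the ones mentioned above, combined with a comma.

```python
def sweep_commands(source, reg_counts, compiler="pgcc"):
    """Build one compile command per register limit.

    Only the -ta=tesla sub-options come from this thread; the source file
    and output names are hypothetical placeholders.
    """
    commands = []
    for n in reg_counts:
        if n <= 0:
            raise ValueError("register count must be positive")
        commands.append(
            f"{compiler} -acc -ta=tesla:maxregcount:{n},ptxinfo "
            f"-o app_r{n} {source}"
        )
    return commands


if __name__ == "__main__":
    # Print the commands so they can be pasted into a shell or a script.
    for cmd in sweep_commands("app.c", [16, 32, 64, 128]):
        print(cmd)
```

Each build prints its own ptxinfo report, so you can see how many registers were actually used next to the timing you measure.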

Yes, the number of registers used can have a huge impact on performance. The more registers each thread uses, the fewer threads can be resident at once, which means lower occupancy and often lower performance. However, using too few registers leads to spilling and, again, lower performance.

Some spilling is OK since it initially spills to the L1 cache. But once the program spills to global memory, performance tanks. The trick is finding the spot just before spills go to global memory.

Using NVVP or nvprof is very helpful here, as is the CUDA Occupancy Calculator spreadsheet.
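To make the occupancy side of that trade-off concrete, here is a rough back-of-the-envelope sketch of how the register count caps the number of resident warps. The defaults (65536 registers and 2048 threads per multiprocessor, 32 threads per warp) are my assumptions for a Kepler-class device, so check them against your own GPU. It also ignores block size, shared memory, and register allocation granularity, all of which the Occupancy Calculator does account for, so treat the result as an upper bound rather than the real figure.

```python
def occupancy_from_registers(regs_per_thread,
                             regs_per_sm=65536,        # assumed, device dependent
                             max_threads_per_sm=2048,  # assumed, device dependent
                             warp_size=32):
    """Estimate the occupancy ceiling imposed by register usage alone."""
    if regs_per_thread <= 0:
        raise ValueError("regs_per_thread must be positive")
    # How many threads fit in the register file, in whole warps.
    warps_that_fit = (regs_per_sm // regs_per_thread) // warp_size
    # The hardware limit on resident warps, regardless of registers.
    max_warps = max_threads_per_sm // warp_size
    return min(warps_that_fit, max_warps) / max_warps


if __name__ == "__main__":
    for regs in (14, 32, 40, 64, 128):
        print(regs, occupancy_from_registers(regs))
```

Under these assumptions, anything up to 32 registers per thread leaves occupancy unconstrained, 64 registers halves it, and 128 cuts it to a quarter. That is why a lower maxregcount can help, right up until spilling starts to cost more than the extra occupancy gains.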