How to reduce the number of registers

Hello,

I’m trying to optimize my CUDA program. The CUDA profiler gives me this:

Occupancy analysis :
Kernel details : Grid size: 8 x 1, Block size: 96 x 1 x 1
Register Ratio = 0.875 ( 7168 / 8192 ) [14 registers per thread]
Shared Memory Ratio = 0.125 ( 2048 / 16384 ) [416 bytes per Block]
Active Blocks per SM = 4 : 8
Active threads per SM = 384 : 768
Occupancy = 0.5 ( 12 / 24 )
Achieved occupancy = 0.25 (on 4 SMs)
Occupancy limiting factor= Registers

To increase my occupancy, I need to decrease the use of registers. But how can I do that?
Which instructions or calls cause registers to be allocated?
Are intermediate calculations put into registers?

Thanks.

You can explicitly limit the number of registers with the nvcc option -maxrregcount; however, it is not obvious that you should. First, this limit will likely cause registers to spill into local memory, which is slow (it is much less of a problem on Fermi, where local memory is cached). Second, maximal occupancy does not necessarily give maximal kernel speed, so you need to measure how fast your code actually runs at each level of occupancy.

Your grid and block sizes are too small. They simply do not fill the GPU. That is the problem, not the number of registers.

Hello,

Thanks for your two answers.

I tried to limit the number of registers to 8 and got the following result:

Occupancy analysis for kernel ‘testRandShared’ for context ‘Session11 : Device_0 : Context_0’ :
Kernel details : Grid size: 8 x 1, Block size: 96 x 1 x 1
Register Ratio = 1 ( 8192 / 8192 ) [8 registers per thread]
Shared Memory Ratio = 0.25 ( 4096 / 16384 ) [416 bytes per Block]
Active Blocks per SM = 8 : 8
Active threads per SM = 768 : 768
Occupancy = 1 ( 24 / 24 )
Achieved occupancy = 0.25 (on 4 SMs)
Occupancy limiting factor = None

But as you said, I get worse performance when I limit the number of registers.
@lev: I think my grid and block sizes are fine, because with the register limit I get an occupancy of 1.

So we have no real control over how registers are assigned in the code?
Maybe by reusing variables?

No, look at your numbers:

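Using only 8 registers, each multiprocessor could run at full occupancy if there were enough blocks to actually fill them, but there aren’t: 8 blocks spread over 4 SMs is only 2 blocks (6 warps) per SM, which is exactly the achieved occupancy of 0.25 (6 / 24) that the profiler reports.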
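
For what it's worth, here is a rough sketch of that arithmetic in plain host-side C++. The function name achievedOccupancy is made up for illustration, and it assumes the blocks are spread evenly over the SMs; the real scheduler may behave differently. The per-SM block limit and the 24-warp maximum are taken straight from the profiler output above.

```cpp
#include <algorithm>

// Hypothetical helper: estimates the occupancy an SM actually reaches for a
// given launch. Assumes blocks are distributed evenly over the SMs.
// maxBlocksPerSM is the "Active Blocks per SM" limit the profiler reports.
double achievedOccupancy(int gridBlocks, int threadsPerBlock, int numSMs,
                         int maxBlocksPerSM, int maxWarpsPerSM) {
    const int warpSize = 32;
    // Warps needed by one block, rounded up.
    int warpsPerBlock = (threadsPerBlock + warpSize - 1) / warpSize;
    // Blocks that land on the busiest SM, rounded up.
    int blocksPerSM = (gridBlocks + numSMs - 1) / numSMs;
    // An SM cannot hold more blocks than its resource limit allows.
    int residentBlocks = std::min(blocksPerSM, maxBlocksPerSM);
    return static_cast<double>(residentBlocks * warpsPerBlock) / maxWarpsPerSM;
}
```

With the numbers from this thread, achievedOccupancy(8, 96, 4, 4, 24) for 14 registers and achievedOccupancy(8, 96, 4, 8, 24) for 8 registers both come out at 0.25, because the grid only supplies 2 blocks per SM in either case. The register limit never gets a chance to matter until the grid is large enough to hit it.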

No, the compiler already optimizes the lifespan of variables independently of their scope in your code. So just changing the names of variables does not help. But actually restructuring your code may help if it enables the optimizer to sufficiently reduce variable lifespans.

You were right, I was focused on occupancy and not on achieved occupancy.

So there is no way to control the registers directly :(

Thanks for everything!