I am hitting around 50% occupancy on my GTX 1080 with 64 registers per thread. The bottleneck preventing higher occupancy is the number of registers per thread. I could force the maximum number of registers per thread below 64 with a compiler flag (nvcc's -maxrregcount), but that would cause register spilling to local memory.
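To make the arithmetic behind that 50% concrete, here is a minimal sketch. It assumes the device reports 65,536 registers and 2,048 resident threads per SM (which I believe is what a GTX 1080 reports), so 64 registers per thread leaves room for only 1,024 threads. The kernel `scaleKernel` and the `__launch_bounds__(256, 4)` values are hypothetical, chosen only to show the per-kernel alternative to a global register cap.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel, used only as something to query.
// __launch_bounds__(256, 4): blocks have at most 256 threads, and the compiler
// should try to keep register use low enough for 4 such blocks per SM.
__global__ void __launch_bounds__(256, 4)
scaleKernel(float *data, int n, float factor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main()
{
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) {
        fprintf(stderr, "no CUDA device found\n");
        return 1;
    }

    // Register-limited occupancy if every thread used 64 registers
    // (ignores register allocation granularity).
    const int regsPerThread = 64;
    int threadsByRegs = prop.regsPerMultiprocessor / regsPerThread;
    if (threadsByRegs > prop.maxThreadsPerMultiProcessor)
        threadsByRegs = prop.maxThreadsPerMultiProcessor;
    printf("register-limited occupancy at %d regs/thread: %.0f%%\n",
           regsPerThread,
           100.0 * threadsByRegs / prop.maxThreadsPerMultiProcessor);

    // Occupancy the runtime predicts for the kernel as actually compiled.
    const int blockSize = 256;
    int blocksPerSM = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, scaleKernel,
                                                  blockSize, 0);
    printf("predicted occupancy for scaleKernel: %.0f%%\n",
           100.0 * blocksPerSM * blockSize / prop.maxThreadsPerMultiProcessor);
    return 0;
}
```

As far as I understand, `__launch_bounds__` applies only to the kernel it annotates, while `-maxrregcount` applies to every kernel in the compilation unit. Either one can still push values out to local memory if the cap is too tight.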
From answers in other forums, the general way to decrease the number of registers used per thread is to:
- take advantage of the shared memory, since shared memory has higher bandwidth than local memory
- break up a big kernel into smaller kernels, where each kernel computes a portion of what the big kernel computed and saves its partial output to global memory.
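As an illustration of the first suggestion, here is a minimal sketch in which a per-thread scratch array lives in shared memory instead of being declared as a local array. The kernel `binSums`, the sizes `BLOCK` and `K`, and the indexing pattern are all made up for this example. Each thread only touches its own column of the array, so no `__syncthreads()` is needed.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

constexpr int BLOCK = 128;  // threads per block (hypothetical choice)
constexpr int K = 16;       // scratch entries per thread (hypothetical choice)

__global__ void binSums(const float *in, float *out, int n)
{
    // [K][BLOCK] layout: consecutive threads access consecutive addresses,
    // which avoids shared memory bank conflicts.
    __shared__ float scratch[K][BLOCK];

    int t = threadIdx.x;
    int i = blockIdx.x * blockDim.x + t;

    for (int k = 0; k < K; ++k) scratch[k][t] = 0.0f;

    if (i < n) {
        float v = in[i];
        for (int step = 0; step < 64; ++step) {
            int k = (i + step * 7) % K;  // index known only at run time
            scratch[k][t] += v * step;
        }
        float sum = 0.0f;
        for (int k = 0; k < K; ++k) sum += scratch[k][t];
        out[i] = sum;
    }
}

int main()
{
    const int n = 1024;
    std::vector<float> hostIn(n, 1.0f), hostOut(n, 0.0f);

    float *devIn = nullptr, *devOut = nullptr;
    if (cudaMalloc(&devIn, n * sizeof(float)) != cudaSuccess ||
        cudaMalloc(&devOut, n * sizeof(float)) != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed\n");
        return 1;
    }
    cudaMemcpy(devIn, hostIn.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    binSums<<<(n + BLOCK - 1) / BLOCK, BLOCK>>>(devIn, devOut, n);

    cudaError_t err = cudaMemcpy(hostOut.data(), devOut, n * sizeof(float),
                                 cudaMemcpyDeviceToHost);
    if (err != cudaSuccess) {
        fprintf(stderr, "kernel or copy failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    printf("out[0] = %f\n", hostOut[0]);

    cudaFree(devIn);
    cudaFree(devOut);
    return 0;
}
```

The trade-off is that this uses K * BLOCK * 4 bytes of shared memory per block (8 KB here), and shared memory per SM is itself a limit on occupancy, so the gain is not guaranteed.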
The general consensus seemed to be that it is hard to predict which values will end up in registers, because nvcc optimizes the generated code heavily.
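Since the allocation is hard to predict, the practical option seems to be measuring it. Below is a minimal sketch that asks the runtime what the compiler actually allocated for a kernel. `saxpyKernel` is a hypothetical stand-in for the real kernel. Compiling with `nvcc -Xptxas -v` should print similar per-kernel numbers at build time.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel standing in for the real one.
__global__ void saxpyKernel(const float *x, float *y, int n, float a)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

int main()
{
    cudaFuncAttributes attr;
    cudaError_t err = cudaFuncGetAttributes(&attr, saxpyKernel);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaFuncGetAttributes failed: %s\n",
                cudaGetErrorString(err));
        return 1;
    }

    // numRegs: registers allocated per thread.
    // localSizeBytes: local memory per thread; a nonzero value points to
    // spilled registers or local arrays that were not kept in registers.
    printf("registers per thread:            %d\n", attr.numRegs);
    printf("local memory per thread (bytes): %zu\n", attr.localSizeBytes);
    printf("static shared memory (bytes):    %zu\n", attr.sharedSizeBytes);
    return 0;
}
```

Rerunning this after each change (fewer local variables, smaller arrays, a different register cap) shows whether the change actually moved the register count or just moved data into local memory.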
My questions are:
- Does using fewer statically sized local arrays or fewer local variables decrease the number of registers used per thread?
- Does decreasing the number of kernel parameters decrease the number of registers used per thread?
- Are there any other methods to decrease registers per thread (without spilling to local memory) and increase occupancy?