I am trying to use the CUDA occupancy calculator to improve my program's performance, since the performance is poor and the occupancy is only 17%. The parameters are as follows:
threads per block: 128. (128 threads in a 1-D block)
registers per thread: 41.
shared memory per block: 80.
It seems my program uses too many registers, and using fewer would give higher occupancy. It also looks like the compiler decides this number, so what can I do in my program if I want to decrease it? I tried cutting some of the variables in my program; this does decrease the local memory use but has no effect on the register count.
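First of all, you can increase your threads per block to 192 or decrease it to 64, and you’ll get better occupancy. At 128 threads, a block needs 128*41=5248 registers, so only one block fits per multiprocessor (MP). 192*41=7872, which is just under the 8192 register limit on the 8/9-series, so a single block of 192 threads still fits. If you run 64 threads per block (64*41=2624 registers), the runtime will launch three blocks per MP, which will also result in 192 threads.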
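To make the arithmetic above easy to re-check for other block sizes, here is a rough sketch in plain C++ (no GPU needed). The 8192-register limit is the figure quoted above; the 768-thread limit per MP is an assumption inferred from 192 threads being called 25%. The function and constant names are made up for illustration, and the model deliberately ignores shared memory, the per-MP block limit, and register allocation granularity, so treat it as an estimate and not a replacement for the occupancy calculator.

```cpp
// Simplified, register-limited occupancy estimate for one multiprocessor (MP).
// Assumed limits (8/9-series): 8192 registers and 768 resident threads per MP.
// Ignores shared memory, the per-MP block limit, and allocation granularity.
constexpr int kRegsPerMP = 8192;       // register file size per MP
constexpr int kMaxThreadsPerMP = 768;  // assumed max resident threads per MP

// Number of blocks that fit on one MP under the two limits above.
constexpr int blocksPerMP(int threadsPerBlock, int regsPerThread) {
    const int byRegs = kRegsPerMP / (threadsPerBlock * regsPerThread);
    const int byThreads = kMaxThreadsPerMP / threadsPerBlock;
    return byRegs < byThreads ? byRegs : byThreads;
}

// Total threads resident on one MP.
constexpr int residentThreads(int threadsPerBlock, int regsPerThread) {
    return blocksPerMP(threadsPerBlock, regsPerThread) * threadsPerBlock;
}

// Occupancy as a percentage of the assumed thread limit.
constexpr double occupancyPercent(int threadsPerBlock, int regsPerThread) {
    return 100.0 * residentThreads(threadsPerBlock, regsPerThread)
           / kMaxThreadsPerMP;
}
```

Under these assumptions, residentThreads(128, 41) is 128 (about 17%), while residentThreads(192, 41) and residentThreads(64, 41) are both 192 (25%).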
You can cap the number of registers per thread with the -maxrregcount=N option to nvcc. Sometimes it doesn’t work well because the compiler ends up spilling some variables into slow local memory. You can also force variables into shared memory by replacing them with volatile pointers/references to elements of a shared memory array. The two techniques can be used together.
Lastly, occupancy is not always a critical factor; it depends on the kernel. If you make a lot of accesses to global memory, occupancy matters because more resident threads help hide the latency, but if you don’t, it’s largely irrelevant. Also, 192 threads is actually a fair occupancy, even though it’s “only” 25%. The benefits really drop off after 256 threads.