I have a fairly complex function and I want to reduce the number of registers it uses.
Without touching anything, it compiles to 42 registers. I have tried offloading some of the registers to shared memory, but I can’t get lower than 38 registers (even though I have enough shared memory to offload 15 registers there).
Of course, if I use maxrregcount I can reduce the number of registers used, but only at the expense of spilling to local memory, and in this case performance gets worse rather than better.
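For context, here is a minimal sketch of how the register versus local-memory trade-off can be inspected for a single kernel. The kernel `bigKernel`, its trivial body, and the launch bounds of 256 threads and 4 blocks per multiprocessor are hypothetical placeholders, not my real function. `__launch_bounds__` is shown as a per-kernel alternative to the file-wide maxrregcount, on the assumption that limiting only the one kernel is acceptable.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for the real (much larger) kernel.
// __launch_bounds__(256, 4) tells the compiler the kernel is launched with at
// most 256 threads per block and should fit at least 4 blocks per SM, which
// limits registers per thread for this kernel only.
__global__ void __launch_bounds__(256, 4)
bigKernel(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        out[i] = in[i] * 2.0f + 1.0f;
    }
}

int main()
{
    // Query what the compiler actually allocated for this kernel.
    cudaFuncAttributes attr;
    cudaError_t err = cudaFuncGetAttributes(&attr, bigKernel);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaFuncGetAttributes failed: %s\n",
                cudaGetErrorString(err));
        return 1;
    }

    printf("registers per thread:    %d\n", attr.numRegs);
    // Non-zero local memory usually means register spills (or local arrays).
    printf("local memory per thread: %zu bytes\n", attr.localSizeBytes);
    printf("static shared memory:    %zu bytes\n", attr.sharedSizeBytes);
    return 0;
}
```

Compiling with `nvcc -Xptxas -v` reports the same register count together with spill loads and stores, which makes it easier to compare variants without reading the PTX.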
The PTX is quite hard to follow because the function is so big, and before falling back to straining my eyes over it to see whether I can do better than the compiler, I would like to know if it is worth it.
So, in your experience, how well does the compiler do at keeping register usage to a minimum? Can it be improved by helping it in the CUDA code? Is it worth using raw PTX in these cases?