When I compile my C code with -maxrregcount 16 and -ptxas-options=-v, it reports that 16 registers are used for each kernel in the source, as expected.
When I compile without -maxrregcount, the register counts are higher (18 and 20). However, the .ptx code I get is identical in both cases.
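For concreteness, the two builds being compared would look roughly like the sketch below. The file name kernel.cu and the output names are placeholders, not the actual source, and the flags are written in their long form.

```shell
# Build with the register cap; ptxas prints per-kernel register usage.
nvcc --maxrregcount=16 --ptxas-options=-v -c kernel.cu -o kernel_capped.o

# Build without the cap for comparison.
nvcc --ptxas-options=-v -c kernel.cu -o kernel_default.o

# Emit the intermediate PTX for each configuration and compare them.
nvcc --maxrregcount=16 -ptx kernel.cu -o kernel_capped.ptx
nvcc -ptx kernel.cu -o kernel_default.ptx
diff kernel_capped.ptx kernel_default.ptx
```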
I must be missing something but I don’t see what. I’m trying to reduce the number of registers so that I can get higher occupancy.
Also, the extra storage seems to come from local memory (lmem). Since shared memory is faster, is it possible to force the compiler to use shared memory instead?
Any help would be appreciated.