What are the strategies for reducing the number of registers used in your code? Currently, I have a bit of code that I ported to CUDA and, even after much tinkering, I am unable to get the number of registers used to come down (currently, it stands at 38!), which restricts the number of concurrent threads quite severely, so I am unable to make full use of my card.
I packed a lot of variables into shared memory, but I am still unable to get the register usage into an acceptable range. I looked at breaking my kernel up into multiple kernels, but that seems quite impossible, I am afraid.
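For what it's worth, the sort of shared-memory packing I mean is along these lines; this is only a stripped-down illustration of the general approach, not my actual kernel, and the names and sizes are made up:

// A minimal sketch of packing per-thread variables into shared memory:
// scratch values that would otherwise occupy registers (or spill to local
// memory) are kept in a shared-memory array indexed by threadIdx.x.
#include <cstdio>

#define BLOCK_SIZE 128

__global__ void scratch_in_smem(const float *in, float *out, int n)
{
    __shared__ float scratch[BLOCK_SIZE];      // one slot per thread

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    scratch[threadIdx.x] = in[i] * 2.0f;       // intermediate result lives in
    scratch[threadIdx.x] += in[i] * in[i];     // shared memory, not a named register
    out[i] = scratch[threadIdx.x];
}

int main()
{
    const int n = 1024;
    float h_in[n], h_out[n];
    for (int i = 0; i < n; ++i) h_in[i] = (float)i;

    float *d_in, *d_out;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));
    cudaMemcpy(d_in, h_in, n * sizeof(float), cudaMemcpyHostToDevice);

    scratch_in_smem<<<(n + BLOCK_SIZE - 1) / BLOCK_SIZE, BLOCK_SIZE>>>(d_in, d_out, n);
    cudaMemcpy(h_out, d_out, n * sizeof(float), cudaMemcpyDeviceToHost);

    printf("out[3] = %f\n", h_out[3]);         // 3*2 + 3*3 = 15
    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}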
So, what are the things one should look for when attempting to reduce register usage? Sometimes, I am flummoxed by how the compiler decides to use registers (and there is a difference in optimization between Windows and Linux as well, for the same CUDA version).
Nah, not using doubles. Currently, I am looking into rewriting the code to somehow split the damn thing into multiple kernels, but it is quite tough.
I take it you already know about the -maxrregcount compiler flag (you don’t mention it). I have used compressed integer types with some success before (I seem to recall I had four 8-bit integers packed into an int data type). Apart from that, my only other suggestion would be rearranging code.
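Roughly along these lines; this is a generic sketch of the packing idea rather than my original code, and the kernel, field layout, and the register limit in the compile line are only examples:

// Packing four 8-bit values into one 32-bit int so the compiler only has to
// track a single register-resident variable instead of four.
// Build with e.g.:  nvcc -Xptxas -v -maxrregcount=32 packed.cu
// (-Xptxas -v reports per-kernel register usage, -maxrregcount caps it)
#include <cstdio>

__device__ __forceinline__ int pack4(int a, int b, int c, int d)
{
    return (a & 0xFF) | ((b & 0xFF) << 8) | ((c & 0xFF) << 16) | ((d & 0xFF) << 24);
}

__device__ __forceinline__ int unpack(int packed, int slot)   // slot 0..3
{
    return (packed >> (slot * 8)) & 0xFF;
}

__global__ void packed_counters(int *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    // Four small per-thread quantities kept in one int instead of four ints.
    int state = pack4(i & 0xFF, 1, 2, 3);
    out[i] = unpack(state, 0) + unpack(state, 1) + unpack(state, 2) + unpack(state, 3);
}

int main()
{
    const int n = 256;
    int *d_out, h_out[n];
    cudaMalloc(&d_out, n * sizeof(int));
    packed_counters<<<1, n>>>(d_out, n);
    cudaMemcpy(h_out, d_out, n * sizeof(int), cudaMemcpyDeviceToHost);
    printf("out[10] = %d\n", h_out[10]);   // 10 + 1 + 2 + 3 = 16
    cudaFree(d_out);
    return 0;
}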
Declare the per-thread variables (registers and local memory arrays) as volatile. It has helped me reduce register usage by a significant amount… every time I have tried it.
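In practice that just means something like the following; this is a minimal, made-up kernel only meant to show the declaration, and whether it actually helps depends on the toolkit and architecture (it can also cost performance by forcing extra loads):

// Per-thread scalars declared volatile, which discourages the compiler from
// caching/duplicating them across registers.
#include <cstdio>

__global__ void volatile_locals(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    volatile float a = in[i];      // per-thread scalars declared volatile
    volatile float b = a * 0.5f;
    out[i] = a + b;
}

int main()
{
    const int n = 64;
    float h_in[n], h_out[n];
    for (int i = 0; i < n; ++i) h_in[i] = (float)i;

    float *d_in, *d_out;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));
    cudaMemcpy(d_in, h_in, n * sizeof(float), cudaMemcpyHostToDevice);

    volatile_locals<<<1, n>>>(d_in, d_out, n);
    cudaMemcpy(h_out, d_out, n * sizeof(float), cudaMemcpyDeviceToHost);

    printf("out[2] = %f\n", h_out[2]);   // 2 + 1 = 3
    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}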
I seem to recall that using one variable and doing bitwise operations to access it used fewer registers than declaring many variables, though I think I also tried it with many variables at one point and it didn't make a difference. Can't find the code, unfortunately. Might have just been my special case.
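The general shape was something like this, if memory serves; the kernel and the flag meanings here are purely illustrative, since I can't find the original:

// Several per-thread boolean flags stored as bits of a single int, accessed
// with bitwise operations, instead of separate bool variables.
#include <cstdio>

#define FLAG_NEGATIVE (1 << 0)
#define FLAG_LARGE    (1 << 1)
#define FLAG_ODD      (1 << 2)

__global__ void flag_bits(const int *in, int *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    int flags = 0;                          // one int instead of three bools
    if (in[i] < 0)    flags |= FLAG_NEGATIVE;
    if (in[i] > 1000) flags |= FLAG_LARGE;
    if (in[i] & 1)    flags |= FLAG_ODD;

    out[i] = flags;
}

int main()
{
    const int n = 4;
    int h_in[n] = { -5, 2000, 7, 4 };
    int h_out[n];
    int *d_in, *d_out;
    cudaMalloc(&d_in, n * sizeof(int));
    cudaMalloc(&d_out, n * sizeof(int));
    cudaMemcpy(d_in, h_in, n * sizeof(int), cudaMemcpyHostToDevice);

    flag_bits<<<1, n>>>(d_in, d_out, n);
    cudaMemcpy(h_out, d_out, n * sizeof(int), cudaMemcpyDeviceToHost);

    for (int i = 0; i < n; ++i) printf("flags[%d] = %d\n", i, h_out[i]);
    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}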