What are the strategies for reducing the number of registers used by a CUDA kernel? I have some code that I ported to CUDA and, even after much tinkering, I cannot get the register count down (it currently stands at 38 per thread!). This severely restricts the number of concurrent threads, so I am unable to make full use of my card.
I moved a lot of variables into shared memory, but I still cannot get the register usage into an acceptable range. I also looked at breaking up my kernel into multiple kernels, but I am afraid that seems quite impossible.
So, what are the things one should look for when attempting to reduce register usage? Sometimes I am flummoxed by how the compiler decides to use registers (and the optimization also differs between Windows and Linux for the same CUDA version).
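
For reference, here is a minimal sketch of how the per-thread register count discussed above can be read back at runtime and how a launch bound can be declared. It is an illustration under stated assumptions, not the kernel from this question: `scaleKernel` and the bound of 256 threads per block are hypothetical placeholders. `cudaFuncGetAttributes` and `__launch_bounds__` are standard CUDA, and the same numbers should also be printed at compile time with `nvcc -Xptxas -v`.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in kernel (assumption: not the real code from this post).
// __launch_bounds__(256) promises the compiler that the kernel is never
// launched with more than 256 threads per block, so it can budget registers
// per thread with that limit in mind.
__global__ void __launch_bounds__(256) scaleKernel(float *data, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        data[i] *= factor;
    }
}

int main()
{
    // Query the resources the compiler assigned to the kernel.
    cudaFuncAttributes attr;
    cudaError_t err = cudaFuncGetAttributes(&attr, scaleKernel);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaFuncGetAttributes failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    printf("registers per thread:    %d\n", attr.numRegs);
    printf("static shared memory:    %zu bytes\n", attr.sharedSizeBytes);
    // A non-zero value here means the kernel uses per-thread local memory,
    // which is where spilled registers end up.
    printf("local memory per thread: %zu bytes\n", attr.localSizeBytes);
    return 0;
}
```

Compiling with `nvcc -maxrregcount=N` is another way to cap registers for a whole file; if the cap is too low, the compiler spills to local memory, which would show up in the `localSizeBytes` value printed above.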