Here is one thing that I don’t quite understand: earlier GPUs did not support real function calls (every function was inlined), but Fermi GPUs do. Can I therefore reduce register usage by splitting a kernel into multiple functions? Variables outside the function's scope are not used while the threads are inside that function. The CUDA C Best Practices Guide says nothing about function calls (I checked section 4.2).
You’ll have to use cuobjdump to see how function calls are actually handled in the generated code. If a call involves saving registers into local memory or something like that, weigh that cost against whatever registers you save.
However, it’s worth noting that a variable that is no longer needed in later parts of a kernel will usually not keep occupying a register. ptxas knows how to eliminate this kind of waste at compile time.
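If you want to measure this rather than guess, here is a minimal sketch of an experiment. It assumes a toolkit that supports the `__noinline__` and `__forceinline__` qualifiers. The names `poly_call`, `poly_inline`, `kernel_call`, `kernel_inline` and `count_mismatches` are made up for illustration and come from nowhere in particular. The two kernels do the same work, once through a real function call and once inlined, so the register and spill figures reported by ptxas can be compared side by side.

```cuda
// Build (substitute your GPU's architecture; sm_20 is Fermi, and newer
// toolkits have dropped it):
//   nvcc -arch=sm_20 -Xptxas -v demo.cu -o demo
// Inspect the generated machine code:
//   cuobjdump -sass demo
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical helper. __noinline__ asks the compiler to keep a real call.
__device__ __noinline__ unsigned poly_call(unsigned x)
{
    unsigned a = x * x;   // temporaries that exist only inside the function
    unsigned b = a * x;
    return a + 2u * b + 3u;
}

// Same body, but forced inline into the caller.
__device__ __forceinline__ unsigned poly_inline(unsigned x)
{
    unsigned a = x * x;
    unsigned b = a * x;
    return a + 2u * b + 3u;
}

__global__ void kernel_call(const unsigned *in, unsigned *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = poly_call(in[i]);
}

__global__ void kernel_inline(const unsigned *in, unsigned *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = poly_inline(in[i]);
}

// Runs both kernels on the same input. Returns the number of elements whose
// results differ, or -1 if any CUDA call fails.
int count_mismatches(int n)
{
    std::vector<unsigned> h_in(n), h_a(n), h_b(n);
    for (int i = 0; i < n; ++i) h_in[i] = (unsigned)i;

    unsigned *d_in = 0, *d_a = 0, *d_b = 0;
    size_t bytes = n * sizeof(unsigned);
    bool ok = cudaMalloc(&d_in, bytes) == cudaSuccess
           && cudaMalloc(&d_a, bytes) == cudaSuccess
           && cudaMalloc(&d_b, bytes) == cudaSuccess
           && cudaMemcpy(d_in, &h_in[0], bytes, cudaMemcpyHostToDevice) == cudaSuccess;

    if (ok) {
        int threads = 128;
        int blocks = (n + threads - 1) / threads;
        kernel_call<<<blocks, threads>>>(d_in, d_a, n);
        kernel_inline<<<blocks, threads>>>(d_in, d_b, n);
        ok = cudaGetLastError() == cudaSuccess
          && cudaMemcpy(&h_a[0], d_a, bytes, cudaMemcpyDeviceToHost) == cudaSuccess
          && cudaMemcpy(&h_b[0], d_b, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    }

    cudaFree(d_in);
    cudaFree(d_a);
    cudaFree(d_b);
    if (!ok) return -1;

    int mismatches = 0;
    for (int i = 0; i < n; ++i)
        if (h_a[i] != h_b[i]) ++mismatches;
    return mismatches;
}

int main()
{
    std::printf("mismatches: %d\n", count_mismatches(1024));
    return 0;
}
```

With `-Xptxas -v`, ptxas prints the register count and any spill stores/loads for each kernel, which is the trade-off mentioned above. A function this small is unlikely to show a meaningful difference. The point is only the method, so swap in your own kernel body before drawing conclusions.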