How do you do computation using only registers?

I’ve read papers where the authors optimize CUDA code by loading values into registers and doing computations mostly on registers. This is usually faster than using shared memory. But how would one actually implement this? How do you make a CUDA program do most of its computation on registers?

Here is an example of such a paper:

For the most part, a CUDA programmer has to do nothing. Just let the compiler optimizations “do their thing”. By default, thread-local variables are placed in local memory, but when compiling with full optimizations (which is the nvcc default), those variables will usually be pulled into registers. Even small thread-local arrays can be scalarized and placed in registers, provided all indexing into them can be resolved at compile time.
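As a minimal sketch of the array case: in the kernel below, each thread copies a few values into a small thread-local array and works on them with fully unrolled loops, so every index is a compile-time constant after unrolling. The names `ITEMS`, `center_rows_kernel` and `center_rows` are made up for illustration, and whether the array really ends up in registers depends on the compiler version and target architecture, so treat this as a pattern that usually works rather than a guarantee.

```cuda
#include <cuda_runtime.h>
#include <vector>

constexpr int ITEMS = 4;  // hypothetical per-thread row length, known at compile time

// Each thread subtracts the mean of one row of ITEMS values from that row.
__global__ void center_rows_kernel(const float* in, float* out, int rows)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= rows) return;

    // Thread-local array. After unrolling, every index below is a constant,
    // which is what allows the compiler to map v[0..ITEMS-1] to registers.
    float v[ITEMS];

    #pragma unroll
    for (int k = 0; k < ITEMS; ++k) v[k] = in[row * ITEMS + k];

    float sum = 0.0f;
    #pragma unroll
    for (int k = 0; k < ITEMS; ++k) sum += v[k];
    float mean = sum / ITEMS;

    #pragma unroll
    for (int k = 0; k < ITEMS; ++k) out[row * ITEMS + k] = v[k] - mean;
}

// Hypothetical host helper. Assumes h_in is non-empty and its size is a
// multiple of ITEMS. Error checking is omitted to keep the sketch short.
std::vector<float> center_rows(const std::vector<float>& h_in)
{
    int rows = static_cast<int>(h_in.size()) / ITEMS;
    size_t bytes = h_in.size() * sizeof(float);
    std::vector<float> h_out(h_in.size());

    float *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, bytes);
    cudaMalloc(&d_out, bytes);
    cudaMemcpy(d_in, h_in.data(), bytes, cudaMemcpyHostToDevice);

    int threads = 128;
    int blocks = (rows + threads - 1) / threads;
    center_rows_kernel<<<blocks, threads>>>(d_in, d_out, rows);

    // This copy runs after the kernel because both use the default stream.
    cudaMemcpy(h_out.data(), d_out, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return h_out;
}
```

If the loop bounds or indices depended on runtime data (say `v[threadIdx.x % ITEMS]`), the compiler would generally have to keep `v` in local memory, since registers are not addressable.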

While the compiler does a lot of automatic loop unrolling and function inlining to exploit the use of registers to the fullest, at times CUDA programmers can help by forcing function inlining (__forceinline__ attribute) and full unrolling of loops (#pragma unroll). I have personally used manual scalarization on rare occasions, for small matrix operations up to about 10x10 int or float elements. Any manual interference with the compiler heuristics should be checked by machine code inspection (cuobjdump --dump-sass) and profiling to make sure it is not counterproductive.
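To illustrate the two hints mentioned above, here is a hedged sketch of a per-thread 3x3 matrix-vector product. The helper `dot3` is marked `__forceinline__` so that the caller's arrays are not passed by address to a separate function (which would normally force them into local memory), and every loop carries `#pragma unroll`. All names here (`dot3`, `matvec3_kernel`, `matvec3_batch`) are invented for the example, and the compiler would quite possibly inline and unroll code this small without being asked.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Forced inlining lets the compiler see the caller's arrays directly
// instead of receiving them through a pointer or reference.
__device__ __forceinline__ float dot3(const float (&a)[3], const float (&b)[3])
{
    float s = 0.0f;
    #pragma unroll
    for (int k = 0; k < 3; ++k) s += a[k] * b[k];
    return s;
}

// Thread i computes out_i = M_i * x_i for its own row-major 3x3 matrix.
__global__ void matvec3_kernel(const float* mats, const float* vecs,
                               float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float m[3][3];  // small thread-local matrix
    float x[3];     // small thread-local vector

    #pragma unroll
    for (int r = 0; r < 3; ++r) {
        x[r] = vecs[i * 3 + r];
        #pragma unroll
        for (int c = 0; c < 3; ++c) m[r][c] = mats[i * 9 + r * 3 + c];
    }

    #pragma unroll
    for (int r = 0; r < 3; ++r) out[i * 3 + r] = dot3(m[r], x);
}

// Hypothetical host helper. Assumes mats holds 9 floats and vecs 3 floats
// per problem, with at least one problem. Error checking is omitted.
std::vector<float> matvec3_batch(const std::vector<float>& mats,
                                 const std::vector<float>& vecs)
{
    int n = static_cast<int>(vecs.size()) / 3;
    std::vector<float> h_out(vecs.size());

    float *d_m = nullptr, *d_v = nullptr, *d_o = nullptr;
    cudaMalloc(&d_m, mats.size() * sizeof(float));
    cudaMalloc(&d_v, vecs.size() * sizeof(float));
    cudaMalloc(&d_o, vecs.size() * sizeof(float));
    cudaMemcpy(d_m, mats.data(), mats.size() * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_v, vecs.data(), vecs.size() * sizeof(float), cudaMemcpyHostToDevice);

    int threads = 128;
    matvec3_kernel<<<(n + threads - 1) / threads, threads>>>(d_m, d_v, d_o, n);

    cudaMemcpy(h_out.data(), d_o, vecs.size() * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_m);
    cudaFree(d_v);
    cudaFree(d_o);
    return h_out;
}
```

As with any such hint, it is worth comparing the SASS and the timings with and without the attribute and the pragmas before keeping them.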

The number of registers available to each thread on GPUs is variable, not fixed as in common CPU architectures, with up to 255 32-bit registers available per thread on all GPU architectures supported by recent versions of CUDA. Therefore total register storage available may exceed what is available on CPUs, depending on the kind of computation. Nonetheless there may be fewer registers provided by hardware than the CUDA compiler wants to use, in which case spilling of register contents back to local memory can occur. The compiler can report when this happens (-Xptxas --verbose).
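Besides the compile-time report, the same information can be read at run time. The sketch below uses the CUDA runtime call `cudaFuncGetAttributes` to query how many registers a compiled kernel uses per thread and how much local memory it needs; `saxpy_kernel` and `query_kernel_resources` are placeholder names for the example. Note that a nonzero `localSizeBytes` does not by itself prove register spilling: it also covers thread-local arrays that the compiler could not place in registers.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Placeholder kernel whose resource usage we want to inspect.
__global__ void saxpy_kernel(float a, const float* x, float* y, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

// Hypothetical helper: fills *attr for saxpy_kernel and prints the two
// fields relevant to register usage. Returns false if the query fails.
bool query_kernel_resources(cudaFuncAttributes* attr)
{
    cudaError_t err = cudaFuncGetAttributes(attr, saxpy_kernel);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaFuncGetAttributes failed: %s\n",
                cudaGetErrorString(err));
        return false;
    }
    printf("registers per thread:    %d\n", attr->numRegs);
    printf("local memory per thread: %zu bytes\n", attr->localSizeBytes);
    return true;
}
```

Calling `query_kernel_resources(&attr)` from host code with a `cudaFuncAttributes attr;` prints the figures for whatever architecture the kernel was compiled for on the current device.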

Here is an example of refactoring a matrix transpose from an algorithm whose thread-local array the compiler could not “registerize” into one where it could.

As stated there, this may not be a sensible thing to do (matrix transpose), so I point it out for instructional purposes only.