Why should the number of registers be kept low?

I came across the discussion in this thread on nvidia forum where it is stated that

I would like to know why the number of registers should be kept low. What is the effect of the number of registers on performance?

If a kernel needs more registers per thread than the device can provide, the compiler has to keep the extra values somewhere else. That place is local memory, which physically lives in device RAM (the same memory that backs global memory). Saving and reloading those values adds latency. This is called a “spill”.
On newer GPUs the L2 cache will speed this up a bit, but at the cost of using up cache lines.
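
One way to check whether a kernel spills is to ask the runtime what the compiled kernel actually uses. The sketch below relies on cudaFuncGetAttributes, which reports registers per thread (numRegs) and local memory per thread (localSizeBytes). The kernel scaleKernel and the two helper functions are made up for illustration. Compiling with nvcc -Xptxas -v prints similar numbers at build time.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel, here only so there is something to inspect.
__global__ void scaleKernel(float *data, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// True if the compiled kernel uses any per-thread local memory.
// Local memory holds spilled registers, but also things such as
// dynamically indexed local arrays, so treat this as a hint only.
bool usesLocalMemory(const cudaFuncAttributes &attr)
{
    return attr.localSizeBytes > 0;
}

// Print registers and local memory per thread for scaleKernel.
// Returns the CUDA error code so the caller can react to failures.
cudaError_t reportScaleKernelResources()
{
    cudaFuncAttributes attr;
    cudaError_t err = cudaFuncGetAttributes(&attr, scaleKernel);
    if (err != cudaSuccess) {
        std::fprintf(stderr, "cudaFuncGetAttributes: %s\n",
                     cudaGetErrorString(err));
        return err;
    }
    std::printf("registers per thread:    %d\n", attr.numRegs);
    std::printf("local memory per thread: %zu bytes%s\n",
                attr.localSizeBytes,
                usesLocalMemory(attr) ? " (possible spill)" : "");
    return cudaSuccess;
}
```

Calling reportScaleKernelResources() once from main is enough. A non-zero local memory size does not prove a spill by itself, so it is worth comparing against the ptxas output as well.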

NB: in another thread someone said that having a few less frequently used registers “spill” isn’t a big problem, and may be faster than code that is made more complex in order to use fewer registers.
Someone also said it is a good idea to write your code so that you can easily change the block size and try it with different sizes.
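
A minimal sketch of that last suggestion, assuming a kernel that is correct for any block size: the block size is an ordinary runtime parameter, and each candidate is timed with CUDA events. The kernel addOneKernel, the helper names, and the list of sizes are all illustrative choices, not taken from the thread.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel; the bounds check makes it valid for any block size.
__global__ void addOneKernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

// Number of blocks needed to cover n elements (rounds up).
int gridSizeFor(int n, int blockSize)
{
    return (n + blockSize - 1) / blockSize;
}

// Time one launch with the given block size, in milliseconds.
// Returns a negative value if the launch or the timing fails.
float timeLaunch(float *d_data, int n, int blockSize)
{
    cudaEvent_t start, stop;
    if (cudaEventCreate(&start) != cudaSuccess) return -1.0f;
    if (cudaEventCreate(&stop) != cudaSuccess) {
        cudaEventDestroy(start);
        return -1.0f;
    }

    cudaEventRecord(start);
    addOneKernel<<<gridSizeFor(n, blockSize), blockSize>>>(d_data, n);
    cudaEventRecord(stop);
    cudaError_t syncErr = cudaEventSynchronize(stop);
    cudaError_t launchErr = cudaGetLastError();

    float ms = -1.0f;
    if (syncErr == cudaSuccess && launchErr == cudaSuccess)
        cudaEventElapsedTime(&ms, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

// Try a few block sizes on device buffer d_data and print each time.
void sweepBlockSizes(float *d_data, int n)
{
    const int sizes[] = {64, 128, 256, 512, 1024};
    for (int bs : sizes)
        std::printf("block size %4d: %.3f ms\n", bs, timeLaunch(d_data, n, bs));
}
```

The first launch usually includes one-time start-up cost, so in practice it helps to do a warm-up launch and average several runs per size. A negative time means that block size was rejected by the device.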

The paper “Demystifying GPU Microarchitecture through Microbenchmarking” shows the relationship between the number of threads and the maximum number of registers each thread can actually use.



Can I trust this paper? “Demystifying GPU Microarchitecture through Microbenchmarking”

I have not seen it cited anywhere and I cannot find any critique of it at all.


I think one always has to try different versions and see what happens. In some cases you can automate the process. Often the optimizations compete with each other. If you optimize for a low register count you can run more threads, but at the same time the spills or extra instructions this causes can have the opposite effect and make the kernel slower.

With more registers per thread, fewer threads can run concurrently => not enough threads to hide latency => processors waste time idling.
With fewer registers per thread, spilling (reads and writes to memory) occurs. Also, the compiler might decide to recompute certain values instead of keeping temporary results in registers.
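
The first effect is simple arithmetic: the register file of a multiprocessor is shared by all resident threads, so registers per thread put a ceiling on how many threads fit. The sketch below computes that ceiling from the device properties regsPerMultiprocessor and maxThreadsPerMultiProcessor. The function names and the list of register counts are made up for illustration, and the calculation ignores register allocation granularity and other limits such as shared memory, so real occupancy can be lower.

```cuda
#include <algorithm>
#include <cstdio>
#include <cuda_runtime.h>

// Upper bound on resident threads per multiprocessor imposed by registers.
// Simplified: no allocation granularity, no shared memory or block limits.
int threadsLimitedByRegisters(int regsPerSM, int regsPerThread,
                              int maxThreadsPerSM)
{
    if (regsPerThread <= 0) return maxThreadsPerSM;
    return std::min(maxThreadsPerSM, regsPerSM / regsPerThread);
}

// Print the register-imposed thread limit for a few register counts,
// using the real register file size of the given device.
void printRegisterLimits(int device)
{
    cudaDeviceProp prop;
    cudaError_t err = cudaGetDeviceProperties(&prop, device);
    if (err != cudaSuccess) {
        std::fprintf(stderr, "cudaGetDeviceProperties: %s\n",
                     cudaGetErrorString(err));
        return;
    }
    const int counts[] = {16, 32, 64, 128};
    for (int regs : counts) {
        int limit = threadsLimitedByRegisters(prop.regsPerMultiprocessor, regs,
                                              prop.maxThreadsPerMultiProcessor);
        std::printf("%3d registers/thread -> at most %d threads per SM\n",
                    regs, limit);
    }
}
```

As an example with assumed numbers, a multiprocessor with 65536 registers and a 2048-thread limit can hold all 2048 threads at 32 registers per thread, but only 1024 at 64 registers per thread.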

So there is a tradeoff involved, which depends on the kernel. In my experience, a little register spill is ok when more than 24-32 registers are needed per thread.
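
To experiment with this tradeoff, one option is the __launch_bounds__ qualifier, which tells the compiler the largest block size a kernel will be launched with and, optionally, how many blocks per multiprocessor you would like resident. The compiler may then cap register use, spilling if it has to. The sketch below is only an illustration: the kernel, the helper, and the values 256 and 4 are assumptions, not recommendations. The nvcc option -maxrregcount=N is a coarser alternative that applies to a whole compilation unit.

```cuda
#include <cuda_runtime.h>

constexpr int kMaxThreadsPerBlock = 256;  // assumed largest block size
constexpr int kMinBlocksPerSM = 4;        // assumed residency target

// Hypothetical kernel. The launch bounds let the compiler limit registers
// so that kMinBlocksPerSM blocks of kMaxThreadsPerBlock threads can fit.
__global__ void __launch_bounds__(kMaxThreadsPerBlock, kMinBlocksPerSM)
boundedKernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * data[i] + 1.0f;
}

// A launch with more threads per block than the declared bound fails,
// so it is worth checking the block size before launching.
bool blockSizeAllowed(int blockSize)
{
    return blockSize > 0 && blockSize <= kMaxThreadsPerBlock;
}
```

Building the same kernel with and without the bounds, and comparing both the ptxas register report and the measured run time, shows whether the extra threads pay for the spills in your case.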

Things are not so clear-cut. Sometimes lower occupancy can give better performance. Search for the presentation “Better Performance at Lower Occupancy” by Vasily Volkov.