Easy way to decrease register usage: a proposal


I am having serious problems with register usage explosion. I have read
all the messages on the subject, and it is clear that the optimization of register
usage is complex and takes many things into account (bank conflicts, etc.).
However, while you decrease bank conflicts, you might increase bandwidth
in other parts of the code.

I would like to propose a simple solution to give the users “some” control.
Perhaps there could be compiler directives inserted into the CUDA code, such as



The compiler would only optimize register usage between such pairs of statements.
In the absence of such statements, the compiler would optimize the entire
code. This should be very easy to do since the optimizer more than likely
optimizes code between some initial statement and some end statements. The point is
that optimization could not cross these “barrier” statements.

Could somebody please comment on this idea’s feasibility? This would help many people
working with complex simulations. Thanks.


Can anybody tell me why my proposal is impractical, not doable, etc.? Some feedback from the experts would be nice. Or some support :-)


While you wait for an expert to take notice of this thread, you might want to see if you can construct a kernel (or pair of kernels) which shows this huge jump in number of registers. I can almost guarantee that the NVIDIA folks will want to see that first. :)

Well, maybe it’s practical, but it’s not common practice to put such compiler control statements in comments… :D
Usually we would use #pragma blablabla to control compiler optimization flags. You can specify the optimization level this way, so I believe register allocation control could be specified the same way. However, the actual problem here is not how to define or control such an optimization but how to do this kind of register optimization in the first place. I believe they already have some smart way to solve this and just need some time to implement it. Let’s wait for CUDA 3.0! (Is it too far far far away? :D)
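For reference, nvcc already accepts one pragma that can influence register count indirectly: `#pragma unroll`. Below is a minimal sketch, assuming a made-up kernel (`sum_terms`, `TERMS`, and `run_sum_terms` are hypothetical names, not from this thread). Whether it actually lowers register usage depends on the compiler version and the target GPU, so compare the ptxas report (`nvcc --ptxas-options=-v`) with and without the pragma.

```cuda
#include <cuda_runtime.h>
#include <vector>

constexpr int TERMS = 8;  // hypothetical: number of inputs summed per thread

__global__ void sum_terms(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float acc = 0.0f;
    // Existing nvcc directive: a factor of 1 asks the compiler not to unroll
    // this loop. Unrolled loops can keep more values live in registers.
    #pragma unroll 1
    for (int k = 0; k < TERMS; ++k)
        acc += in[i * TERMS + k];
    out[i] = acc;
}

// Host helper: returns the number of wrong outputs, or -1 on a CUDA error.
int run_sum_terms(int n)
{
    std::vector<float> h_in(static_cast<size_t>(n) * TERMS, 1.0f);
    std::vector<float> h_out(n, 0.0f);
    float* d_in = nullptr;
    float* d_out = nullptr;
    if (cudaMalloc(&d_in, h_in.size() * sizeof(float)) != cudaSuccess) return -1;
    if (cudaMalloc(&d_out, h_out.size() * sizeof(float)) != cudaSuccess) {
        cudaFree(d_in);
        return -1;
    }
    cudaMemcpy(d_in, h_in.data(), h_in.size() * sizeof(float),
               cudaMemcpyHostToDevice);

    int threads = 128;
    int blocks = (n + threads - 1) / threads;
    sum_terms<<<blocks, threads>>>(d_in, d_out, n);

    // This copy also waits for the kernel to finish.
    cudaError_t err = cudaMemcpy(h_out.data(), d_out,
                                 h_out.size() * sizeof(float),
                                 cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    if (err != cudaSuccess) return -1;

    // Every input is 1.0f, so each output should be exactly TERMS.
    int bad = 0;
    for (int i = 0; i < n; ++i)
        if (h_out[i] != static_cast<float>(TERMS)) ++bad;
    return bad;
}
```

Note that this pragma only controls loop unrolling; it is not a general register-allocation directive of the kind proposed above.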

Thanks for the two replies. I don’t mind constructing an example kernel, but by definition it will have to be a rather large kernel. Register explosion typically happens only with large problems. It has been reported in this forum before. When calling a method multiple times, register usage is often greater than when calling it once. There has been discussion of using volatile to decrease usage (it has for me), and I even got a 30 percent speedup with the NVCC compiler option.

However, these techniques are very hardware dependent.
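As an illustration of the volatile technique, here is a minimal sketch, assuming a made-up kernel (`scale_bias` and `run_scale_bias` are hypothetical names). Marking an intermediate as `volatile` discourages the compiler from folding or recomputing it at each use. Whether that reduces the register count is not guaranteed; it depends on the compiler version and the GPU, so check with `nvcc --ptxas-options=-v`. Separately, nvcc's `-maxrregcount=N` option is one existing way to cap registers per thread (not necessarily the option referred to above).

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical kernel: out[i] = a * in[i] + b.
__global__ void scale_bias(const float* in, float* out, int n, float a, float b)
{
    // volatile: the index is computed once and re-read at each use instead of
    // being recomputed or folded into later expressions.
    volatile int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = a * in[i] + b;
}

// Host helper: returns the number of wrong outputs, or -1 on a CUDA error.
int run_scale_bias(int n)
{
    std::vector<float> h_in(n), h_out(n, 0.0f);
    for (int i = 0; i < n; ++i) h_in[i] = static_cast<float>(i);

    float* d_in = nullptr;
    float* d_out = nullptr;
    size_t bytes = static_cast<size_t>(n) * sizeof(float);
    if (cudaMalloc(&d_in, bytes) != cudaSuccess) return -1;
    if (cudaMalloc(&d_out, bytes) != cudaSuccess) {
        cudaFree(d_in);
        return -1;
    }
    cudaMemcpy(d_in, h_in.data(), bytes, cudaMemcpyHostToDevice);

    int threads = 128;
    int blocks = (n + threads - 1) / threads;
    scale_bias<<<blocks, threads>>>(d_in, d_out, n, 2.0f, 1.0f);

    // This copy also waits for the kernel to finish.
    cudaError_t err = cudaMemcpy(h_out.data(), d_out, bytes,
                                 cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    if (err != cudaSuccess) return -1;

    // Small integers keep 2*i + 1 exact in float, with or without fused
    // multiply-add, so an exact comparison is safe here.
    int bad = 0;
    for (int i = 0; i < n; ++i)
        if (h_out[i] != 2.0f * h_in[i] + 1.0f) ++bad;
    return bad;
}
```

The kernel here is far too small to show a register explosion; it only shows where the qualifier goes.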

I know that optimization comments (or pragmas) inside code are not common
practice. However, languages like CUDA, with explicit control of so many
threads, are also not common practice. So common practice is not the reference.
There is precedent: OpenMP works through compiler directives. Basically,
I was just offering a quick solution.

Thanks,