Understanding kernel registers

andrew.stephens · April 23, 2024, 11:45am

I’m new to CUDA and am trying to understand the concept of a kernel register, and how (if?) these relate to variables. Am I right in saying that each local variable defined in a kernel broadly takes up one register? I have a particularly complex kernel where I use a lot of variables, and I’ve heard there is a 64-register limit? Many of these variables are only there for readability, for example in this contrived example:

int i = lookupTable[0];
int j = lookupTable[1];
int x = buffer[i];
int y = buffer[j];

Versus:

int x = buffer[lookupTable[0]];
int y = buffer[lookupTable[1]];

Would the latter version result in fewer registers being used, and if so would the compiler optimise this sort of thing anyway?

As a second example, I might have a series of “if” or “for” blocks, each declaring a variable “i” in its inner scope. Would it improve matters if I were to declare “i” once at the start of the kernel and then reuse it in these different code blocks?

What else in the kernel contributes to register use? Arguments? Anything else?

Robert_Crovella · April 23, 2024, 2:01pm

Registers in CUDA GPUs are 32 bits wide, so my comments apply to 32-bit variables, like int, float, etc. A 64-bit variable like double requires 2 registers, when used.

A variable will occupy a register when it is actually being used for calculations. At other points during your kernel execution it might occupy a register, or it might not. Some of this is due to the compiler optimizing things, and reusing registers, and some of this is due to the fact that local variables, at least, may be temporarily stored in device memory, not in a register, subject to compiler decisions. There are numerous forum questions discussing these topics.
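A minimal sketch of how one might check the register reuse described above, assuming a CUDA-capable device and the runtime API. The two kernels (`scopedCounters`, `sharedCounter`) and the helper `registersPerThread` are hypothetical names made up for illustration. One kernel declares `i` separately in each inner scope, the other declares it once, and `cudaFuncGetAttributes` reports the registers and local (device) memory each compiled kernel uses per thread. The numbers depend on CUDA version, GPU architecture target, and compiler flags, so none are quoted here.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: "i" is declared separately inside each inner scope.
__global__ void scopedCounters(const int *in, int *out, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n) return;

    int sum = 0;
    for (int i = 0; i < 4; ++i) { sum += in[(tid + i) % n]; }
    if (sum > 0) {
        int i = sum * 2;
        sum += i;
    }
    for (int i = 0; i < 4; ++i) { sum -= in[(tid + 2 * i) % n]; }
    out[tid] = sum;
}

// Same logic, but "i" is declared once and reused in every block.
__global__ void sharedCounter(const int *in, int *out, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n) return;

    int i;
    int sum = 0;
    for (i = 0; i < 4; ++i) { sum += in[(tid + i) % n]; }
    if (sum > 0) {
        i = sum * 2;
        sum += i;
    }
    for (i = 0; i < 4; ++i) { sum -= in[(tid + 2 * i) % n]; }
    out[tid] = sum;
}

// Returns the registers per thread for a compiled kernel, or -1 if the
// query fails (for example, when no CUDA device is available).
// Also prints the per-thread local memory, which is where spilled
// variables end up.
static int registersPerThread(const char *name, const void *kernel)
{
    cudaFuncAttributes attr;
    cudaError_t err = cudaFuncGetAttributes(&attr, kernel);
    if (err != cudaSuccess) {
        printf("%s: query failed: %s\n", name, cudaGetErrorString(err));
        return -1;
    }
    printf("%s: %d registers/thread, %zu bytes local memory/thread\n",
           name, attr.numRegs, attr.localSizeBytes);
    return attr.numRegs;
}

int main()
{
    // The kernels are never launched; only their compiled attributes are read.
    registersPerThread("scopedCounters", (const void *)scopedCounters);
    registersPerThread("sharedCounter", (const void *)sharedCounter);
    return 0;
}
```

If the two kernels report the same register count, that is consistent with the compiler treating the scoped and hoisted declarations alike, though only the SASS can confirm what actually happened.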

It’s difficult to analyze code statically for this sort of thing. The usual suggestion is to study the SASS code if you want to know whether and when a register is used for a particular purpose, or what the compiler has “optimized”. CUDA has binary utilities that generally allow you to study the SASS, and some of these tools can also give a “register liveness” view, which lets you see the “lifetime” of a register, often for a specific purpose. Again, there are numerous forum questions on these topics. (In my opinion, there is unlikely to be an important difference in the code the compiler generates for the two source variants you have offered, but that is a rather uncertain statement without a complete example and case to inspect. If you want to be certain, it is necessary to study the SASS with a complete case.)
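As one way to build such a complete case, here is a sketch that wraps the two source variants from the question in hypothetical kernels (`withTemporaries`, `withoutTemporaries`; the `out` parameter and the helper `kernelRegisters` are also assumptions added so the loads are not optimized away and the result can be printed). The comments list the standard tool invocations for inspecting the result. The architecture flag shown is only an example.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Build and inspect (sm_80 is just an example target):
//   nvcc -arch=sm_80 -Xptxas -v -o regs regs.cu   // ptxas prints register usage per kernel
//   cuobjdump -sass ./regs                        // dumps the SASS for each kernel

// Variant 1 from the question: named temporaries for readability.
__global__ void withTemporaries(const int *lookupTable, const int *buffer, int *out)
{
    int i = lookupTable[0];
    int j = lookupTable[1];
    int x = buffer[i];
    int y = buffer[j];
    out[threadIdx.x] = x + y;  // store the result so the loads are kept
}

// Variant 2 from the question: the lookups are written inline.
__global__ void withoutTemporaries(const int *lookupTable, const int *buffer, int *out)
{
    int x = buffer[lookupTable[0]];
    int y = buffer[lookupTable[1]];
    out[threadIdx.x] = x + y;
}

// Returns registers per thread for a compiled kernel, or -1 if the query
// fails (for example, when no CUDA device is present).
static int kernelRegisters(const void *kernel)
{
    cudaFuncAttributes attr;
    if (cudaFuncGetAttributes(&attr, kernel) != cudaSuccess) return -1;
    return attr.numRegs;
}

int main()
{
    int a = kernelRegisters((const void *)withTemporaries);
    int b = kernelRegisters((const void *)withoutTemporaries);
    printf("withTemporaries:    %d registers/thread\n", a);
    printf("withoutTemporaries: %d registers/thread\n", b);
    return 0;
}
```

A matching register count is only a hint. Comparing the two SASS listings from `cuobjdump -sass` instruction by instruction is what shows whether the compiler generated the same code for both variants.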

The final determinant of all this is the SASS code. The compiler aggressively optimizes things, and occasionally misses obvious optimizations, so precise answers depend on a complete code and compilation environment (CUDA version, GPU arch target) as well as the aforementioned binary utilities. It’s also possible to coax Compiler Explorer into showing you various aspects of a complete/compilable code, such as the SASS.

Apart from the answers to your questions which I have indicated above, I would offer this piece of advice: The compiler is pretty good at optimization, in my view generally better than most humans. I strongly encourage you to write code that seems readable, sensible, and maintainable to you, and only seek to do such “low-level optimizations” when you have a good reason to do so, which usually means that a profiler has indicated clearly that the code section you are focused on is a top-level performance bottleneck for your application. This is basically an extended version of Donald Knuth’s advice against “premature optimization”.