Registers and locally declared variables
Variables declared in __global__ functions

Are variables that are declared inside a kernel’s __global__ function stored in local registers? For example, is the variable “int bx” inside the “multd” function of the CUDA guide’s matrix multiplication example stored in a local GPU register? Are there limits to the size of an array of floats that can be stored in register memory?

Yes, local variables are usually stored in registers.

As described in the programming guide, there are a total of 8192 32-bit registers per multiprocessor on G80, and these are shared between all the thread blocks executing on the multiprocessor.

So if you had just a single block of 256 threads, you could use up to 32 registers per thread (8192 / 256 = 32), each of which can hold one float.

The occupancy calculator is a useful tool here:
http://developer.download.nvidia.com/compu…_calculator.xls

The compiler generally tries to minimize the number of registers used. If you use too many registers, it will start spilling values to local memory (which is much slower because it is off-chip).

If you index into an array of floats with a value that isn’t known at compile time, the array has to be stored in local memory, because the hardware can’t dynamically index into the register file.
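A minimal sketch of the two cases (kernel names are made up for illustration): with only compile-time-constant indices the compiler can in principle keep the elements in registers, while a data-dependent index forces the array into local memory.

// Hypothetical kernels illustrating constant vs. dynamic indexing.

__global__ void static_index(float *out)
{
    float a[4];                    // every access below uses a constant index
    a[0] = 1.0f; a[1] = 2.0f; a[2] = 3.0f; a[3] = 4.0f;
    // No dynamic index anywhere, so a[] could live entirely in registers.
    out[threadIdx.x] = a[0] + a[1] + a[2] + a[3];
}

__global__ void dynamic_index(const int *idx, float *out)
{
    float a[4];
    a[0] = 1.0f; a[1] = 2.0f; a[2] = 3.0f; a[3] = 4.0f;
    // idx comes from memory, so a[i] cannot be resolved at compile time --
    // the array must be placed in (slow, off-chip) local memory.
    int i = idx[threadIdx.x];
    out[threadIdx.x] = a[i];
}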

In my experience, arrays of floats get dumped to local memory and aren’t stored in registers. The compiler is very dumb about such things. I had to turn the array into regular variables and manually unroll all the loops using preprocessor tricks. (The compiler should have unrolled the static loops itself and realized that the array isn’t being dynamically indexed.)
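The workaround described above looks roughly like this (a sketch under my own naming, not the poster’s actual code): replace the array with one named scalar per element, and unroll the loop by hand with a macro so every access is to a plain variable.

// Before: a small array and a loop -- often spilled to local memory.
//   float v[3];
//   for (int i = 0; i < 3; ++i) v[i] = f(i);
//
// After: one scalar per element plus manual unrolling, so each value
// can be register-allocated. f() is a stand-in for the real computation.

__device__ float f(int i) { return (float)i; }

#define BODY(i, var) var = f(i)

__device__ float unrolled(void)
{
    float v0, v1, v2;
    BODY(0, v0);
    BODY(1, v1);
    BODY(2, v2);
    return v0 + v1 + v2;
}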

Right now the compiler can’t be trusted. Always inspect the ptx/cubin it produces because it could be costing you an order of magnitude in performance. First-line optimization tricks such as trying to do blocking inside the kernel can actually lead to substantial performance losses as often as not if the compiler decides that what you really need is a trip to local memory every other instruction.
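To inspect what the compiler actually produced, the standard nvcc flags below are useful: look for ld.local/st.local instructions in the emitted PTX, and for a nonzero “lmem” figure in the ptxas resource report (the source filename here is just a placeholder).

```shell
# Emit PTX for inspection; grep for local-memory loads/stores.
nvcc -ptx kernel.cu -o kernel.ptx
grep -n "local" kernel.ptx

# Ask ptxas to report per-kernel register and local-memory usage.
nvcc --ptxas-options=-v -c kernel.cu
```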

Thanks for the information. Though I have plenty of threads for warp and block swapping to hide device memory access latency, it may be prudent for me to use shared memory for arrays.
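Moving a per-thread array into shared memory might look like the sketch below (BLOCK_SIZE, ARRAY_LEN, and the kernel name are assumptions): each thread gets its own slice, and the hardware does support dynamic indexing into shared memory, unlike the register file. At 256 threads x 8 floats this is 8 KB, which fits in G80’s 16 KB of shared memory per multiprocessor.

#define BLOCK_SIZE 256
#define ARRAY_LEN 8

__global__ void kernel_with_shared_array(float *out)
{
    // One ARRAY_LEN-element slice per thread, carved out of shared memory.
    __shared__ float arr[BLOCK_SIZE * ARRAY_LEN];
    float *my = &arr[threadIdx.x * ARRAY_LEN];

    for (int i = 0; i < ARRAY_LEN; ++i)
        my[i] = (float)i;           // dynamic indexing is fine here

    out[threadIdx.x] = my[threadIdx.x % ARRAY_LEN];
}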

I can confirm that something like

template <class T, unsigned int LEN>
__device__ void my_function( … )
{
    T my_array[LEN];

    // … compute something …
}

does in fact use local memory for my_array, even for LEN = 1 and basic types T such as float2. I was quite surprised to see this, as ‘T’ and ‘LEN’ are given at compile time and it should be fairly easy to assign registers instead.

I was particularly surprised because I chose this approach for my kernel based on the following statement in the CUDA programming guide v. 1.0, sec. 4.2.2.4:

“An automatic variable declared in device code without any of these qualifiers generally resides in a register. However in some cases the compiler might choose to place it in local memory. This is often the case for large structures or arrays that would consume too much register space, and arrays for which the compiler cannot determine that they are indexed with constant quantities.”

I guess the compiler doesn’t detect that much in its present state. So better read “often” as “always” above for now ;-)

– Thomas