C++ Templates and NVCC's use of registers

Hi everyone,

I am seeing some strange behavior from NVCC when compiling kernels that use an integer template parameter to specify the size of an array. The basic idea is that I have a kernel that uses an auxiliary array to compute a final answer; the length of this array is an argument to the kernel.

Most frequently, this auxiliary array only needs to be quite small (10–30 unsigned char elements), but the kernel (via several nested loops) makes a large number of reads from and writes to this array. My current effort is to optimize this kernel by having the array reside in registers instead of local memory. To this end, I’ve asked the compiler to unroll my loops (using #pragma unroll) and declared the arrays to be of constant size, MAX_SIZE. Further, to make sure the loops actually unroll, I use C++ templates to parameterize the loop bound. Roughly, my code looks like this:

#define MAX_SIZE 30
template <int L>
__global__ void f() {

 unsigned char array[MAX_SIZE];
 #pragma unroll
 for(int i = 0; i < L; ++i) {...}
 ...
}

I use a bit of template magic to compile a whole family of kernels with L ranging from 1 to 30 (a sketch of this is below). Looking at nvcc’s output for just one of these functions, with --ptxas-options=-v, we see that it works as it should:

ptxas info    : Compiling entry function '_Z1fILi20EEvPhiS0_S0_S0_i' for 'sm_61'
ptxas info    : Function properties for _Z1fILi20EEvPhiS0_S0_S0_i
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 79 registers, 364 bytes cmem[0]

A bunch of registers get used (and the number varies with L, as we expect). Cool.
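For context, the "template magic" might be a recursive dispatch like the sketch below. This is my own reconstruction, not the original code; the name FamilyLauncher is hypothetical. It instantiates f<1> through f<MAX_SIZE> at compile time and picks the matching one at run time:

// Recursively instantiates f<1> ... f<MAX_SIZE> and, at run time,
// launches the instantiation that matches the requested length n.
template <int L>
struct FamilyLauncher {
    static void launch(int n, dim3 grid, dim3 block) {
        if (n == L)
            f<L><<<grid, block>>>();               // matching instantiation
        else
            FamilyLauncher<L - 1>::launch(n, grid, block);
    }
};

template <>
struct FamilyLauncher<0> {
    static void launch(int, dim3, dim3) {}         // base case: stop the recursion
};

// Usage: FamilyLauncher<MAX_SIZE>::launch(runtimeLength, grid, block);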

Finally, we’ve arrived at the question.

Because registers are a limited resource, it would be nice for the array to take up only as much space as it needs. To this end, I got rid of MAX_SIZE and replaced it with the template parameter L:

template <int L>
__global__ void f() {

 unsigned char array[L];
 #pragma unroll
 for(int i = 0; i < L; ++i) {...}
 ...
}

Looking at nvcc’s output, though, I see that something is not quite right:

ptxas info    : Compiling entry function '_Z1fILi20EEvPhiS0_S0_S0_i' for 'sm_61'
ptxas info    : Function properties for _Z1fILi20EEvPhiS0_S0_S0_i
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 4 registers, 364 bytes cmem[0]

Only 4 registers are used, and not just for this kernel but for every kernel in the family! Something seems wrong.

When I crank up the size of the family, I do start to see more than 4 registers used per kernel once L gets large. Further, when I compare the performance of the two methods, the second outperforms the first by quite a bit (roughly a factor of 10); cudaGetLastError doesn’t reveal anything going wrong, so it almost seems that nvcc finds some optimization for small L that is even better than storing the arrays in registers. This doesn’t make much sense to me, however, since I was under the impression that registers are the fastest memory available on the device.

Any clarification of what might be going on would be very appreciated!

Thanks,
Josh

The description isn’t clear, at least to me (I read the question twice). It is probably impossible to diagnose without self-contained code that others can at least compile. If you want to figure out what is going on, inspect the generated machine code (SASS) with cuobjdump --dump-sass.
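For example (the file names here are made up for illustration):

nvcc -arch=sm_61 -cubin kernel.cu -o kernel.cubin
cuobjdump --dump-sass kernel.cubin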

Is the array some sort of lookup table? If so, consider replacing it with computation; that is often faster than lookup tables on GPUs: memory access can be expensive, while floating-point operations are too cheap to meter.
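To illustrate that trade-off (my own toy example, not from the original post): a power-of-two table can be replaced by a single shift, turning a memory access into a one-instruction computation.

// Toy illustration: table lookup vs. direct computation.
__device__ unsigned int pow2_lookup(const unsigned int* table, int i) {
    return table[i];     // memory access: can cost hundreds of cycles
}

__device__ unsigned int pow2_compute(int i) {
    return 1u << i;      // single integer shift: essentially free
}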

Getting working non-trivial kernels that use just 4 registers seems like magic indeed.
It shouldn’t be too difficult, however, to look at the generated 4-register kernel using “cuobjdump -sass” to see what is going on.

On a side note, unused registers should be optimized away by ptxas anyway, so I don’t see how templating the array size would reduce register count.

Can you post a complete example to allow closer inspection?

Last time I checked, #pragma unroll did not work with template parameters. This may be the reason your array ends up in local memory instead of in registers.

I worked around this by unrolling the loop myself as a recursive construct (either using C++11 lambdas and templates, or using recursive preprocessor macro definitions).

See here for the unrolling idea using lambda expressions: http://cpplove.blogspot.de/2012/07/a-generic-loop-unroller-based-on.html and here for the same idea using functors: http://stackoverflow.com/questions/33957532/how-to-implement-base-2-loop-unrolling-at-run-time-for-optimization-purposes

There’s also a nice article on codeproject explaining both concepts: https://www.codeproject.com/Articles/75423/Loop-Unrolling-over-Template-Arguments
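For reference, here is a minimal sketch of the functor/lambda unroller those links describe (my own reconstruction; the name Unroll is hypothetical):

// Compile-time loop unroller: invokes f(0), f(1), ..., f(N-1).
// The recursion is resolved at compile time, so the loop is fully
// unrolled without relying on #pragma unroll.
template <int N>
struct Unroll {
    template <typename F>
    __device__ static void run(F f) {
        Unroll<N - 1>::run(f);   // emit the first N-1 iterations
        f(N - 1);                // then iteration N-1
    }
};

template <>
struct Unroll<0> {
    template <typename F>
    __device__ static void run(F) {}  // base case: nothing left to emit
};

// Inside the kernel, instead of the #pragma unroll loop:
//   Unroll<L>::run([&](int i) { array[i] = /* ... */; });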

“last time I checked, #pragma unroll did not work with template parameters”

I was wary of this myself, but when I checked with CUDA 8.0 it actually worked.

On a side note, this is why I didn’t refer to Norbert’s answer: I had typed mine into the browser but didn’t send it until I had confirmed that #pragma unroll works inside a template.