Hi everyone,
I am seeing some strange behavior from NVCC when compiling kernels that use an integer template parameter to specify the size of an array. The basic idea: I have a kernel that uses an auxiliary array to compute a final answer, and the length of that array is a parameter of the kernel.
In most cases this auxiliary array is quite small (10–30 unsigned char elements), but the kernel (via several nested loops) makes a large number of reads and writes to it. My current effort is to optimize this kernel by having the array reside in registers instead of being spilled to (off-chip) local memory. To this end, I’ve hinted to the compiler that my loops should be unrolled (using #pragma unroll) and I’ve declared the array to be of constant size, MAX_SIZE. Further, to make sure the loops actually unroll, I use a C++ template parameter to bound the loop. Roughly, my code looks like this:
#define MAX_SIZE 30

template <int L>
__global__ void f() {
    unsigned char array[MAX_SIZE];
    #pragma unroll
    for (int i = 0; i < L; ++i) { ... }
    ...
}
I use a bit of template magic to compile a whole family of kernels, with L ranging from 1 to 30; a sketch of the instantiation trick is below.
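Roughly, it's a recursive dispatcher along these lines (a minimal sketch only; the names Instantiator and launch are illustrative, and my real code differs in the details):

// Sketch: recursively instantiate f<1>..f<MAX_SIZE> and dispatch
// on a runtime length. Instantiator/launch are illustrative names.
template <int L>
struct Instantiator {
    static void launch(int length, dim3 grid, dim3 block) {
        if (length == L)
            f<L><<<grid, block>>>();
        else
            Instantiator<L - 1>::launch(length, grid, block);
    }
};

template <>  // base case: no kernel for length 0
struct Instantiator<0> {
    static void launch(int, dim3, dim3) {}
};

// usage: Instantiator<MAX_SIZE>::launch(runtimeLength, grid, block);

Taking a look at nvcc’s output for just one of these kernels, compiled with --ptxas-options=-v, we see that it works as it should: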
ptxas info : Compiling entry function '_Z1fILi20EEvPhiS0_S0_S0_i' for 'sm_61'
ptxas info : Function properties for _Z1fILi20EEvPhiS0_S0_S0_i
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 79 registers, 364 bytes cmem[0]
A bunch of registers get used (and the number varies with L, as we expect). Cool.
Finally, we’ve arrived at the question.
Because registers are limited, it would be nice to declare the array to take only as much memory as it needs. To this end, I got rid of MAX_SIZE and replaced it with the template parameter L itself:
template <int L>
__global__ void f() {
    unsigned char array[L];
    #pragma unroll
    for (int i = 0; i < L; ++i) { ... }
    ...
}
Looking at nvcc’s output, though, I see that something is not quite right:
ptxas info : Compiling entry function '_Z1fILi20EEvPhiS0_S0_S0_i' for 'sm_61'
ptxas info : Function properties for _Z1fILi20EEvPhiS0_S0_S0_i
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 4 registers, 364 bytes cmem[0]
Only 4 registers are used, and not just for this kernel but for every kernel in the family! Something seems wrong.
When I crank up the size of the family, I do start to see more than 4 registers used per kernel once L gets large. Further, when I compare the performance of the two methods, the second outperforms the first by quite a bit (roughly a factor of 10). cudaGetLastError doesn’t report that anything is going wrong, so it almost seems as though nvcc finds some optimization for small L that is even better than storing the arrays in registers. That doesn’t make much sense to me, though, since I was under the impression that registers are the fastest memory available on the device.
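In case the measurement methodology matters, the comparison is along the lines of this minimal cudaEvent-based timing harness (a sketch only, assuming the simplified no-argument f<L> from above; timeKernel is an illustrative name, and the real kernel takes pointer and int arguments):

#include <cuda_runtime.h>

// Sketch: average time per launch of f<L>, in milliseconds.
// Assumes the simplified no-argument f<L> shown above.
template <int L>
float timeKernel(int blocks, int threads, int iters) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    f<L><<<blocks, threads>>>();          // warm-up launch
    cudaEventRecord(start);
    for (int i = 0; i < iters; ++i)
        f<L><<<blocks, threads>>>();
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms / iters;
}

// e.g. printf("f<20>: %f ms per launch\n", timeKernel<20>(256, 256, 100));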
Any clarification of what might be going on would be very appreciated!
Thanks,
Josh