I am a bit puzzled. Templating, with N as a template parameter, should definitely work: because N is then a compile-time constant, the loops can be fully unrolled, which allows registers to be used for the array and avoids the if-statements.
I understand that templating may not fit the requirements of your programming context. The approach I have used successfully in the past is to put function pointers to the N instantiations of the template into an array, then invoke the desired variant through that array. This creates a bit more overhead for the function call, but allows maximum optimization inside the function. Creating 32 instantiations does not seem excessive to me. Here is an example:
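#include <stdio.h>

// One instantiation per value of N; N must be in the range [0, 31].
template<int N> __device__ int f(void)
{
    int A[32], sum = 0;
    // Loop bounds are compile-time constants, so the compiler can unroll
    // both loops and keep A[] in registers.
    for (int i = 0; i < N; i++) {
        A[i] = 1 << i;
    }
    for (int i = 0; i < N; i++) {
        sum = sum * 0.03f + A[i];
    }
    return sum;
}

typedef int (*fp)(void);

// Table of pointers to the 32 instantiations, indexed by N.
const __constant__ fp func[32] = {
    f<0>,  f<1>,  f<2>,  f<3>,  f<4>,  f<5>,  f<6>,  f<7>,
    f<8>,  f<9>,  f<10>, f<11>, f<12>, f<13>, f<14>, f<15>,
    f<16>, f<17>, f<18>, f<19>, f<20>, f<21>, f<22>, f<23>,
    f<24>, f<25>, f<26>, f<27>, f<28>, f<29>, f<30>, f<31>
};

__global__ void kernel (int N)
{
    // Select the variant at run time; N is not range-checked here.
    printf ("result=%d\n", func[N]());
}

int main (void)
{
    kernel<<<1,1>>>(31);
    cudaDeviceSynchronize();
    return 0;
}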
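
If writing out the 32 table entries by hand is a concern, the table can in principle be generated from a pack expansion instead. What follows is only a host-side C++14 sketch of that idea, not something I have tried with `__device__` functions or a `__constant__` array, so treat it as an illustration of the dispatch pattern rather than a drop-in replacement. The names `make_table` and `call_variant` are made up for this example, and the range check in `call_variant` is an addition that the code above does not have.

```cpp
#include <array>
#include <utility>

// Host-side copy of the templated function from the example above.
template <int N>
int f()
{
    int A[32], sum = 0;
    for (int i = 0; i < N; i++) {
        A[i] = 1 << i;
    }
    for (int i = 0; i < N; i++) {
        sum = sum * 0.03f + A[i];
    }
    return sum;
}

using fp = int (*)();

// Hypothetical helper: expands the pack 0..K-1 into { &f<0>, &f<1>, ... }.
template <int... Is>
constexpr std::array<fp, sizeof...(Is)>
make_table(std::integer_sequence<int, Is...>)
{
    return {{ &f<Is>... }};
}

// Table of pointers to the 32 instantiations, indexed by N.
constexpr auto func = make_table(std::make_integer_sequence<int, 32>{});

// Hypothetical wrapper: range-checks n before the indirect call and
// returns -1 for an out-of-range request.
int call_variant(int n)
{
    return (n >= 0 && n < static_cast<int>(func.size())) ? func[n]() : -1;
}
```

Calling `call_variant(n)` with a run-time `n` then plays the same role as `func[N]()` in the kernel above.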