Confusing register usage

I have these two code snippets, which produce the same result:

__device__ void f(int N) { // N in range (0, 31)
    int A[32];
    for (int i = 0; i < N; i++) {
        // do something to A[i]
    }
}

__device__ void f(int N) { // N in range (0, 31)
    int A[32];
    for (int i = 0; i < 32; i++) {
        if (i >= N) {break;}
        // do something to A[i]
    }
}

But the first one uses local memory, while the second one uses registers. What is the difference?

The compiler can completely unroll the second loop because its iteration count is a small compile-time constant. It cannot do that for the first loop, because the iteration count is only known at runtime. After completely unrolling the second loop, the compiler finds that all instances of A[i] map to compile-time constant addresses, allowing each to be replaced with a scalar variable. Scalars can live in registers, whereas an array indexed with a runtime value has to be placed in local memory because registers are not indexable.
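To illustrate what complete unrolling buys, here is a hand-written sketch of roughly what the compiler does with the second loop, shortened to 4 elements instead of 32. The function names sum_looped and sum_unrolled and the loop body (storing i * i and summing) are made up for illustration; the code the compiler actually generates will differ.

```cuda
// Hypothetical loop body: store i * i in A[i], then add it to a running sum.

// Loop as written in the question: fixed trip count plus early exit.
__host__ __device__ int sum_looped(int N) {    // N assumed in [0, 4]
    int A[4];
    int sum = 0;
    for (int i = 0; i < 4; i++) {
        if (i >= N) { break; }
        A[i] = i * i;
        sum += A[i];
    }
    return sum;
}

// Roughly what remains after complete unrolling: every array index is a
// constant, so each element becomes its own scalar (a0..a3) that can be
// kept in a register. The per-iteration comparison against N remains.
__host__ __device__ int sum_unrolled(int N) {  // N assumed in [0, 4]
    int a0, a1, a2, a3;
    int sum = 0;
    if (0 >= N) { return sum; }
    a0 = 0 * 0; sum += a0;
    if (1 >= N) { return sum; }
    a1 = 1 * 1; sum += a1;
    if (2 >= N) { return sum; }
    a2 = 2 * 2; sum += a2;
    if (3 >= N) { return sum; }
    a3 = 3 * 3; sum += a3;
    return sum;
}
```

Both functions are marked __host__ __device__ only so that they can be compared on the CPU; for every N from 0 to 4 they return the same value.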

To be clear, the compiler can partially unroll the first loop, e.g. by a factor of 4 or 8, but that is not sufficient to map each instance of A[i] to a scalar variable. When it considers unrolling, the compiler has no knowledge that N < 32; it just knows that N is a value only known at runtime.

So it seems like the second code snippet adds one extra compare operation per loop iteration compared with the first one. Is there any way to avoid it? Or to write the code in a more elegant way?

These if-statements are unavoidable as the loop needs to run to N and not 32.

To make the code more elegant (by some definition of elegant), you could experiment with what happens to the first loop if you supply the compiler with range information for N by using __builtin_assume(). It is worth a try, but probably not sufficient to have the compiler automagically transform loop 1 into loop 2. As a second step, you could try combining __builtin_assume() with inserting #pragma unroll 32 just before the loop.
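A minimal sketch of that experiment might look like the following. The names f_assume, assume_kernel and run_assume, as well as the loop body, are placeholders invented for this example, and nothing here guarantees that the compiler will move A into registers.

```cuda
#include <cuda_runtime.h>

// Hypothetical device function; the loop body is only a placeholder.
__device__ int f_assume(int N) {
    // Promise the compiler that N is in [0, 31]. Behavior is undefined
    // if this promise is broken at runtime.
    __builtin_assume(N >= 0 && N < 32);
    int A[32];
    int sum = 0;
    // Request unrolling by a factor of 32; the trip count is still N.
    #pragma unroll 32
    for (int i = 0; i < N; i++) {
        A[i] = i;
        sum += A[i];
    }
    return sum;
}

__global__ void assume_kernel(int N, int *out) {
    *out = f_assume(N);
}

// Hypothetical host helper: runs the kernel with a single thread and
// returns its result, or -1 if a CUDA call fails.
int run_assume(int N) {
    int *d_out = nullptr;
    int h_out = -1;
    if (cudaMalloc(&d_out, sizeof(int)) != cudaSuccess) { return -1; }
    assume_kernel<<<1, 1>>>(N, d_out);
    cudaError_t err = cudaMemcpy(&h_out, d_out, sizeof(int),
                                 cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    return (err == cudaSuccess) ? h_out : -1;
}
```

Compiling with nvcc -Xptxas -v prints the register count and stack frame size per kernel, which is one way to see whether the array ended up in local memory. With a loop body this trivial the compiler may eliminate A altogether, so the check is only meaningful with the real loop body in place.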

Depending on how this function is being used, you could also consider making it into a templated function using N as a template parameter, such that each instantiation of the template sees N as a compile-time constant.
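For the simplest case, where N is already known at compile time at the call site, a sketch could look like this. The names f_templ, templ_kernel and run_templ and the loop bodies are assumptions made for the example, not code from the question.

```cuda
#include <cuda_runtime.h>

// Hypothetical templated variant: N is a compile-time constant in each
// instantiation, so both loops have a known trip count.
template <int N>
__device__ int f_templ() {
    static_assert(N >= 0 && N < 32, "N must be in [0, 31]");
    int A[32];
    int sum = 0;
    for (int i = 0; i < N; i++) {
        A[i] = 2 * i + 1;       // placeholder body: i-th odd number
    }
    for (int i = 0; i < N; i++) {
        sum += A[i];
    }
    return sum;
}

template <int N>
__global__ void templ_kernel(int *out) {
    *out = f_templ<N>();
}

// Hypothetical host helper: launches the instantiation for N with a single
// thread and returns its result, or -1 if a CUDA call fails.
template <int N>
int run_templ() {
    int *d_out = nullptr;
    int h_out = -1;
    if (cudaMalloc(&d_out, sizeof(int)) != cudaSuccess) { return -1; }
    templ_kernel<N><<<1, 1>>>(d_out);
    cudaError_t err = cudaMemcpy(&h_out, d_out, sizeof(int),
                                 cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    return (err == cudaSuccess) ? h_out : -1;
}
```

A call such as run_templ<8>() selects the instantiation at compile time. When N is only known at runtime, some form of dispatch over the instantiations is still needed.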

Neither of these two methods works. Thanks!

I am a bit puzzled. Templating, using N as a template parameter, should definitely work in allowing registers to be used and avoiding the if-statements.

I understand that templating may not fit the requirements of your programming context. The approach I have used successfully in the past is to put function pointers to N instantiations of the template into an array, then invoke the desired variant through that array. It creates a bit more overhead for the function call, but allows maximum optimization inside the function. Creating 32 instantiations does not seem excessive to me. Here is an example:

#include <stdio.h>

template<int N>__device__ int f(void) { // N in range (0, 31)
    int A[32], sum = 0;
    for (int i = 0; i < N; i++) {
        A[i] = 1 << i;
    }
    for (int i = 0; i < N; i++) {
        sum = sum * 0.03f + A[i];
    }    
    return sum;
}

typedef int (*fp)(void);
const __constant__ fp func[32] ={
    f<0>,f<1>,f<2>,f<3>,f<4>,f<5>,f<6>,f<7>,
    f<8>,f<9>,f<10>,f<11>,f<12>,f<13>,f<14>,f<15>,
    f<16>,f<17>,f<18>,f<19>,f<20>,f<21>,f<22>,f<23>,
    f<24>,f<25>,f<26>,f<27>,f<28>,f<29>,f<30>,f<31>
};

__global__ void kernel (int N)
{
    printf ("result=%d\n", func[N]());
}

int main (void)
{
    kernel<<<1,1>>>(31);
    cudaDeviceSynchronize();
    return 0;
}

I will try it, thank you so much!