The CUDA compiler applies full optimizations by default, and that includes dead code elimination. If you have a sequence of assignments to the same variable (as you have in your loop):
i = 0;
i = 1;
...
i = N-1;
this is equivalent to simply
i = N-1;
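To make that concrete, here is a minimal host-side C++ sketch; the same reasoning applies to device code. The function names `last_index_loop` and `last_index_direct` are made up for illustration and are not from the original post. Only the final store to `i` is ever observed, so an optimizing compiler is free to reduce the first function to something like the second.

```cpp
// Hypothetical helper names, for illustration only.

// Every iteration overwrites i, and no earlier value is ever read.
int last_index_loop(int n) {
    int i = 0;
    for (int k = 0; k < n; ++k) {
        i = k;  // a dead store in every iteration except the last
    }
    return i;
}

// What the loop above computes, without the loop.
// The n > 0 check covers the case where the loop body never runs.
int last_index_direct(int n) {
    return n > 0 ? n - 1 : 0;
}
```

Both functions return the same value for any `n`, so timing the loop version may tell you little about the cost of the loop itself.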
BTW, the pragma as you wrote it doesn’t look correct to me. It should be
If I am not mistaken, the unrolling is limited if the compiler doesn't know the value of N at compile time?
But then this is not really specific to CUDA; it applies to C/C++ in general.
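On the question of unrolling with an unknown N, a rough sketch of the two cases in plain C++ (which nvcc also accepts) might look like the following. The names `sum_fixed` and `sum_runtime` are hypothetical. As far as I recall, the CUDA programming guide says a bare `#pragma unroll` fully unrolls a loop only when its trip count is a compile-time constant, but treat that as my recollection, not a guarantee.

```cpp
// Hypothetical names, for illustration only.

// The trip count N is a template parameter, i.e. a compile-time constant,
// so the compiler is able to unroll this loop completely.
// (Under nvcc one would typically put "#pragma unroll" before the loop.)
template <int N>
int sum_fixed(const int* data) {
    int s = 0;
    for (int k = 0; k < N; ++k) {
        s += data[k];
    }
    return s;
}

// The trip count n is only known at run time, so the loop cannot be
// fully unrolled; at best it can be unrolled by some fixed factor
// with a remainder loop for the leftover iterations.
int sum_runtime(const int* data, int n) {
    int s = 0;
    for (int k = 0; k < n; ++k) {
        s += data[k];
    }
    return s;
}
```

Both versions compute the same result. The only difference is how much the compiler knows about the loop bound, which is a C/C++ matter and not something particular to CUDA.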