forcing loop unrolls


Since the CUDA compiler often shortens the length of the unrolled loop, is there a way I can force it?
My loop length, N, is really large: 25,000,000.

I use the following:

int i;
#pragma loop unroll
for(int j=0;j<N;j++)

It is shortened to just 17 runs of N. How can I better understand why the compiler shortens it by so much?

Thank you.

The CUDA compiler compiles with full optimizations by default. This includes dead code elimination. If you have a sequence of assignments (as you have in your loop):

i = 0;
i = 1;
i = N-1;

this is equivalent to simply

i = N-1;
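
To make that concrete, here is a minimal sketch. It assumes the loop body is something like i = j, which the original post does not show, and the function names are hypothetical. The first function only overwrites i, so an optimizer is free to keep just the final store. The second makes every iteration contribute to a returned value, so no iteration is dead (the compiler may still simplify it in other ways).

```cpp
// Hypothetical illustration; names are not from the original post.

// Every iteration overwrites i, so only the last store is observable.
// An optimizing compiler may reduce the whole loop to "i = n - 1".
int overwrite_only(int n) {
    int i = 0;
    for (int j = 0; j < n; j++) {
        i = j;
    }
    return i;
}

// Every iteration feeds into the result, so no iteration can be
// discarded as dead code.
long long accumulate(int n) {
    long long sum = 0;
    for (int j = 0; j < n; j++) {
        sum += j;
    }
    return sum;
}
```

In a kernel, the equivalent of "returning" the value is writing it to global memory; a result that is never stored anywhere can be eliminated along with the loop that computed it.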

BTW, the pragma as you wrote it doesn’t look correct to me. It should be

#pragma unroll <unrolling-factor>
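
As a rough sketch of where the pragma goes: it sits directly before the loop it applies to. The function name sum_array and the HOST_DEVICE macro below are illustrative assumptions, not anything from the original post; the guard only lets the same function build as plain C++ or as CUDA device code under nvcc.

```cpp
// Hypothetical sketch; sum_array and HOST_DEVICE are illustrative names.
#ifdef __CUDACC__
#define HOST_DEVICE __host__ __device__
#else
#define HOST_DEVICE
#endif

// Sums n floats. The pragma asks the compiler to unroll the loop by a
// factor of 4. It affects code generation only, not the result, and
// compilers that do not recognize the pragma ignore it.
HOST_DEVICE float sum_array(const float* data, int n) {
    float total = 0.0f;
#pragma unroll 4
    for (int j = 0; j < n; j++) {
        total += data[j];
    }
    return total;
}
```

With a runtime trip count like n here, a fixed factor is the usual choice, since the compiler cannot know how many copies a full unroll would need.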

If I am not mistaken, unrolling is limited when the compiler doesn’t know the value of N at compile time.
But that is not really particular to CUDA; it applies to C/C++ in general.

I am posting this from the NVIDIA programming guide:

“If no number is specified after #pragma unroll, the loop is completely unrolled if its trip count is constant, otherwise it is not unrolled at all.”
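
One way to give the compiler a constant trip count is to pass it as a template parameter. The sketch below is only an illustration under that assumption; dot_fixed and HOST_DEVICE are hypothetical names, and the trip count is deliberately small, since a full unroll emits one copy of the loop body per iteration.

```cpp
// Hypothetical sketch; dot_fixed and HOST_DEVICE are illustrative names.
#ifdef __CUDACC__
#define HOST_DEVICE __host__ __device__
#else
#define HOST_DEVICE
#endif

// N is a template parameter, so the trip count is a compile-time
// constant and a bare "#pragma unroll" requests a complete unroll.
// Compilers that do not recognize the pragma ignore it.
template <int N>
HOST_DEVICE float dot_fixed(const float* a, const float* b) {
    float acc = 0.0f;
#pragma unroll
    for (int j = 0; j < N; j++) {
        acc += a[j] * b[j];
    }
    return acc;
}
```

A call such as dot_fixed<4>(a, b) fixes the trip count at 4. The same pattern with a trip count in the millions would ask for millions of copies of the body, which is a plausible reason for a compiler to refuse.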

So if N is set to 25,000,000, why would it not completely unroll?

If you are using

#pragma loop unroll

that is incorrect. There is no "loop" in the pragma; it is just #pragma unroll.

Furthermore, there is almost certainly some unstated limit to loop unrolling.