NVCC won't unroll for loop

Hi,

nvcc won’t unroll the following loop and I have no idea why. The compilers complains that it cannot deduce the trip count:

Advisory: Loop was not unrolled, cannot deduce loop trip count

The for loop looks like this:

#define BLOCK_SIDE_LENGTH 16

...

__shared__ float smin[BLOCK_SIDE_LENGTH * BLOCK_SIDE_LENGTH];

...

unsigned int tid = threadIdx.y * blockDim.x + threadIdx.x;

...

#pragma unroll

for (int i = (BLOCK_SIDE_LENGTH * BLOCK_SIDE_LENGTH) / 2; i > 0; i >>= 1)

{

	if (tid >= i) break;

	if (smin[tid+i] < smin[tid]) smin[tid] = smin[tid+i];

}

Any ideas?

Thanks,

Kwyjibo

Probably the conditional break inside the loop is the problem. The loop stop condition can’t be evaluated at compile time, so the trip count can’t be determined.

I just tried to comment out the conditional break, but the problem stays the same.

I’d guess that the formulas used in the for statement just don’t fall into the usual patterns the compiler checks for.

Try making that

#define LOG2_BLOCK_SIDE_LENGTH 4

#define BLOCK_SIDE_LENGTH (1<<LOG2_BLOCK_SIDE_LENGTH)

#pragma unroll

for (int j = 2 * LOG2_BLOCK_SIDE_LENGTH - 1; j >= 0; j--)

{

        unsigned int i = 1<<j;

if (tid >= i) break;

        if (smin[tid+i] < smin[tid]) smin[tid] = smin[tid+i];

}

@tera: I just tried your proposal, but it didnt work. However it works when commenting out the line that accesses the shared memory array.

#pragma unroll

for (int i = (BLOCK_SIDE_LENGTH * BLOCK_SIDE_LENGTH) / 2; i > 0; i >>= 1)

{

        if (tid >= i) break;

        if (smin[tid+i] < smin[tid]) smin[tid] = smin[tid+i]; //unrolling works when commenting out this line

}

I can not see, why this line prevents the compiler from unrolling???

I can’t either, but just realize that it can be optimized. Maybe you’ll have more luck with this one:

#pragma unroll

for (int i = (BLOCK_SIDE_LENGTH * BLOCK_SIDE_LENGTH) / 2; i > tid; i >>= 1)

{

        smin[tid] = min(smin[tid], smin[tid+i]);

}

Tanks again for your answer. Your code compiles and works, but still no luck with unrolling the whole thing. I start thinking about filing a bug report and meanwhile unrolling the whole thing by hand or via template parameters (as shown here: http://www.codeproject.com/KB/cpp/loopunrolling.aspx).

In my experience to make the loop unrolled you need to have the for loop of the following form:

for (int i = some_constant; i < some_constant; i += some_constant)

so, in your case, just try

#pragma unroll

for (int i = 0; i < 2*LOG2_BLOCK_SIDE_LENGTH; ++i)

{

  ... //use  1<<(2*LOG2_BLOCK_SIDE_LENGTH-1 - i)) instead of i;

}

Hello sergeyn,

it is definitely not the for statement. Without the shared memory line, the is perfectly unrolled. Nevertheless, I will give your solution a try.

It is both the form of the for loop and the break statement. Either of them will prevent loop unrolling.

While in the case of the for statement that is expected, to me it comes as a surprise that the break statement may not be part of unrolled code - it seems so easy to implement (although it partly defies the reason of unrolling).

On a side note, I would guess that this code is not doing what you think it does - do not expect the minimum of smin to proliferate into smin[0]…smin[blockDim.x*blockDim.y-1]. I believe what you actually wanted to do is

#define BLOCK_SIDE_LENGTH 16

#define BLOCKDIM          64    // blockDim.x*blockDim.y, however fixed at compile time already 

#if BLOCK_SIDE_LENGTH*BLOCK_SIDE_LENGTH % BLOCKDIM

# error "square(BLOCK_SIDE_LENGTH) must be a multiple of BLOCKDIM"

#endif

#pragma unroll

    for (unsigned int i = BLOCKDIM; i < BLOCK_SIDE_LENGTH * BLOCK_SIDE_LENGTH; i += BLOCKDIM)

    {

        smin[tid] = min(smin[tid], smin[tid + i]);

    }

which then can be further optimized to

#define BLOCK_SIDE_LENGTH 16

#define BLOCKDIM          64    // blockDim.x*blockDim.y, however fixed at compile time already

#if BLOCK_SIDE_LENGTH*BLOCK_SIDE_LENGTH % BLOCKDIM

# error "square(BLOCK_SIDE_LENGTH) must be a multiple of BLOCKDIM"

#endif

float tmin = smin[tid];

#pragma unroll

    for (unsigned int i = BLOCKDIM; i < BLOCK_SIDE_LENGTH * BLOCK_SIDE_LENGTH; i += BLOCKDIM)

    {

        tmin = min(tmin, smin[tid + i]);

    }

    smin[tid] = tmin;

I am trying to find the minimum in an image of float numbers. I have tested my code against hundreds od random testframes and it works perfectly fine.

The kernel actually reads the float values of the repective 2-dimensional block into a 1-dimensional shared memory array. The for loop performs a parallel reduction over the values such that smin[0] contains the minimum of the block.

The unrolling problem indeed seems to be a combination of the for loop and the break statement. After rewriting the kernel as follows, the code unrolls, when commenting out the break statement, so I am one step further :)

#pragma unroll

for (int i = BSL_INIT_LOG2 - 1; i >= 0; i--)

{

	unsigned int j = 1 << i;

	if (tid >= j) break;

	s_min[tid] = min(s_min[tid], s_min[tid+j]);

}

So the remaining problem is the break statement. I’ll have a look how I can get rid of it.

Okay, I found a solution:

#pragma unroll

for (int i = BSL_INIT_LOG2 - 1; i >= 0; i--)

{

	unsigned int j = 1 << i;

	if (tid < j)

		s_min[tid] = min(s_min[tid], s_min[tid+j]);

}

Unrolls and works :clap:!

Lessons I have learnt today:

a) For loop unrolling keep for loop simple and …

b) … do not use break inside of loop.

Thanks for your help, folks!

Kwyjibo