Probably the conditional break inside the loop is the problem. The loop stop condition can’t be evaluated at compile time, so the trip count can’t be determined.
@tera: I just tried your proposal, but it didnt work. However it works when commenting out the line that accesses the shared memory array.
#pragma unroll
for (int i = (BLOCK_SIDE_LENGTH * BLOCK_SIDE_LENGTH) / 2; i > 0; i >>= 1)
{
if (tid >= i) break;
if (smin[tid+i] < smin[tid]) smin[tid] = smin[tid+i]; //unrolling works when commenting out this line
}
I can not see, why this line prevents the compiler from unrolling???
Tanks again for your answer. Your code compiles and works, but still no luck with unrolling the whole thing. I start thinking about filing a bug report and meanwhile unrolling the whole thing by hand or via template parameters (as shown here: Loop Unrolling over Template Arguments - CodeProject).
It is both the form of the for loop and the break statement. Either of them will prevent loop unrolling.
While in the case of the for statement that is expected, to me it comes as a surprise that the break statement may not be part of unrolled code - it seems so easy to implement (although it partly defies the reason of unrolling).
On a side note, I would guess that this code is not doing what you think it does - do not expect the minimum of smin to proliferate into smin[0]…smin[blockDim.x*blockDim.y-1]. I believe what you actually wanted to do is
#define BLOCK_SIDE_LENGTH 16
#define BLOCKDIM 64 // blockDim.x*blockDim.y, however fixed at compile time already
#if BLOCK_SIDE_LENGTH*BLOCK_SIDE_LENGTH % BLOCKDIM
# error "square(BLOCK_SIDE_LENGTH) must be a multiple of BLOCKDIM"
#endif
#pragma unroll
for (unsigned int i = BLOCKDIM; i < BLOCK_SIDE_LENGTH * BLOCK_SIDE_LENGTH; i += BLOCKDIM)
{
smin[tid] = min(smin[tid], smin[tid + i]);
}
which then can be further optimized to
#define BLOCK_SIDE_LENGTH 16
#define BLOCKDIM 64 // blockDim.x*blockDim.y, however fixed at compile time already
#if BLOCK_SIDE_LENGTH*BLOCK_SIDE_LENGTH % BLOCKDIM
# error "square(BLOCK_SIDE_LENGTH) must be a multiple of BLOCKDIM"
#endif
float tmin = smin[tid];
#pragma unroll
for (unsigned int i = BLOCKDIM; i < BLOCK_SIDE_LENGTH * BLOCK_SIDE_LENGTH; i += BLOCKDIM)
{
tmin = min(tmin, smin[tid + i]);
}
smin[tid] = tmin;
I am trying to find the minimum in an image of float numbers. I have tested my code against hundreds od random testframes and it works perfectly fine.
The kernel actually reads the float values of the repective 2-dimensional block into a 1-dimensional shared memory array. The for loop performs a parallel reduction over the values such that smin[0] contains the minimum of the block.
The unrolling problem indeed seems to be a combination of the for loop and the break statement. After rewriting the kernel as follows, the code unrolls, when commenting out the break statement, so I am one step further :)
#pragma unroll
for (int i = BSL_INIT_LOG2 - 1; i >= 0; i--)
{
unsigned int j = 1 << i;
if (tid >= j) break;
s_min[tid] = min(s_min[tid], s_min[tid+j]);
}
So the remaining problem is the break statement. I’ll have a look how I can get rid of it.