BUG? nvcc fails to unroll the loop

Hi,

I can’t get the loop in the following code to unroll:

__global__ void func(float4* _o)
{
  const int BLOCK_DIM_X = 512;
  #pragma unroll
  for (int i = BLOCK_DIM_X/2; i > 1; i /= 2)
  {
    _o[i] = make_float4(0,0,0,0);
  }
}

Any idea how to unroll it? It seems like another bug.

Also the compiler fails to factor out make_float4(0,0,0,0), generating inefficient code.

You could unroll it with the BOOST_PP_LOCAL_ITERATE macro from Boost.Preprocessor, which lets you do some clever tricks. See http://forums.nvidia.com/index.php?showtopic=88814&hl=
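If pulling in a preprocessor library is not an option, a different technique that forces the same expansion is template recursion. This is a swapped-in alternative, not what the linked thread describes. The sketch below is plain host-side C++ so it can be compiled without a GPU. Float4, ZeroHalving and func_host are hypothetical names standing in for float4, the loop, and the kernel. In a real kernel the apply functions would presumably need a __device__ qualifier, which is an assumption not tested here.

```cpp
// Host-side stand-in for CUDA's float4, so the sketch compiles as plain C++.
struct Float4 { float x, y, z, w; };

// ZeroHalving<I> (hypothetical helper) writes o[I], then recurses with I / 2.
// It mirrors: for (int i = I; i > 1; i /= 2) o[i] = zero;
template <int I>
struct ZeroHalving {
    static void apply(Float4* o) {
        Float4 zero = {0.0f, 0.0f, 0.0f, 0.0f};
        o[I] = zero;
        ZeroHalving<I / 2>::apply(o);
    }
};

// The recursion stops once the index reaches 1 (or 0), matching the i > 1 test.
template <> struct ZeroHalving<1> { static void apply(Float4*) {} };
template <> struct ZeroHalving<0> { static void apply(Float4*) {} };

// Host-side equivalent of the kernel body for BLOCK_DIM_X = 512.
inline void func_host(Float4* o) {
    const int BLOCK_DIM_X = 512;
    ZeroHalving<BLOCK_DIM_X / 2>::apply(o);
}
```

Each instantiation contains exactly one store with a constant index, so there is no loop left for the compiler to analyze. Whether the calls are actually inlined is still up to the compiler.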

According to the programming guide, the #pragma unroll directive will only unroll the loop if the compiler can figure out how many iterations it has. It is quite possible that the form of your loop counter is too complex for the compiler to infer the number of iterations. You can also put an explicit number after unroll, if you know the number of iterations will be a multiple of the unroll value.

Does this work?

__global__ void func(float4* _o)
{
  const int BLOCK_DIM_X = 512;
  #pragma unroll 8
  for (int i = BLOCK_DIM_X/2; i > 1; i /= 2)
  {
    _o[i] = make_float4(0,0,0,0);
  }
}
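For reference, the value 8 is the full trip count of this loop: i takes the values 256, 128, 64, 32, 16, 8, 4 and 2 before the i > 1 test fails. A small host-side C++ sketch that derives the same number at compile time is shown here. HalvingTripCount is a hypothetical helper name, not something from the CUDA toolkit.

```cpp
// HalvingTripCount<Start> (hypothetical helper) counts the iterations of:
//   for (int i = Start; i > 1; i /= 2) { ... }
template <int Start>
struct HalvingTripCount {
    enum { value = 1 + HalvingTripCount<Start / 2>::value };
};

// Base cases: the loop body does not run once i is 1 or 0.
template <> struct HalvingTripCount<1> { enum { value = 0 }; };
template <> struct HalvingTripCount<0> { enum { value = 0 }; };

const int BLOCK_DIM_X = 512;

// Number of iterations of the loop that starts at BLOCK_DIM_X / 2.
const int TRIP_COUNT = HalvingTripCount<BLOCK_DIM_X / 2>::value;
```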

Well, IMHO, the compiler in general is currently in a pretty weak state. Constant propagation is one of the basic optimization techniques for compilers these days.

Hopefully it won’t take long to fix that.

While I can agree with your general statement, I don’t see how this particular problem is a constant propagation issue. I’m not familiar with the innards of nvcc (which is based on the Open64 compiler), but I assumed that the loop unroller cannot figure out how many iterations this loop will have because the loop counter is advanced by repeated integer division (or hopefully bit shifting) rather than a simple increment/decrement operation. Do more mature compilers know how to unroll a loop like this?

Microsoft’s shader compiler easily unrolls it.

OK, I did some experimenting with the compiler, and discovered that the loop as written above is never unrolled, even if you give #pragma unroll an explicit unroll parameter. (It also does convert the integer division to bit shifting, as you would hope.) Something about that form of the loop is disabling the entire loop unroller, which I think is a definite bug in the case of the explicit unroll parameter, and a good feature request in the case of the generic #pragma unroll (especially given that the MS shader compiler can do it).

This code (while uglier) does unroll completely, and the compiler is smart enough to precompute the 1 << i values:

__global__ void func(float4* _o)
{
  const int BLOCK_DIM_X_LOG2 = 9;
  const int BLOCK_DIM_X = 1 << BLOCK_DIM_X_LOG2;
  #pragma unroll
  for (int i = BLOCK_DIM_X_LOG2 - 1; i > 0; i-=1)
  {
    _o[1 << i] = make_float4(0,0,0,0);
  }
}

This would have been shorter, but I couldn’t find a way to get the compiler to compute log2(X) at compile time.
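One possible way to get log2(X) at compile time is a recursive template, sketched below as plain host-side C++. StaticLog2 is a hypothetical helper name, and whether a given nvcc version accepts it inside a kernel is an assumption that has not been checked here. It computes floor(log2(N)) for N >= 1.

```cpp
// StaticLog2<N> (hypothetical helper): floor(log2(N)) evaluated at compile time.
template <int N>
struct StaticLog2 {
    enum { value = 1 + StaticLog2<N / 2>::value };
};

// Base case: log2(1) == 0.
template <> struct StaticLog2<1> { enum { value = 0 }; };

// log2(0) is undefined; leaving this specialization incomplete turns any use
// of StaticLog2<0> into a compile error instead of unbounded recursion.
template <> struct StaticLog2<0>;

const int BLOCK_DIM_X = 512;

// Derived from BLOCK_DIM_X rather than written by hand.
const int BLOCK_DIM_X_LOG2 = StaticLog2<BLOCK_DIM_X>::value;
```

With this, the kernel could keep BLOCK_DIM_X as the single hand-written constant and use BLOCK_DIM_X_LOG2 as the loop bound.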