OK, I did some experimenting with the compiler and discovered that the loop as written above is never unrolled, even if you give #pragma unroll an explicit unroll count. (The compiler does convert the integer division to a bit shift, as you would hope.) Something about that form of the loop disables the loop unroller entirely. I think that is a definite bug when an explicit unroll count is given, and a good feature request for the plain #pragma unroll (especially given that the MS shader compiler can do it).
This code (while uglier) does unroll completely, and the compiler is smart enough to precompute the 1 << i values:
__global__ void func(float4* _o)
{
    const int BLOCK_DIM_X_LOG2 = 9;
    const int BLOCK_DIM_X = 1 << BLOCK_DIM_X_LOG2;
    for (int i = BLOCK_DIM_X_LOG2 - 1; i > 0; i-=1)
        _o[1 << i] = make_float4(0,0,0,0);
}
This would have been shorter, but I couldn’t find a way to get the compiler to compute log2(X) at compile time.
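One possible way to get log2(X) at compile time is a recursive template, sketched below. This is plain C++ with nothing CUDA-specific in it, and the name StaticLog2 is made up for this example. I have not checked whether the unroller treats the resulting constant the same way it treats a literal 9, so take it as a starting point under that assumption rather than a confirmed fix.

```cpp
// Hypothetical helper (not part of CUDA): floor(log2(N)) computed at compile time.
// Requires N >= 1; N == 0 is not handled and would recurse without terminating.
template <unsigned int N>
struct StaticLog2
{
    enum { value = 1 + StaticLog2<N / 2>::value };
};

// Base case: log2(1) == 0 ends the recursion.
template <>
struct StaticLog2<1>
{
    enum { value = 0 };
};
```

With this you could define BLOCK_DIM_X = 512 first and derive the exponent from it, e.g. const int BLOCK_DIM_X_LOG2 = StaticLog2<BLOCK_DIM_X>::value;, instead of hard-coding the 9.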