Simple Reduction (+) but cannot explain the behavior

This one does NOT work (tmp is shared memory):
    __syncthreads();
    uint32_t stride = blockDim.x >> 1;
    while (threadIdx.x < stride) {
        tmp[threadIdx.x] += tmp[threadIdx.x + stride];
        stride >>= 1;
        __syncthreads();
    }
    __syncthreads(); // Just in case :-)

This one works (tmp is shared memory):
    __syncthreads();
    uint32_t stride = blockDim.x >> 1; // same initialization as in the first version
    while (stride) {
        if (threadIdx.x < stride)
            tmp[threadIdx.x] += tmp[threadIdx.x + stride];
        stride >>= 1;
        __syncthreads();
    }
    __syncthreads(); // Just in case :-)

What do you mean by “works”? Do you get compile-time or run-time errors?

In general, it helps to post compilable, runnable code so that people can try it out and help you.

By “works” I mean it computes the correct result. What I would like to know is whether the first version contains a subtle logic error.
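
One likely explanation, for what it is worth: in the first version, a thread with threadIdx.x >= stride leaves the while loop while other threads of the block are still inside it, so the __syncthreads() in the loop body is not reached by every thread. The CUDA programming guide only permits __syncthreads() in conditional code when the condition evaluates identically across the whole block; otherwise the behavior is undefined, and a hang or a wrong sum are both possible. In the second version the loop condition depends only on stride, which is the same for all threads, so every thread reaches every barrier. Below is a minimal sketch of a complete program built around the second version. The names block_sum and run_block_sum are made up for illustration, and the sketch assumes a single block whose size is a power of two no larger than 1024, with error checking left out for brevity.

```cuda
#include <cstdio>
#include <cstdint>
#include <cuda_runtime.h>

// Hypothetical kernel: sums blockDim.x values using shared memory.
// Assumes a single block and blockDim.x a power of two, at most 1024.
__global__ void block_sum(const uint32_t *in, uint32_t *out)
{
    __shared__ uint32_t tmp[1024];
    tmp[threadIdx.x] = in[threadIdx.x];
    __syncthreads();

    // The loop condition is uniform across the block, so every thread
    // executes every iteration and reaches every __syncthreads().
    for (uint32_t stride = blockDim.x >> 1; stride > 0; stride >>= 1) {
        if (threadIdx.x < stride)
            tmp[threadIdx.x] += tmp[threadIdx.x + stride];
        __syncthreads();
    }

    if (threadIdx.x == 0)
        *out = tmp[0];
}

// Hypothetical host helper: fills the input with 0..n-1, launches one
// block of n threads, and returns the sum computed on the device.
// CUDA error checking is omitted to keep the sketch short.
uint32_t run_block_sum(uint32_t n)
{
    uint32_t h_in[1024];
    for (uint32_t i = 0; i < n; ++i)
        h_in[i] = i;

    uint32_t *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, n * sizeof(uint32_t));
    cudaMalloc(&d_out, sizeof(uint32_t));
    cudaMemcpy(d_in, h_in, n * sizeof(uint32_t), cudaMemcpyHostToDevice);

    block_sum<<<1, n>>>(d_in, d_out);

    // cudaMemcpy on the default stream waits for the kernel to finish.
    uint32_t h_out = 0;
    cudaMemcpy(&h_out, d_out, sizeof(uint32_t), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return h_out;
}

int main()
{
    const uint32_t n = 256;
    const uint32_t expected = n * (n - 1) / 2; // sum of 0..n-1
    const uint32_t got = run_block_sum(n);
    printf("got %u, expected %u\n", got, expected);
    return got == expected ? 0 : 1;
}
```

This should build with something like nvcc reduce.cu -o reduce (the file name is arbitrary). Swapping the for loop for the while (threadIdx.x < stride) form from the first version would be one way to check whether the wrong results reproduce on your hardware.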