Simple kernel producing wrong results:

I might be missing something obvious, but I get different results when I run this kernel over and over again:

  __shared__ double smem[1];
  double val = 0.0;
  for (int i = 0; i < 64; i += 1)
  {
    if (threadIdx.x == 0)
      smem[threadIdx.x] = i;

    __syncthreads();

    val += smem[0];
  }
  ptr[blockIdx.x*64 + threadIdx.x] = val;

I run this with a thread block size of 64 and a grid size of 128. What am I doing wrong here?

Thanks.

Edit:
The results differ from the previous run in random thread blocks, but always in the 2nd warp (i.e. the last 32 values of a block differ). There seems to be only one block failing.

Add a __syncthreads() after the val += smem[0]; line.

Warp 0 can race ahead of the other warps. Take this scenario:

All warps sync at the __syncthreads(). Then warp 0 proceeds. It updates its local val variable. It then continues around the for loop and updates smem[0] to the next value of i. Then it waits at the barrier.

After that, warp 1 picks up and continues executing. But it now updates its val with the smem[0] value that warp 0 has already overwritten, so it adds i+1 instead of i.

As a result, warp 0 will always produce the correct result (2016, the sum of 0 through 63), but higher warps may return 2016 or some larger number.
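For reference, here is a minimal sketch of the kernel with the second barrier added. The kernel name accumulate_fixed, the host helper count_mismatches, and the double* type of ptr are assumptions for illustration; the question only shows the kernel body.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Kernel body from the question plus a second barrier.
// accumulate_fixed and the double* type of ptr are assumed names/types.
__global__ void accumulate_fixed(double *ptr)
{
  __shared__ double smem[1];
  double val = 0.0;
  for (int i = 0; i < 64; i += 1)
  {
    if (threadIdx.x == 0)
      smem[0] = i;

    __syncthreads();  // the write of i is now visible to every warp

    val += smem[0];

    __syncthreads();  // nobody overwrites smem[0] until all threads have read it
  }
  ptr[blockIdx.x * 64 + threadIdx.x] = val;
}

// Hypothetical host helper: launches 128 blocks of 64 threads, as in the
// question, and returns how many outputs differ from 2016 (-1 on a CUDA error).
int count_mismatches()
{
  const int blocks = 128, threads = 64, n = blocks * threads;
  double *d_ptr = nullptr;
  if (cudaMalloc(&d_ptr, n * sizeof(double)) != cudaSuccess)
    return -1;

  accumulate_fixed<<<blocks, threads>>>(d_ptr);

  // cudaMemcpy waits for the kernel and also reports any launch failure.
  std::vector<double> host(n);
  cudaError_t err = cudaMemcpy(host.data(), d_ptr, n * sizeof(double),
                               cudaMemcpyDeviceToHost);
  cudaFree(d_ptr);
  if (err != cudaSuccess)
    return -1;

  int bad = 0;
  for (double v : host)
    if (v != 2016.0)
      ++bad;
  return bad;
}
```

The first barrier orders the write before the reads; the second orders the reads before the next iteration's write. With both in place, count_mismatches() should return 0 on every run on a working CUDA device.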

Of course! I knew it would be something as embarrassing as this.

Thanks!