I might be missing something obvious, but I get different results when I run this kernel over and over again:
__shared__ double smem[1];
double val = 0.0;
for (int i = 0; i < 64; i += 1)
{
    if (threadIdx.x == 0)
        smem[threadIdx.x] = i;
    __syncthreads();
    val += smem[0];
}
ptr[blockIdx.x*64 + threadIdx.x] = val;
I run this with a thread block size of 64 and a grid size of 128. What am I doing wrong here?
Thanks.
Edit:
The results differ from the previous run in random thread blocks, but always in the 2nd warp (i.e., the last 32 values of a block differ). There seems to be only one block failing per run.
Add a __syncthreads() after the val += smem[0]; line.
Warp 0 can race ahead of the other warps. Take this scenario:
All warps sync at the __syncthreads(). Then warp 0 proceeds. It updates its local val variable. It then continues to the next iteration of the for loop, where thread 0 updates smem[0] to the next value of i. Then it waits at the barrier.
After that, warp 1 picks up and continues executing. But it now updates its val with the smem[0] value that warp 0 has already overwritten.
The result of this behavior is that warp 0 will always produce the correct result (2016, the sum of 0 through 63), but higher warps may return 2016 or some higher number, because in an affected iteration they add i + 1 instead of i.
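For reference, here is a minimal sketch of the kernel with the second barrier added, plus a small host-side check. The original post shows only the kernel body, so the kernel name accumulate_fixed, the double *ptr parameter type, and the host helper count_mismatches are all assumptions made for illustration.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Kernel body from the question with one extra barrier.
// Name and signature are assumed; the post does not show them.
__global__ void accumulate_fixed(double *ptr)
{
    __shared__ double smem[1];
    double val = 0.0;
    for (int i = 0; i < 64; i += 1)
    {
        if (threadIdx.x == 0)
            smem[0] = i;
        __syncthreads();  // the write of i is now visible to every warp
        val += smem[0];
        __syncthreads();  // every warp has read smem[0] before thread 0 overwrites it
    }
    ptr[blockIdx.x * 64 + threadIdx.x] = val;
}

// Hypothetical host helper: launches 128 blocks of 64 threads, as in the
// question, and returns the number of outputs that differ from 2016
// (or -1 if a CUDA call fails).
int count_mismatches()
{
    const int blocks = 128, threads = 64, n = blocks * threads;
    double *d_ptr = nullptr;
    if (cudaMalloc(&d_ptr, n * sizeof(double)) != cudaSuccess)
        return -1;

    accumulate_fixed<<<blocks, threads>>>(d_ptr);

    std::vector<double> host(n);
    // cudaMemcpy waits for the kernel to finish before copying.
    cudaError_t err = cudaMemcpy(host.data(), d_ptr, n * sizeof(double),
                                 cudaMemcpyDeviceToHost);
    cudaFree(d_ptr);
    if (err != cudaSuccess)
        return -1;

    int bad = 0;
    for (double v : host)
        if (v != 2016.0)  // 0 + 1 + ... + 63 = 2016, exactly representable
            ++bad;
    return bad;
}
```

Calling count_mismatches() from main should report 0 mismatches with the second barrier in place. Removing that barrier gives back the kernel from the question, where the count may occasionally be nonzero.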
Of course! I knew it would be something as embarrassing as this.
Thanks!