Atomic operations on shared memory

I keep two counters in shared memory to reduce the number of global writes. All threads in the block modify them with atomic adds/subs, and at the end thread 0 writes the final values to global memory. This works perfectly for the counterChanges counter, but the value accumulated in cyclesFound, and therefore the value added to the global nCycles array, is incorrect. Everything works if I skip the shared memory counter and do the atomic add directly on global memory instead.

__shared__ int counterChanges;
__shared__ unsigned int cyclesFound;

if (threadIdx.x == 0)
{
  counterChanges = 0;
  cyclesFound = 0;
}
__syncthreads();

...
atomicSub(&counterChanges, frontierVal);
...
atomicAdd(&cyclesFound, frontierVal);
...
atomicAdd(&counterChanges, 1);
...

__syncthreads();

if (threadIdx.x == 0)
{
  atomicAdd(frontierCounter, counterChanges);
  if (cyclesFound > 0)
  {
    atomicAdd(&nCycles[length], cyclesFound);
  }
}

Any ideas why this is not working correctly? My assumption is that atomic operations on shared memory always see the current value, even without a __syncthreads() between them. Is that right?
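
To make the question concrete, here is a minimal, self-contained sketch of the same pattern. The kernel name countKernel, the host helper runCount, and the fixed per-thread contributions are all made up for illustration and are not from my actual code. Each thread contributes a known amount so the expected totals are easy to check by hand.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel (not the real one): every thread contributes
// a net +1 to counterChanges and +1 to cyclesFound.
__global__ void countKernel(int *frontierCounter, unsigned int *nCycles)
{
    __shared__ int counterChanges;
    __shared__ unsigned int cyclesFound;

    if (threadIdx.x == 0) {
        counterChanges = 0;
        cyclesFound = 0;
    }
    // The zeroing must be visible to the whole block before any atomic runs.
    __syncthreads();

    atomicSub(&counterChanges, 1);
    atomicAdd(&cyclesFound, 1u);
    atomicAdd(&counterChanges, 2);

    // The atomics are atomic with respect to each other, but thread 0 does a
    // plain read below, so every thread has to be past its atomics first.
    __syncthreads();

    if (threadIdx.x == 0) {
        atomicAdd(frontierCounter, counterChanges);
        if (cyclesFound > 0) {
            atomicAdd(nCycles, cyclesFound);
        }
    }
}

// Hypothetical host helper: launches the kernel and copies the totals back.
// Returns false if any CUDA call fails.
bool runCount(int blocks, int threads, int *outChanges, unsigned int *outCycles)
{
    int *dChanges = nullptr;
    unsigned int *dCycles = nullptr;
    if (cudaMalloc(&dChanges, sizeof(int)) != cudaSuccess) return false;
    if (cudaMalloc(&dCycles, sizeof(unsigned int)) != cudaSuccess) {
        cudaFree(dChanges);
        return false;
    }

    bool ok = cudaMemset(dChanges, 0, sizeof(int)) == cudaSuccess;
    ok = ok && cudaMemset(dCycles, 0, sizeof(unsigned int)) == cudaSuccess;

    if (ok) {
        countKernel<<<blocks, threads>>>(dChanges, dCycles);
        ok = cudaDeviceSynchronize() == cudaSuccess;
    }
    ok = ok && cudaMemcpy(outChanges, dChanges, sizeof(int),
                          cudaMemcpyDeviceToHost) == cudaSuccess;
    ok = ok && cudaMemcpy(outCycles, dCycles, sizeof(unsigned int),
                          cudaMemcpyDeviceToHost) == cudaSuccess;

    cudaFree(dChanges);
    cudaFree(dCycles);
    return ok;
}
```

Calling runCount(4, 128, &changes, &cycles) should leave both totals at 512, since each of the 4 * 128 threads contributes exactly 1 to each counter. The real kernel differs in that frontierVal varies per thread and the same variable is passed to both the int and the unsigned int counter.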

Thanks.