Wouldn’t a reduction algorithm be a better solution for that (unless, of course, you are using those atomics purely for control)?
Atomic operations on the same address are serialised, which can significantly reduce performance, so you most likely won’t get much improvement from running it on multiple processors.
I am using a reduction algorithm, but its last step writes the result to global memory, and I use atomic operations there just for correctness' sake.
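
A minimal sketch of that pattern, assuming an integer sum and a block size of 256: each block reduces its slice in shared memory, and only thread 0 of each block issues an atomic add to global memory. The names `blockSumKernel`, `sumOnDevice` and `kBlockSize` are made up for illustration and are not the actual code under discussion; error checking is left out to keep it short.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Assumed block size; the tree reduction below needs a power of two.
constexpr int kBlockSize = 256;

// Each block sums its share of `in` in shared memory, then one thread
// per block adds the block's partial sum to *out with a single atomicAdd.
__global__ void blockSumKernel(const int* in, int* out, int n) {
    __shared__ int partial[kBlockSize];
    int tid = threadIdx.x;

    // Grid-stride loop: every thread accumulates a private sum first.
    int sum = 0;
    for (int i = blockIdx.x * blockDim.x + tid; i < n;
         i += gridDim.x * blockDim.x) {
        sum += in[i];
    }
    partial[tid] = sum;
    __syncthreads();

    // Tree reduction in shared memory; no atomics needed here.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride) {
            partial[tid] += partial[tid + stride];
        }
        __syncthreads();
    }

    // Last step: one atomic per block, so contention is at most gridDim.x.
    if (tid == 0) {
        atomicAdd(out, partial[0]);
    }
}

// Hypothetical host helper: copies the data over, runs the kernel,
// and returns the total.
int sumOnDevice(const std::vector<int>& host) {
    int n = static_cast<int>(host.size());
    if (n == 0) return 0;

    int* dIn = nullptr;
    int* dOut = nullptr;
    int result = 0;
    cudaMalloc(&dIn, n * sizeof(int));
    cudaMalloc(&dOut, sizeof(int));
    cudaMemcpy(dIn, host.data(), n * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemset(dOut, 0, sizeof(int));  // the accumulator must start at zero

    int blocks = (n + kBlockSize - 1) / kBlockSize;
    blockSumKernel<<<blocks, kBlockSize>>>(dIn, dOut, n);

    // This blocking copy on the default stream waits for the kernel.
    cudaMemcpy(&result, dOut, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dOut);
    return result;
}
```

With this layout the number of atomic operations equals the number of blocks rather than the number of elements, so the serialisation cost should stay small compared with doing an atomic add per thread.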