I have an algorithm that transforms a 2D map into another 2D map of the same dimensions. (Each pixel is computed according to a complex formula that is beside the point of this post.)

Besides the value for each pixel (written into the output map), I also compute a boolean for each pixel. The final output of the algorithm is the output map, plus the number of pixels for which the boolean is true.

The naive use of atomicAdd to implement this counter proved to be extremely inefficient.
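
For reference, the naive pattern looks roughly like the sketch below. This is not the actual code: the per-pixel formula is replaced by a hypothetical placeholder (`transform_pixel`), and the host helper `run_naive`, the 16x16 block size, and the test pattern are made up for illustration. Error checking is left out for brevity.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical stand-in for the real per-pixel formula: doubles the input
// and flags pixels whose input exceeds 0.5.
__device__ float transform_pixel(float v, bool* flag) {
    *flag = v > 0.5f;
    return 2.0f * v;
}

// One thread per pixel. Every flagged pixel increments the same global counter.
__global__ void transform_and_count(const float* in, float* out,
                                    int width, int height,
                                    unsigned int* counter) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;

    int idx = y * width + x;
    bool flag;
    out[idx] = transform_pixel(in[idx], &flag);
    if (flag) atomicAdd(counter, 1u);  // all flagged threads contend for one address
}

// Hypothetical host helper: builds a map in which every 4th pixel is flagged,
// runs the kernel, and returns the resulting count.
unsigned int run_naive(int width, int height) {
    size_t n = (size_t)width * height;
    std::vector<float> h_in(n);
    for (size_t i = 0; i < n; ++i) h_in[i] = (i % 4 == 0) ? 1.0f : 0.0f;

    float* d_in = nullptr;
    float* d_out = nullptr;
    unsigned int* d_counter = nullptr;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));
    cudaMalloc(&d_counter, sizeof(unsigned int));
    cudaMemcpy(d_in, h_in.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemset(d_counter, 0, sizeof(unsigned int));

    dim3 block(16, 16);
    dim3 grid((width + block.x - 1) / block.x,
              (height + block.y - 1) / block.y);
    transform_and_count<<<grid, block>>>(d_in, d_out, width, height, d_counter);

    // The blocking copy also waits for the kernel to finish.
    unsigned int count = 0;
    cudaMemcpy(&count, d_counter, sizeof(count), cudaMemcpyDeviceToHost);

    cudaFree(d_in);
    cudaFree(d_out);
    cudaFree(d_counter);
    return count;
}

int main() {
    printf("flagged pixels: %u\n", run_naive(64, 64));
    return 0;
}
```

The point of the sketch is only that every flagged thread in the grid issues its own atomicAdd to a single global address.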

OTOH, I have a DirectX 9 pixel shader version of this code that uses an occlusion query for the count, and it is an order of magnitude faster than the CUDA version. Commenting out the atomicAdd makes the CUDA version a bit faster than the shader.

So the problem seems to be the atomicAdd. What is the best way to implement this global counter?