Global counter? Similar to occlusion test in DirectX

I have an algorithm which transforms a 2D map into another 2D map of the same dimensions. (Each pixel is computed according to a complex formula that’s beside the point of this post.)

Besides the value for each pixel (written into the output map), I also compute a boolean for each pixel. The final output of the algorithm is the output map, plus the number of pixels for which the boolean is true.

The naive use of atomicAdd to implement this counter proved to be extremely inefficient.
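For concreteness, the naive pattern presumably looks something like the sketch below. The names `transform_naive` and `count_naive` and the placeholder formula (copy the input, flag positive values) are hypothetical stand-ins, not the actual code. The point is only that every thread whose boolean is true issues its own atomicAdd on a single global address.

```cuda
#include <cuda_runtime.h>

// Hypothetical stand-in for the real per-pixel formula: the output value is a
// copy of the input, and the boolean is "input is positive".
__global__ void transform_naive(const float* in, float* out, int n,
                                unsigned int* counter)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float v = in[i];
    out[i] = v;
    bool flag = v > 0.0f;

    // One atomic per true pixel, all targeting the same global address.
    if (flag) atomicAdd(counter, 1u);
}

// Host wrapper (error checking omitted for brevity): processes a w x h map
// stored as a flat array and returns the number of true pixels.
unsigned int count_naive(const float* h_in, float* h_out, int w, int h)
{
    int n = w * h;
    size_t bytes = n * sizeof(float);
    float *d_in = 0, *d_out = 0;
    unsigned int *d_counter = 0, result = 0;

    cudaMalloc(&d_in, bytes);
    cudaMalloc(&d_out, bytes);
    cudaMalloc(&d_counter, sizeof(unsigned int));
    cudaMemcpy(d_in, h_in, bytes, cudaMemcpyHostToDevice);
    cudaMemset(d_counter, 0, sizeof(unsigned int));

    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    transform_naive<<<blocks, threads>>>(d_in, d_out, n, d_counter);

    cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost);
    cudaMemcpy(&result, d_counter, sizeof(result), cudaMemcpyDeviceToHost);

    cudaFree(d_in);
    cudaFree(d_out);
    cudaFree(d_counter);
    return result;
}
```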

OTOH, I have a DirectX 9 pixel shader version of this code which uses an occlusion test for the count, and it is an order of magnitude faster than the CUDA version. Commenting out the atomicAdd makes the CUDA version a bit faster than the shader.

So it seems that the problem is the atomicAdd. What’s the best way to implement this global counter?

Output the boolean into a texture (same size as the 2D map) and do a reduction.

Output the boolean into a global array and do a reduction. You cannot write to a 2D texture, and a texture is not needed for the reduction anyway.
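A minimal sketch of that idea, assuming the main kernel has already written one flag per pixel (exactly 0 or 1) into a device array. The names `reduce_sum`, `count_flags`, and `REDUCE_THREADS` are hypothetical, and this is a plain shared-memory tree reduction rather than a tuned one.

```cuda
#include <cuda_runtime.h>

#define REDUCE_THREADS 256  // must be a power of two for the tree loop below

// Each block sums a grid-strided slice of `in` and writes one partial sum.
template <typename T>
__global__ void reduce_sum(const T* in, unsigned int* partial, int n)
{
    __shared__ unsigned int s[REDUCE_THREADS];

    // Per-thread accumulation over the whole input with a grid stride.
    unsigned int sum = 0;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += blockDim.x * gridDim.x)
        sum += in[i];
    s[threadIdx.x] = sum;
    __syncthreads();

    // Tree reduction in shared memory.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (threadIdx.x < stride)
            s[threadIdx.x] += s[threadIdx.x + stride];
        __syncthreads();
    }
    if (threadIdx.x == 0) partial[blockIdx.x] = s[0];
}

// Sums n flags (each 0 or 1) that already live in device memory.
// Error checking omitted for brevity.
unsigned int count_flags(const unsigned char* d_flags, int n)
{
    if (n <= 0) return 0;

    int blocks = (n + REDUCE_THREADS - 1) / REDUCE_THREADS;
    if (blocks > 64) blocks = 64;  // the grid-stride loop covers the rest

    unsigned int *d_partial = 0, *d_total = 0, total = 0;
    cudaMalloc(&d_partial, blocks * sizeof(unsigned int));
    cudaMalloc(&d_total, sizeof(unsigned int));

    // Pass 1: flags -> one partial sum per block.
    reduce_sum<<<blocks, REDUCE_THREADS>>>(d_flags, d_partial, n);
    // Pass 2: a single block folds the partial sums into one total.
    reduce_sum<<<1, REDUCE_THREADS>>>(d_partial, d_total, blocks);

    cudaMemcpy(&total, d_total, sizeof(total), cudaMemcpyDeviceToHost);
    cudaFree(d_partial);
    cudaFree(d_total);
    return total;
}
```

Note that the scratch memory here is just the flag array plus at most 64 partial sums, and no atomics are involved.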

I’m wondering how the global pixel counter in the occlusion test render state in DirectX is implemented? Does this also use reduction?

That would appear inefficient for the purpose: for a render target of NxM pixels, it would require at least an N/K x M/K map of partial counts, where K is the block size used to run pixel shaders in the driver. This memory would have to be allocated with every render target, since we don’t know ahead of time when the occlusion test render state will be triggered.

Is there no other way? A hardware counter of some sort perhaps, similar to those used by the CUDA profiler?

My use of “texture” stems from older GPGPU, when textures were all there was (and hence a texture was essentially interchangeable with memory). I did not actually mean a texture in CUDA’s terminology.

Ah, I understand, I am fresh, no GPGPU background ;)

For graphics, occlusion query is implemented as an actual counter in the hardware. Unfortunately there’s nothing like this available from CUDA, but as already mentioned you can do reductions very efficiently (in fact, at memory bandwidth rates):…tion_Harris.pdf
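One possible middle ground, offered as an untested-for-speed sketch rather than a recommendation: do the per-block part of the reduction inside the main kernel, so that only one atomicAdd is issued per block instead of one per true pixel. This assumes a device of compute capability 2.0 or later for `__syncthreads_count`. The names `transform_block_count` and `count_per_block` and the placeholder formula are hypothetical.

```cuda
#include <cuda_runtime.h>

// Placeholder formula: output copies the input, boolean is "input is positive".
__global__ void transform_block_count(const float* in, float* out, int n,
                                      unsigned int* counter)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // Out-of-range threads must still reach the barrier, so they do not
    // return early; they simply contribute a flag of 0.
    int flag = 0;
    if (i < n) {
        float v = in[i];
        out[i] = v;
        flag = v > 0.0f;
    }

    // Block-wide barrier that also returns how many threads in the block
    // passed a non-zero predicate.
    int block_total = __syncthreads_count(flag);

    // One atomic per block, skipped when the block has nothing to add.
    if (threadIdx.x == 0 && block_total > 0)
        atomicAdd(counter, (unsigned int)block_total);
}

// Host wrapper (error checking omitted for brevity): processes a w x h map
// stored as a flat array and returns the number of true pixels.
unsigned int count_per_block(const float* h_in, float* h_out, int w, int h)
{
    int n = w * h;
    size_t bytes = n * sizeof(float);
    float *d_in = 0, *d_out = 0;
    unsigned int *d_counter = 0, result = 0;

    cudaMalloc(&d_in, bytes);
    cudaMalloc(&d_out, bytes);
    cudaMalloc(&d_counter, sizeof(unsigned int));
    cudaMemcpy(d_in, h_in, bytes, cudaMemcpyHostToDevice);
    cudaMemset(d_counter, 0, sizeof(unsigned int));

    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    transform_block_count<<<blocks, threads>>>(d_in, d_out, n, d_counter);

    cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost);
    cudaMemcpy(&result, d_counter, sizeof(result), cudaMemcpyDeviceToHost);

    cudaFree(d_in);
    cudaFree(d_out);
    cudaFree(d_counter);
    return result;
}
```

With 256 threads per block this cuts the number of atomics on the counter by up to a factor of 256 and needs no extra flag array. Whether it beats a separate reduction pass would have to be measured.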

How efficiently a reduction can be implemented is irrelevant if it can be avoided.

I hope that nVidia sees the problem I’m pointing at: in my particular case, it is faster to use DirectX for something it isn’t meant for (in other words, I’m not rendering anything), and it still outperforms CUDA on the same nVidia hardware. :)

Have you tried performing the reduction to see how fast that is before saying that using DirectX is faster?

Apart from that, being able to use all hardware features from within CUDA should always be in the top 3 of the wishlist :D