Counting objects into an array

I’m trying to solve a problem using CUDA. I’ve got an algorithm that solves for a set of parameters and I need to increment certain values in an array based on those parameters. The arrays are conceptually 2D but organized in memory as 1D. Many of the threads will produce similar results and I cannot section the results the threads produce. Basically it’s a number of threads all needing to increment the same areas in memory.

I’ve tried two approaches using atomic operators provided in CUDA. The first approach, all threads attempt to write to the global array using atomic ops. The second approach only one thread per block writes all the data for the block. The second approach is marginally faster, but neither are very fast.

Is there an accepted solution to this problem? Can you think of a better way to do it?


How big is the array? A hardware solution to this problem might be getting a GTX 470 or GTX 480. Atomic operations are 20x faster thanks to the 768kB of L2 cache.

From the software side, this sounds like a form of the histogram problem, so searching on how efficient histogramming implementations are written for CUDA could be helpful.

Have a look at…c/histogram.pdf and

The arrays could be larger then 32MB. It’s basically a sparse histogram problem. 99% of the cells in the array will be zero while a few in very specific areas have high counts. I’m thinking some of the previous fast histogram methods are worth a try.

Ah, in that case, I wonder if it would be effective to work in a few steps:

  • Read input elements and write bin numbers to an output array

  • Sort output array

  • Merge and count (not exactly sure how this will work)