I’m trying to solve a problem using CUDA. I’ve got an algorithm that solves for a set of parameters and I need to increment certain values in an array based on those parameters. The arrays are conceptually 2D but organized in memory as 1D. Many of the threads will produce similar results and I cannot section the results the threads produce. Basically it’s a number of threads all needing to increment the same areas in memory.
I’ve tried two approaches using atomic operators provided in CUDA. The first approach, all threads attempt to write to the global array using atomic ops. The second approach only one thread per block writes all the data for the block. The second approach is marginally faster, but neither are very fast.
Is there an accepted solution to this problem? Can you think of a better way to do it?
The arrays could be larger then 32MB. It’s basically a sparse histogram problem. 99% of the cells in the array will be zero while a few in very specific areas have high counts. I’m thinking some of the previous fast histogram methods are worth a try.