Looking for alternative Algorithm to AtomicAdds for random write locations in Global Memory


I currently have a kernel that relies on a lot of atomic adds to avoid race conditions but is killing performance. In a nutshell the problem is given below:

Write Value | Write Index

0.1 | 2
0.4 | 3
0.5 | 2
0.8 | 2
0.1 | 5
0.9 | 6

I need to write values on the left hand side to the memory locations on the right. The memory locations are completely random and can have multiple writes to the same location and vary throughout my simulation, hence my use of Atomics. Does anyone have any ideas of an algorithm that could be used here to avoid the use of Atomics?

Also note that I’m using a Tesla K40 for this code.

Many thanks in advance,

The problem you’ve stated is very vague.

If these variables also need to be read simultaneously then using atomic operations is your best bet. Since the pascal architecture atomic speed has increased drastically, you could opt for newer hardware.

If you only care about the last contained value, each thread can keep their own copy and you could sign it with a global counter timestamp. Afterwards you can find for each memory location with a min reduction the newest value.

One possible approach is to change the realization of the algorithm.

Problems like this often arise when the algorithm is realized in a way that the affect of each input point is processed. Since an input point can affect multiple or arbitrary output points, these types of challenges occur with this kind of realization. atomics would be a natural way to handle it.

On the other hand, another method may be to realize the algorithm in such a way as to identify what belongs in each output point. Such a realization may touch multiple input points, but doesn’t require atomics, because all processing relative to a particular output point are handled by a single device thread.

Having said all that, general optimization guides should apply. Global atomics on kepler and beyond are actually quite fast. Before tackling a major refactoring effort, it’s good practice to use the profiler with an analysis driven optimization approach. Don’t eliminate the use of atomics just because you heard it is a bad idea or may be a bottleneck. Identify/pareto the top issues with a profiler.

Hi guys, thanks for the quick responses. I’ve profiled the code and removing these atomic operations result in a 45% reduction in runtime, hence me clutching for a unicorn alternative algorithm.

I’d also tried a “pull” approach and this was slower than atomics. Looks like I’ll be updating my hardware!

Thanks again,