Strategies for improving atomic operation performance

In a workhorse kernel, profiling reveals that my memory reads are close to optimal, but the resulting atomicAdd() calls are responsible for most of the kernel's running time.

Unfortunately the indices (in global memory) which are being updated are going to be mostly random; otherwise I would use shared memory to accumulate/collect values, which would then be written to a pre-known chunk of global memory. The input data is processed in the kernel, and based on those results the indices for the writes are determined.
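
For concreteness, the shared-memory pattern I mean looks roughly like this; it only works when each block's output chunk is known up front (BINS and all the names here are illustrative, not my actual kernel):

#define BINS 256   // per-block accumulator; must fit in shared memory

__global__ void accumulateToKnownChunk(const int *bins, const int *values, int *output, int n)
{
    __shared__ int localSum[BINS];
    for (int i = threadIdx.x; i < BINS; i += blockDim.x)
        localSum[i] = 0;
    __syncthreads();

    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n)
        atomicAdd(&localSum[bins[tid] % BINS], values[tid]);   // cheap shared-memory atomic
    __syncthreads();

    // each block writes its sums to its own pre-known chunk of global memory
    for (int i = threadIdx.x; i < BINS; i += blockDim.x)
        output[blockIdx.x * BINS + i] = localSum[i];
}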

I have considered some type of pre-processing step, but this is a really large input data set (0.5 GB), and that may not end up helping.

Any ideas for some general approaches to remedy such an issue?

“Unfortunately the indices(in global memory) which are being updated are going to be mostly random…”

So, I take it that this implies uncoalesced memory access.

You also mention atomicAdd(), so I also take it that you are not merely writing out, but updating: reading and writing.

A rather hypothetical proposition:
Uncoalesced global memory writes can hardly be sped up, as global memory writes are cached; this implies that threads hardly wait in such cases, and thus that additional threads cannot improve throughput by issuing global memory write requests faster or in greater quantity.
However, uncoalesced global reads can potentially be sped up by increasing the number of threads (warps, actually) participating in such requests.
(Example: if my thread block has 5 warps, I can have 1 thread per warp issue an uncoalesced global read, and ideally it would resemble 1 coalesced read, as the warps' uncoalesced read requests are issued in close succession.)
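
A minimal sketch of that pattern (hypothetical; the names data, indices and out are mine, not from this thread):

__global__ void warpScatterRead(const float *data, const int *indices, float *out)
{
    // one thread (lane 0) per warp issues the scattered read; the warps'
    // requests go out in close succession and may combine into fewer transactions
    int warpsPerBlock = blockDim.x / 32;
    int warpId = blockIdx.x * warpsPerBlock + threadIdx.x / 32;
    if (threadIdx.x % 32 == 0)
        out[warpId] = data[indices[warpId]];   // uncoalesced read, one per warp
}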

See whether you agree/disagree, and see whether you can use this to your advantage.

I am not reading then updating, rather only applying atomicAdd() to somewhat random memory locations. The reads (as I mentioned) are done correctly; I was just wondering if there are any tricks to improve atomic operations on random locations.

“I am not reading then updating, rather only applying atomicAdd() to somewhat random memory locations.”

You may not explicitly read and update.
Regardless, a read and update is inherently implied by atomicAdd(); otherwise I am misinterpreting the definition of atomicAdd, but I doubt it.
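
For reference, atomicAdd is defined as an atomic read-modify-write; conceptually it behaves like this (the documented semantics, not actual source code):

int atomicAdd(int *address, int val)
{
    int old = *address;     // read the current value
    *address = old + val;   // add and write it back
    return old;             // all three steps happen as one indivisible operation
}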

One way to speed up atomics is to duplicate the array you are writing into. This way you spread the conflicts for a single location over multiple locations.

I don’t know how large the array you are writing to is, but you could do something like:

x = input[...];
atomicAdd(&output[x], 1);

and apply a replication of 16:

x = input[...];
atomicAdd(&output[(x * 16) + (threadIdx.x % 16)], 1);

Of course you need to change the allocated size of the output, and you will need to combine the replicated output into a single output, but this should be relatively easy.
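
For what it's worth, the combining step could be a simple kernel along these lines (a sketch; combineReplicas and numBins are my names, not from the snippet above):

__global__ void combineReplicas(const int *replicated, int *output, int numBins)
{
    int bin = blockIdx.x * blockDim.x + threadIdx.x;
    if (bin < numBins) {
        int sum = 0;
        for (int r = 0; r < 16; ++r)
            sum += replicated[bin * 16 + r];   // the 16 replicas of one bin are contiguous
        output[bin] = sum;
    }
}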

@Gert-Jan -> Thanks, will give that a try

@little_jimmy -> I think I misunderstood the previous statement, and believe you are correct in your definition