In a workhorse kernel, profiling reveals that my memory reads are close to optimal, but the atomicAdd() calls are responsible for most of the kernel's running time.
Unfortunately the indices (in global memory) being updated are going to be mostly random; otherwise I would use shared memory to accumulate/collect values, which would then be written to a pre-known chunk of global memory. The input data is processed in the kernel, and the indices for the writes are determined from those results.
I have considered some type of pre-processing step, but this is a really large input data set (0.5 GB), and that may not end up helping.
Any ideas for some general approaches to remedy such an issue?
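For concreteness, the access pattern being described is roughly the following sketch (compute_bin() is a hypothetical stand-in for whatever data-dependent index computation the real kernel performs; all names here are illustrative, not the actual code):

```cuda
#include <cuda_runtime.h>

// Hypothetical stand-in for the data-dependent index computation:
// the output index is only known after the input has been processed.
__device__ int compute_bin(float x, int num_bins)
{
    // placeholder mapping; assumes non-negative inputs
    return static_cast<int>(x) % num_bins;
}

__global__ void scatter_accumulate(const float *in, float *out,
                                   int n, int num_bins)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v = in[i];                     // coalesced read (close to optimal)
        int bin = compute_bin(v, num_bins);  // effectively random index
        atomicAdd(&out[bin], v);             // scattered atomic update: the bottleneck
    }
}
```

The reads are well behaved; the cost is concentrated in the final scattered atomicAdd().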
“Unfortunately the indices(in global memory) which are being updated are going to be mostly random…”
So, I take it that this implies uncoalesced memory access
You also mention atomicAdd(), so I also take it that you are not merely writing out, but updating, i.e. reading and writing.
A rather hypothetical proposition:
Uncoalesced global memory writes can hardly be sped up: global memory writes are buffered/cached, so threads hardly wait on them, which also implies that additional threads cannot improve throughput by issuing global memory write requests faster or in greater quantity.
However, uncoalesced global reads can potentially be sped up by increasing the number of threads (warps, actually) participating in such requests.
(Example: if my thread block has 5 warps, I can have 1 thread per warp issue an uncoalesced global read, and ideally the result would resemble 1 coalesced read, as the warps' uncoalesced read requests are issued in close succession.)
See whether you agree or disagree, and whether you can use this to your advantage.
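The one-read-per-warp scheme in the parenthetical above could be sketched as follows (names are illustrative; whether this actually improves throughput is exactly the question being posed):

```cuda
#include <cuda_runtime.h>

// One lane per warp issues the scattered read, so a block with several warps
// keeps several independent read requests in flight at roughly the same time.
__global__ void one_read_per_warp(const float *table, const long *indices,
                                  float *results, int num_reads)
{
    int warps_per_block = blockDim.x / warpSize;
    int warp_id = blockIdx.x * warps_per_block + threadIdx.x / warpSize;
    int lane    = threadIdx.x % warpSize;

    if (warp_id < num_reads && lane == 0) {
        // each warp issues one uncoalesced global read; the warps' requests
        // are issued in close succession and overlap in flight
        results[warp_id] = table[indices[warp_id]];
    }
}
```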
I am not reading and then updating; rather, I am only applying atomicAdd() to somewhat random memory locations. The reads (as I mentioned) are done correctly; I was just wondering whether there are any tricks to improve atomic operations on random locations.
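For reference, the shared-memory accumulation mentioned in the original question looks roughly like the sketch below. It only helps when the index range fits in shared memory (NUM_BINS here is an assumed value), which is precisely what mostly random indices over a large output rule out:

```cuda
#include <cuda_runtime.h>

// Sketch of per-block privatization: accumulate into a shared-memory copy of
// the bins first, then flush with one global atomicAdd() per bin per block.
// Assumes the index range (NUM_BINS) is small enough for shared memory.
#define NUM_BINS 1024

__global__ void privatized_accumulate(const float *in, float *out, int n)
{
    __shared__ float bins[NUM_BINS];

    // zero the block-private bins
    for (int b = threadIdx.x; b < NUM_BINS; b += blockDim.x)
        bins[b] = 0.0f;
    __syncthreads();

    // shared-memory atomics are far cheaper than contended global ones
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        int bin = static_cast<int>(in[i]) % NUM_BINS; // placeholder index
        atomicAdd(&bins[bin], in[i]);
    }
    __syncthreads();

    // one global atomicAdd per bin per block, written to a known chunk
    for (int b = threadIdx.x; b < NUM_BINS; b += blockDim.x)
        atomicAdd(&out[b], bins[b]);
}
```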