It’s a very common problem, both for reading and writing, and it comes up a lot with “jobs” you’re dynamically assigning or whose results you’re reporting.
atomicAdd DOES work but, as you say, isn’t the most efficient. One trick is to report results in CHUNKS: accumulate answers in your block’s shared memory, and once you have “many” of them, use ONE atomicAdd to reserve room for the whole group. This works well when your results are small and frequent.
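A minimal sketch of the chunked approach (the counter `g_count`, the buffer `g_out`, and the predicate are all illustrative): each passing thread stages its result in shared memory via a cheap shared-memory atomic, then thread 0 issues a single global atomicAdd for the whole block.

```cuda
// Sketch: report results in chunks — one global atomicAdd per block.
__device__ unsigned int g_count;   // global output counter (assumed zeroed)

__global__ void collect(const int *in, int n, int *g_out)
{
    __shared__ int s_buf[256];           // per-block staging buffer
    __shared__ unsigned int s_n;         // results found by this block
    __shared__ unsigned int s_base;      // this block's slice of g_out

    if (threadIdx.x == 0) s_n = 0;
    __syncthreads();

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && in[i] > 0) {                      // example predicate
        unsigned int slot = atomicAdd(&s_n, 1u);   // cheap: shared memory
        s_buf[slot] = in[i];
    }
    __syncthreads();

    // ONE global atomic reserves room for the whole chunk.
    if (threadIdx.x == 0)
        s_base = atomicAdd(&g_count, s_n);
    __syncthreads();

    if (threadIdx.x < s_n)
        g_out[s_base + threadIdx.x] = s_buf[threadIdx.x];
}
```

The payoff is that global atomic traffic drops from one atomicAdd per result to one per block.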
Another common strategy is to allocate a fixed-length array and partition it so each block “owns” a range (or sometimes each input item corresponds to one output slot). Then there cannot be any conflicting writes. The disadvantage is that there are holes in the output that the consumer needs to skip over.
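The partitioned layout looks something like this sketch (slot count, sentinel, and the predicate are illustrative); no global atomics at all, at the cost of holes:

```cuda
// Sketch: each block owns a fixed slice of g_out; unused slots stay
// marked with a sentinel, and a per-block count lets the reader skip them.
#define SLOTS_PER_BLOCK 256
#define EMPTY (-1)

__global__ void collect(const int *in, int n, int *g_out, int *g_counts)
{
    __shared__ unsigned int s_n;
    if (threadIdx.x == 0) s_n = 0;
    __syncthreads();

    int *my_out = g_out + blockIdx.x * SLOTS_PER_BLOCK;
    if (threadIdx.x < SLOTS_PER_BLOCK)
        my_out[threadIdx.x] = EMPTY;     // mark the holes up front
    __syncthreads();

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && in[i] > 0)              // example predicate
        my_out[atomicAdd(&s_n, 1u)] = in[i];
    __syncthreads();

    if (threadIdx.x == 0)
        g_counts[blockIdx.x] = s_n;      // tells the consumer where holes start
}
```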
These holes can also be removed by using a compaction pass.
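If you went with a sentinel value for the holes, the compaction pass can be a one-liner with Thrust’s stream compaction (the `-1` sentinel here matches the partitioned sketch, but is still just an assumption):

```cuda
// Sketch: squeeze the holes out of the partitioned output in parallel.
#include <thrust/device_vector.h>
#include <thrust/remove.h>

struct is_hole {
    __host__ __device__ bool operator()(int x) const { return x == -1; }
};

void compact(thrust::device_vector<int> &out)
{
    // remove_if shifts the kept elements forward; erase trims the tail.
    out.erase(thrust::remove_if(out.begin(), out.end(), is_hole()),
              out.end());
}
```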
A super-fancy way is manual dynamic allocation, especially if output order doesn’t matter. At runtime, blocks request an output range; the range is found and claimed safely using atomic exchange. The block writes its results and “returns” any leftover room (via another atomic exchange) to the allocator, which stores the available partial chunks as a linked list. Later blocks can then fill those gaps with more results. This works pretty well for me when building variable-length KD trees in parallel for raytracing… it’s actually a full dynamic memory allocator and not just a simple list.
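A very reduced sketch of the atomicExch idea (all names, sizes, and the chunk layout are made up for illustration): the free list’s head is swapped with a LOCKED sentinel so only one claimant touches the list at a time. Real code needs `__threadfence()` around the publishes and should have only one thread per block do the spinning.

```cuda
// Sketch: free partial chunks form a singly linked list; atomicExch
// swaps the head with a LOCKED sentinel to claim exclusive access.
struct Chunk { int offset, size, next; };   // next: index into g_chunks

#define LOCKED (-2)
#define NONE   (-1)

__device__ int   g_head;                    // index of first free chunk
__device__ Chunk g_chunks[1024];

__device__ int alloc_chunk()                // call from ONE thread per block
{
    int h;
    do { h = atomicExch(&g_head, LOCKED); } while (h == LOCKED);
    if (h == NONE) { atomicExch(&g_head, NONE); return NONE; }
    atomicExch(&g_head, g_chunks[h].next);  // publish new head, list unlocked
    return h;                               // caller now owns this chunk
}

__device__ void free_chunk(int c)           // "return" leftover room
{
    int h;
    do { h = atomicExch(&g_head, LOCKED); } while (h == LOCKED);
    g_chunks[c].next = h;                   // link chunk back in
    atomicExch(&g_head, c);                 // chunk is available again
}
```

The block would carve its results out of the claimed chunk, then `free_chunk` whatever range is left over so later blocks can fill the gap.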