Using atomics to sum into a large vector

I have multiple threads accumulating values into a large vector. Many threads write into different locations in the vector, but there is also some overlap in the write destinations. For example:

thread 1:
[ 2 3 2 0 0 0 0 0 0… ]

thread 2:
[ 0 0 1 1 2 0 0 0 0… ]

thread 3:
[ 0 0 0 1 0 0 3 0 3… ]

etc.

Note that the zeroes wouldn’t actually be stored. Each thread stores each nonzero value and the index into the result vector where the value will go. The resulting sum vector would be:
[ 2 3 3 2 2 0 3 0 3… ]

I’m considering using atomicAdd to accumulate values directly into a global vector, so that I don’t have to store intermediate sums as I would with a parallel reduction. Can I still get decent parallelism with atomicAdd if most threads write to different memory locations? Or are there too many drawbacks to using atomics?

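I see no fundamental problem with your approach, provided that the degree of overlap is low to moderate. Atomic updates to the same address are serialized, so only the overlapping writes contend with each other; writes to distinct addresses still proceed in parallel.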
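As a rough illustration of the direct approach, here is a minimal sketch in which each thread handles one stored (index, value) pair and adds it into the result vector with atomicAdd. The names scatterAddGlobal and sumPairsGlobal, the float element type, the flat pair arrays, and the block size of 256 are all assumptions for the example, not something taken from your code.

```cuda
#include <cuda_runtime.h>
#include <vector>

// One thread per stored (index, value) pair. Pairs that target the same
// index are serialized by the atomic; pairs with distinct indices are not.
__global__ void scatterAddGlobal(const int* idx, const float* val, int n,
                                 float* result)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        atomicAdd(&result[idx[i]], val[i]);
    }
}

// Hypothetical host wrapper: copies the pairs to the device, zeroes the
// result vector, launches the kernel, and copies the sums back.
// Assumes every index lies in [0, resultSize). Error checks are omitted.
std::vector<float> sumPairsGlobal(const std::vector<int>& idx,
                                  const std::vector<float>& val,
                                  int resultSize)
{
    std::vector<float> result(resultSize, 0.0f);
    int n = static_cast<int>(idx.size());
    if (n == 0 || resultSize == 0) return result;

    int* dIdx = nullptr;
    float* dVal = nullptr;
    float* dResult = nullptr;
    cudaMalloc(&dIdx, n * sizeof(int));
    cudaMalloc(&dVal, n * sizeof(float));
    cudaMalloc(&dResult, resultSize * sizeof(float));

    cudaMemcpy(dIdx, idx.data(), n * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(dVal, val.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemset(dResult, 0, resultSize * sizeof(float));  // all-zero bits == 0.0f

    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    scatterAddGlobal<<<blocks, threads>>>(dIdx, dVal, n, dResult);

    // This blocking copy also waits for the kernel to finish.
    cudaMemcpy(result.data(), dResult, resultSize * sizeof(float),
               cudaMemcpyDeviceToHost);

    cudaFree(dIdx);
    cudaFree(dVal);
    cudaFree(dResult);
    return result;
}
```

One thing to keep in mind with float data: the order in which the atomic additions land is not fixed, so the low-order bits of a sum can differ from run to run.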

If writing directly to global memory turns out to be slower than expected, consider accumulating intermediate sums in shared memory first, using shared memory atomics.
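A sketch of what the shared memory variant could look like is below. Since the result vector is large, a block can only hold a window of it in shared memory. Here I assume the pairs handled by one block mostly land close together, and take the window to start at the index of the block's first pair. The window size TILE, the names scatterAddShared and sumPairsShared, and the fallback to a global atomicAdd for indices outside the window are choices made for this example only.

```cuda
#include <cuda_runtime.h>
#include <vector>

constexpr int TILE = 1024;  // assumed window size, in elements, per block

__global__ void scatterAddShared(const int* idx, const float* val, int n,
                                 int resultSize, float* result)
{
    __shared__ float tile[TILE];
    for (int t = threadIdx.x; t < TILE; t += blockDim.x) tile[t] = 0.0f;
    __syncthreads();

    // Window start: destination index of this block's first pair.
    int base = idx[blockIdx.x * blockDim.x];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        int local = idx[i] - base;
        if (local >= 0 && local < TILE) {
            atomicAdd(&tile[local], val[i]);     // shared memory atomic
        } else {
            atomicAdd(&result[idx[i]], val[i]);  // outside the window
        }
    }
    __syncthreads();

    // Flush the window. Other blocks may cover the same indices, so the
    // flush still has to use atomics, but only one per touched element.
    for (int t = threadIdx.x; t < TILE; t += blockDim.x) {
        int g = base + t;
        if (g < resultSize && tile[t] != 0.0f) atomicAdd(&result[g], tile[t]);
    }
}

// Hypothetical host wrapper; assumes every index lies in [0, resultSize).
// Error checks are omitted.
std::vector<float> sumPairsShared(const std::vector<int>& idx,
                                  const std::vector<float>& val,
                                  int resultSize)
{
    std::vector<float> result(resultSize, 0.0f);
    int n = static_cast<int>(idx.size());
    if (n == 0 || resultSize == 0) return result;

    int* dIdx = nullptr;
    float* dVal = nullptr;
    float* dResult = nullptr;
    cudaMalloc(&dIdx, n * sizeof(int));
    cudaMalloc(&dVal, n * sizeof(float));
    cudaMalloc(&dResult, resultSize * sizeof(float));
    cudaMemcpy(dIdx, idx.data(), n * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(dVal, val.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemset(dResult, 0, resultSize * sizeof(float));

    int threads = 256;
    int blocks = (n + threads - 1) / threads;  // guarantees blockIdx.x * threads < n
    scatterAddShared<<<blocks, threads>>>(dIdx, dVal, n, resultSize, dResult);

    cudaMemcpy(result.data(), dResult, resultSize * sizeof(float),
               cudaMemcpyDeviceToHost);
    cudaFree(dIdx);
    cudaFree(dVal);
    cudaFree(dResult);
    return result;
}
```

This only pays off when many pairs in a block hit the same window. If the destinations are scattered all over the vector, most updates take the global fallback, and the extra zeroing and flushing of the tile is pure overhead.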