I have a rather simple problem: each thread computes a unknown amount of data (vertices) and i would like to collect them into a global array.
How can I avoid to loose a lot of performance due to write clashes?
What is the best alternative to using an index variable and perform atomic writes using that index variable? Or is this a good idea?
each thread stores in shared memory how many elements it wants do write. (T[threadIdx.x]:=amount_of_data)
you perform a prefix sum (prefix scan) on array T. As a result each cell of the array holds the sum of all elements before it. There are efficient algorithms for that, google it or even search this forum :)
last cell of the array should hold number N - a number of all data to be stored by the whole block.
atomically increment global index by N (prev:=atomicAdd(ptr,N).
Now each thread may safetly store its data under cells prev+T[threadIdx.x-1] … prev+T[threadIdx.x]
If the order of data is not significant for you and one thread may store its variable-length data at various positions (not necessairly one after another), you might want to consider using the reserved memory of size N differently, to have a more coalesced write instruction.
Using atomic exchange like that definitely won’t work - you are effectively defeating the atomic access by using a non atomic read in that example. The only safe way to do that is to use an atomic increment function. But the other suggestions are much better. Break up your output space into chunks, one for each block. Have all the threads in a block write into their own chunk (that way you can use block level synchronization, shared memory, shared memory atomics and all the other useful block level facilities which will make things faster). Block level memory access also gives you the opportunity to coalesce the global memory writes. Use global memory atomics only when a block fills its current output chunk and needs a mutex on the global variable that points to the next free chunk.