write results in parallel creating an unknown number of data elements in each thread

Hello Cuda-Gurus,

I have a rather simple problem: each thread computes a unknown amount of data (vertices) and i would like to collect them into a global array.
How can I avoid to loose a lot of performance due to write clashes?

What is the best alternative to using an index variable and perform atomic writes using that index variable? Or is this a good idea?


I would do it the following way:

  • each thread stores in shared memory how many elements it wants do write. (T[threadIdx.x]:=amount_of_data)
  • you perform a prefix sum (prefix scan) on array T. As a result each cell of the array holds the sum of all elements before it. There are efficient algorithms for that, google it or even search this forum :)
  • last cell of the array should hold number N - a number of all data to be stored by the whole block.
  • atomically increment global index by N (prev:=atomicAdd(ptr,N).
  • Now each thread may safetly store its data under cells prev+T[threadIdx.x-1] … prev+T[threadIdx.x]

If the order of data is not significant for you and one thread may store its variable-length data at various positions (not necessairly one after another), you might want to consider using the reserved memory of size N differently, to have a more coalesced write instruction.

If the number of vertices output by each thread is not too disparate or at least has a reasonable upper bound then you could simply assign each thread an output “bucket” followed by a stream compaction. You could use thrust http://thrust.googlecode.com/svn/tags/1.1…compaction.html or cudpp http://www.gpgpu.org/static/developer/cudp…027140aae9c51bd.

If the threads dont know in advance how many vertices they will output then each thread (or block) could be assigned a ‘chunk’ of space, and if it fills that could get another chunk. So adapting Cygnus X1 and eelsen’s suggestions.
Its kind of the reverse of http://forums.nvidia.com/index.php?showtop…mp;#entry584153

so you think something simple like this would not work?

push_back(vertex) {

   //get index for current write

   uint curIdx = atomicExch(&numVertices, numVertices+1);


   //set vertex

   vertices[curIdx] = vertex;


Using atomic exchange like that definitely won’t work - you are effectively defeating the atomic access by using a non atomic read in that example. The only safe way to do that is to use an atomic increment function. But the other suggestions are much better. Break up your output space into chunks, one for each block. Have all the threads in a block write into their own chunk (that way you can use block level synchronization, shared memory, shared memory atomics and all the other useful block level facilities which will make things faster). Block level memory access also gives you the opportunity to coalesce the global memory writes. Use global memory atomics only when a block fills its current output chunk and needs a mutex on the global variable that points to the next free chunk.