write results in parallel creating an unknown number of data elements in each thread

Raphael · January 18, 2010, 1:49pm

Hello Cuda-Gurus,

I have a rather simple problem: each thread computes an unknown amount of data (vertices), and I would like to collect them into a global array.
How can I avoid losing a lot of performance due to write clashes?

What is the best alternative to keeping an index variable and performing atomic writes through it? Or is that a good idea?

Thanks!
Raphael

Cygnus_X1 · January 18, 2010, 10:53pm

I would do it the following way:

each thread stores in shared memory how many elements it wants to write (T[threadIdx.x] := amount_of_data).
you perform an inclusive prefix sum (prefix scan) on array T. As a result each cell of the array holds the sum of all elements up to and including it. There are efficient algorithms for that, google it or even search this forum :)
the last cell of the array then holds N, the number of elements to be stored by the whole block.
one thread atomically increments the global index by N (prev := atomicAdd(ptr, N)) and shares prev with the rest of the block.
Now each thread may safely store its data in cells prev+T[threadIdx.x-1] … prev+T[threadIdx.x]-1 (thread 0 starts at prev).
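
The steps above could be sketched roughly as follows. This is only an illustration under assumptions, not a tuned implementation: the block size `BLOCK`, the helper `countFor` (a stand-in for the real per-thread workload), the kernel name `collect` and the host wrapper `runCollect` are all made up for the example, the scan is a simple Hillis-Steele scan rather than one of the faster work-efficient variants, and error checking is left out.

```cuda
#include <cuda_runtime.h>

#define BLOCK 128   // assumed block size; the kernel must be launched with it

// Hypothetical workload: thread gid produces (gid % 4) elements.
__device__ unsigned int countFor(int gid) { return gid % 4; }

__global__ void collect(int *out, unsigned int *globalCount, int n)
{
    __shared__ unsigned int T[BLOCK];
    __shared__ unsigned int prev;

    int tid = threadIdx.x;
    int gid = blockIdx.x * blockDim.x + tid;
    unsigned int mine = (gid < n) ? countFor(gid) : 0;

    // Step 1: every thread publishes how many elements it wants to write.
    T[tid] = mine;
    __syncthreads();

    // Step 2: inclusive prefix sum over T (simple Hillis-Steele scan).
    for (int off = 1; off < BLOCK; off <<= 1) {
        unsigned int v = (tid >= off) ? T[tid - off] : 0;
        __syncthreads();
        T[tid] += v;
        __syncthreads();
    }

    // Steps 3 and 4: the last cell holds the block total N; one thread
    // reserves N slots in the global array with a single atomicAdd.
    if (tid == BLOCK - 1) prev = atomicAdd(globalCount, T[BLOCK - 1]);
    __syncthreads();

    // Step 5: each thread writes into its own reserved slice.
    unsigned int start = prev + T[tid] - mine;
    for (unsigned int k = 0; k < mine; ++k)
        out[start + k] = gid;   // payload: id of the producing thread
}

// Host wrapper (hypothetical): returns the total number of elements written.
unsigned int runCollect(int n)
{
    int *dOut;
    unsigned int *dCount, hCount = 0;
    cudaMalloc(&dOut, 3 * n * sizeof(int));   // at most 3 elements per thread
    cudaMalloc(&dCount, sizeof(unsigned int));
    cudaMemset(dCount, 0, sizeof(unsigned int));
    collect<<<(n + BLOCK - 1) / BLOCK, BLOCK>>>(dOut, dCount, n);
    cudaMemcpy(&hCount, dCount, sizeof(hCount), cudaMemcpyDeviceToHost);
    cudaFree(dOut);
    cudaFree(dCount);
    return hCount;
}
```

With this layout there is one global atomic per block instead of one per element, and the elements of a block end up contiguous in the output array.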

If the order of the data is not significant to you, and one thread may store its variable-length data at various positions (not necessarily one after another), you might want to use the reserved memory of size N differently to get more coalesced writes.

eelsen · January 18, 2010, 11:19pm

If the number of vertices output by each thread is not too disparate, or at least has a reasonable upper bound, then you could simply assign each thread an output “bucket” and follow up with a stream compaction. You could use thrust http://thrust.googlecode.com/svn/tags/1.1…compaction.html or cudpp http://www.gpgpu.org/static/developer/cudp…027140aae9c51bd.
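
A minimal sketch of the bucket idea, assuming a recent Thrust rather than the 1.1 release linked above: each thread owns `MAX_PER_THREAD` slots, marks unused ones with a sentinel, and `thrust::copy_if` then packs the used slots together. The names `MAX_PER_THREAD`, `EMPTY`, `IsUsed`, `fillBuckets` and `runCompaction`, as well as the fake workload, are assumptions made for the example.

```cuda
#include <cstddef>
#include <cuda_runtime.h>
#include <thrust/copy.h>
#include <thrust/device_vector.h>

#define MAX_PER_THREAD 4   // assumed upper bound on outputs per thread
#define EMPTY (-1)         // assumed sentinel that marks an unused slot

// Predicate for the compaction: keep every slot that was actually written.
struct IsUsed {
    __host__ __device__ bool operator()(int v) const { return v != EMPTY; }
};

__global__ void fillBuckets(int *buckets, int n)
{
    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    if (gid >= n) return;
    int *mine = buckets + gid * MAX_PER_THREAD;   // this thread's bucket
    int count = gid % (MAX_PER_THREAD + 1);       // hypothetical workload: 0..4
    for (int k = 0; k < MAX_PER_THREAD; ++k)
        mine[k] = (k < count) ? gid : EMPTY;
}

// Host wrapper (hypothetical): returns how many elements survive compaction.
std::size_t runCompaction(int n)
{
    thrust::device_vector<int> buckets(n * MAX_PER_THREAD);
    thrust::device_vector<int> packed(n * MAX_PER_THREAD);
    fillBuckets<<<(n + 127) / 128, 128>>>(
        thrust::raw_pointer_cast(buckets.data()), n);
    // copy_if is stable, so the packed output keeps the bucket order.
    thrust::device_vector<int>::iterator end =
        thrust::copy_if(buckets.begin(), buckets.end(), packed.begin(), IsUsed());
    return end - packed.begin();
}
```

The cost is the extra memory for the unused slots and a second pass over the data, which is why this fits best when the per-thread upper bound is small.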

kbam · January 19, 2010, 12:46am

If the threads don’t know in advance how many vertices they will output, then each thread (or block) could be assigned a ‘chunk’ of space, and if it fills that chunk it could get another one. So this adapts Cygnus X1’s and eelsen’s suggestions.
It’s kind of the reverse of http://forums.nvidia.com/index.php?showtop...mp;#entry584153
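
One way the chunk idea might look for the per-thread case is sketched below. Everything named here (`CHUNK`, `emitChunked`, `runChunked`, the `fill` array that records how many slots of each chunk are used, and the fake workload) is an assumption for illustration. A real version would also have to handle the pool running out of chunks, and error checking is omitted.

```cuda
#include <cuda_runtime.h>
#include <vector>

#define CHUNK 8   // assumed chunk size in elements

__global__ void emitChunked(int *pool, int *fill, unsigned int *nextChunk, int n)
{
    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    if (gid >= n) return;

    int produced = gid % 20;   // hypothetical; unknown in advance in real code
    int chunk = -1;            // no chunk owned yet
    int used = CHUNK;          // forces an allocation on the first write

    for (int k = 0; k < produced; ++k) {
        if (used == CHUNK) {
            if (chunk >= 0) fill[chunk] = CHUNK;      // old chunk is full
            chunk = (int)atomicAdd(nextChunk, 1u);    // grab the next free chunk
            used = 0;
        }
        pool[chunk * CHUNK + used++] = gid;
    }
    if (chunk >= 0) fill[chunk] = used;   // record the partially filled chunk
}

// Host wrapper (hypothetical): returns the total number of stored elements.
int runChunked(int n)
{
    const int maxChunks = 3 * n;   // 19 elements need at most 3 chunks of 8
    int *dPool, *dFill;
    unsigned int *dNext, hNext = 0;
    cudaMalloc(&dPool, maxChunks * CHUNK * sizeof(int));
    cudaMalloc(&dFill, maxChunks * sizeof(int));
    cudaMalloc(&dNext, sizeof(unsigned int));
    cudaMemset(dNext, 0, sizeof(unsigned int));

    emitChunked<<<(n + 127) / 128, 128>>>(dPool, dFill, dNext, n);

    cudaMemcpy(&hNext, dNext, sizeof(hNext), cudaMemcpyDeviceToHost);
    std::vector<int> fill(hNext);
    cudaMemcpy(fill.data(), dFill, hNext * sizeof(int), cudaMemcpyDeviceToHost);

    int total = 0;
    for (int f : fill) total += f;   // sum the fill level of every used chunk

    cudaFree(dPool);
    cudaFree(dFill);
    cudaFree(dNext);
    return total;
}
```

The global atomic is hit once per chunk rather than once per element, at the price of partially filled chunks that a later pass has to skip over or compact.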

Raphael · January 21, 2010, 10:31am

So you think something simple like this would not work?

push_back(vertex) {
   //get index for current write
   uint curIdx = atomicExch(&numVertices, numVertices+1);

   //set vertex
   vertices[curIdx] = vertex;
}

avidday · January 21, 2010, 10:54am

Using atomic exchange like that definitely won’t work: the numVertices+1 argument comes from a non-atomic read, so concurrent threads can compute the same new value, increments get lost and slots get overwritten. You are effectively defeating the atomic access. The only safe way to do that is to use an atomic increment function.

But the other suggestions are much better. Break up your output space into chunks, one for each block. Have all the threads in a block write into their own chunk (that way you can use block-level synchronization, shared memory, shared memory atomics and all the other useful block-level facilities which will make things faster). Block-level memory access also gives you the opportunity to coalesce the global memory writes. Use global memory atomics only when a block fills its current output chunk and needs a mutex on the global variable that points to the next free chunk.
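
For reference, a working version of the simple push_back could look like the sketch below, using `atomicAdd` as the atomic increment. The `float3` vertex type, the kernel `emitVertices`, the host wrapper `runPushBack` and the fake workload are assumptions for the example; there is no bounds check on the array and no error checking.

```cuda
#include <cuda_runtime.h>

// atomicAdd returns the old counter value, so every caller gets a unique index.
__device__ void push_back(float3 *vertices, unsigned int *numVertices, float3 vertex)
{
    unsigned int curIdx = atomicAdd(numVertices, 1u);
    vertices[curIdx] = vertex;
}

__global__ void emitVertices(float3 *vertices, unsigned int *numVertices, int n)
{
    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    if (gid >= n) return;
    int count = gid % 4;   // hypothetical workload: 0..3 vertices per thread
    for (int k = 0; k < count; ++k)
        push_back(vertices, numVertices, make_float3((float)gid, (float)k, 0.0f));
}

// Host wrapper (hypothetical): returns the final vertex count.
unsigned int runPushBack(int n)
{
    float3 *dVerts;
    unsigned int *dCount, hCount = 0;
    cudaMalloc(&dVerts, 3 * n * sizeof(float3));   // at most 3 vertices per thread
    cudaMalloc(&dCount, sizeof(unsigned int));
    cudaMemset(dCount, 0, sizeof(unsigned int));
    emitVertices<<<(n + 127) / 128, 128>>>(dVerts, dCount, n);
    cudaMemcpy(&hCount, dCount, sizeof(hCount), cudaMemcpyDeviceToHost);
    cudaFree(dVerts);
    cudaFree(dCount);
    return hCount;
}
```

This is correct but serializes every write on a single global counter and scatters the output, which is the performance concern the block-level approaches are meant to address.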
