shared iterator for all threads

M2Mpisatel · May 7, 2015, 2:56pm

Hi! Is there a way to create something like a shared iterator? In my program each thread compares a value in a large 2D array with some threshold. When a thread finds a value that is higher than the threshold, it puts this value into a global array. Because several threads can find such values (there are many values higher than the threshold), I use atomicAdd to increment the index of the target array. Unfortunately, the atomic operation takes so much time that atomicAdd alone runs longer than the rest of the program. Can you suggest an alternative?
if (abs(M[tid]) >= threshold)
{
    // atomicAdd returns the old counter value, so each thread gets a unique slot
    iterator = atomicAdd(iter, 1);
    dev_list[iterator] = M[tid];
}

Robert_Crovella · May 7, 2015, 3:10pm

One possible approach is to mark each element in the array that meets your decision criteria (this can be done independently, i.e. massively parallel). You then use a stream compaction methodology to “gather” only the marked elements into a final array.

The steps are:

1. mark each element in the original array (M) that meets your decision criteria, using a separate marker array
2. perform a prefix sum on the marker array (assumption here is that mark = 1, no mark = 0)
3. use the prefix sum computed in step 2 as the offset index to move the marked elements from the original array (M) into the final array (dev_list)

Note that synchronization is needed between each of the above steps. Steps 1 and 3, however, are trivial to code and can be done in a fully parallel, independent fashion. Also note that the final value of the prefix sum computed in step 2 can be used to determine the necessary size for the final array (dev_list) to be allocated in step 3.
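A minimal sketch of the three steps, under some assumptions that are not in the original post: M holds float values, the scan in step 2 is an exclusive scan done with thrust::exclusive_scan, and the helper names (mark_kernel, scatter_kernel, compact_by_scan) are hypothetical. Error checking is omitted for brevity. Because all work is issued to the default stream, each step finishes before the next one starts.

```cuda
#include <cmath>
#include <vector>
#include <cuda_runtime.h>
#include <thrust/execution_policy.h>
#include <thrust/scan.h>

// Step 1: flag[i] = 1 if |M[i]| >= threshold, else 0. Fully independent per element.
__global__ void mark_kernel(const float* M, int* flag, int n, float threshold)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n) flag[tid] = (fabsf(M[tid]) >= threshold) ? 1 : 0;
}

// Step 3: every marked element goes to the slot given by its scan value.
__global__ void scatter_kernel(const float* M, const int* flag, const int* offset,
                               float* dev_list, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n && flag[tid]) dev_list[offset[tid]] = M[tid];
}

// Hypothetical host driver: returns the compacted values in their original order.
std::vector<float> compact_by_scan(const std::vector<float>& h_M, float threshold)
{
    int n = static_cast<int>(h_M.size());
    if (n == 0) return {};

    float* M = nullptr;
    int* flag = nullptr;
    int* offset = nullptr;
    cudaMalloc(&M, n * sizeof(float));
    cudaMalloc(&flag, n * sizeof(int));
    cudaMalloc(&offset, n * sizeof(int));
    cudaMemcpy(M, h_M.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    mark_kernel<<<blocks, threads>>>(M, flag, n, threshold);

    // Step 2: exclusive prefix sum, so offset[i] = number of marks before element i.
    thrust::exclusive_scan(thrust::device, flag, flag + n, offset);

    // With an exclusive scan, the total count is the last offset plus the last flag.
    int last_offset = 0, last_flag = 0;
    cudaMemcpy(&last_offset, offset + n - 1, sizeof(int), cudaMemcpyDeviceToHost);
    cudaMemcpy(&last_flag, flag + n - 1, sizeof(int), cudaMemcpyDeviceToHost);
    int count = last_offset + last_flag;

    std::vector<float> out(count);
    if (count > 0) {
        float* dev_list = nullptr;
        cudaMalloc(&dev_list, count * sizeof(float));  // sized from the scan result
        scatter_kernel<<<blocks, threads>>>(M, flag, offset, dev_list, n);
        cudaMemcpy(out.data(), dev_list, count * sizeof(float), cudaMemcpyDeviceToHost);
        cudaFree(dev_list);
    }
    cudaFree(M);
    cudaFree(flag);
    cudaFree(offset);
    return out;
}
```

Unlike the atomicAdd version, this keeps the selected values in the same order as in M, at the cost of two extra integer arrays and an extra pass over the data.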

Thrust has good stream compaction utilities. If you don’t want to use Thrust, the basic building block is a prefix sum, which identifies the starting position in the final array for each marked element. CUB also provides a prefix sum.
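As a rough illustration of the Thrust route, thrust::copy_if performs the mark, scan and gather in a single call. The sketch below assumes float data; the functor AboveThreshold and the wrapper compact_with_thrust are hypothetical names introduced here, not something from the original post.

```cuda
#include <cmath>
#include <vector>
#include <thrust/copy.h>
#include <thrust/device_vector.h>

// Predicate (hypothetical name): true when |x| >= threshold.
struct AboveThreshold
{
    float threshold;
    __host__ __device__ bool operator()(float x) const
    {
        return fabsf(x) >= threshold;
    }
};

// Hypothetical wrapper: copies the input to the device, compacts it, copies it back.
std::vector<float> compact_with_thrust(const std::vector<float>& h_M, float threshold)
{
    thrust::device_vector<float> M(h_M.begin(), h_M.end());

    // Worst case: every element passes, so reserve room for all of them.
    thrust::device_vector<float> dev_list(M.size());

    // copy_if returns an iterator one past the last element it wrote.
    auto end = thrust::copy_if(M.begin(), M.end(), dev_list.begin(),
                               AboveThreshold{threshold});
    dev_list.resize(end - dev_list.begin());

    std::vector<float> out(dev_list.size());
    thrust::copy(dev_list.begin(), dev_list.end(), out.begin());
    return out;
}
```

thrust::copy_if is stable, so the selected values come out in the same relative order as in the input.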

Here are a couple discussions of stream compaction using prefix-sum:

Stream compaction within cuda kernel for maintaining priority queue - Stack Overflow

Stream compaction (or Array Packing) with prefix scan using Openmp - Stack Overflow

The benefit of this approach over the atomic approach will be data dependent. For a high density of marked values, the stream compaction approach may be faster. For a low density of marked values, the atomic approach may be faster.
