shared iterator for all threads

Hi! Is there a method to create something like shared iterator? In my program each thread compares value in large 2D array with some threshold. When some thread find the value that is higher than threshold, it puts this value into global array. Because it is possible that several threads can find such values(there are many values , higher than threshold) I use AtomicAdd to increment index of target array. But , unfortunately, Atomic operation takes so much time, and just AtomicAdd works longer than the other part of program. So, can you suggest some alternative for me?
if(abs(M[tid]) >= threshold)
{
iterator = atomicAdd(iter,1);
dev_list[iterator] = M[tid];
}

One possible approach is to mark each element in the array that meets your decision criteria (this can be done independently, i.e. massively parallel). You then use a stream compaction methodology to “gather” only the marked elements into a final array.

The steps are:

  1. mark each element in the original array (M), using a separate marking array, that meets your decision criteria
  2. perform a prefix sum on the marker array (assumption here is that mark = 1, no mark = 0)
  3. use the prefix-sum-on-the-marker array computed in step 2 as the offset index to move the marked elements from the original array (M) into the final array (dev_list)

Note that synchronization is needed between each of the above steps. Steps 1 and 3, however are trivial to code and can be done in a fully parallel independent fashion. Also note that the final value of the prefix sum computed in step 2 can be used to determine the necessary size for the final array (dev_list) to be allocated in step 3.

thrust has good stream compaction utilities. If you don’t want to use thrust, the basic building block is a prefix sum, to identify the starting position in the final array for each marked element. CUB also has prefix-sum.

Here are a couple discussions of stream compaction using prefix-sum:

http://stackoverflow.com/questions/21674184/stream-compaction-within-cuda-kernel-for-maintaining-priority-queue

http://stackoverflow.com/questions/26757503/stream-compaction-or-array-packing-with-prefix-scan-using-openmp

The benefit of this approach over the atomic approach will be data dependent. For a high density of marked values, the stream compaction approach may be faster. For a low density of marked values, the atomic approach may be faster.