threads writing to an array position dependent on a comparison result...


I’d like to CUDAify the following serial code:


for(i = 0; i < n; i++)


if(a[permarray[i]] != 0)


      b[m] = i;




The arrays a and permarray have n elements each, the permarray’s elements are permuted, unique numbers from 0 to n-1.

As a result array b should contain m permuted positions where a’s element isn’t equal 0, apart from that m is the size of b.

I’ve considered incrementing m with an atomic operation (which would make sure that m is computed correctly):


shared m;

int idx = blockIdx.x*blockDim.x + threadIdx.x;

if (idx==0) m=0;


    if (idx < n) {

            if (a[permarray[idx]] != 0) {







… but this doesn’t prevent the threads from writing into the same positions of the b array.

The sequence of positions in b doesn’t matter, they just have to be there - m of them.

Any ideas?

Allocate another array of size n, let’s name it c.

inside your kernel do this:


int idx = blockIdx.x*blockDim.x + threadIdx.x;

if (idx < n) c[idx] = a[permarray[idx]] == 0 ? -1 : idx;


now start an another kernel that does a compact of array c removing the -1 entries and you are done.

The “compacting” part is the problem here, are there any examples or documentation of such a kernel?

Hi, there is a library called CUDPP that provides such a compact function that you can call from your C code and pass in device pointers to device memory that you’ve already allocated. This function will put the compacted elements in a global memory buffer of your choosing. The compact operation is implemented efficiently and does not use atomic memory operations. I use it in my application and the performance is very reasonable (1m elements down to 200k elements in about 1 ms).

There is a bunch of literature on how this operation (“stream compaction”) works and how you can do it efficiently in parallel, which I encourage you to check out, or you can jump straight to CUDPP if you just want to compact your buffer and be done with it :)



just to add to what mike wrote…
you can find a good description of parallel stream compaction or the more general parallel prefix sum algorithm here.


Thanks to all for the info :)