threads writing to an array position dependent on a comparison result...

Hello,

I’d like to CUDAify the following serial code:

[codebox]m = 0;

for (i = 0; i < n; i++)
{
    if (a[permarray[i]] != 0)
    {
        b[m] = i;
        m++;
    }
}[/codebox]

The arrays a and permarray have n elements each; permarray holds a permutation, i.e. unique numbers from 0 to n-1 in some shuffled order.

As a result, array b should contain the m positions i for which a[permarray[i]] isn't equal to 0, and m is the number of valid entries in b.

I’ve considered incrementing m with an atomic operation (which would make sure that m is computed correctly):

[codebox]
__shared__ int m;

int idx = blockIdx.x*blockDim.x + threadIdx.x;

if (threadIdx.x == 0) m = 0;  // shared memory is per block, so each block zeroes its own copy
__syncthreads();

if (idx < n) {
        if (a[permarray[idx]] != 0) {
                atomicAdd(&m, 1);
                b[m] = idx;  // race: m may already have been incremented again by another thread
        }
}
__syncthreads();
[/codebox]

… but this doesn’t prevent the threads from writing into the same positions of the b array.

The order of the positions in b doesn't matter; all m of them just have to be there.
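One thing I noticed while writing this up: atomicAdd returns the value the counter held *before* the increment, so if the counter lived in global memory instead of per-block shared memory, that return value should give every thread a unique slot. A sketch of what I mean (not tested):

```cuda
// Sketch: counter *m in global memory, zeroed with cudaMemset before launch.
// The value atomicAdd returns is a slot no other thread can get.
__global__ void collect(const int *a, const int *permarray,
                        int *b, int *m, int n)
{
    int idx = blockIdx.x*blockDim.x + threadIdx.x;
    if (idx < n && a[permarray[idx]] != 0) {
        int pos = atomicAdd(m, 1);  // unique across all blocks
        b[pos] = idx;               // positions land in b unordered
    }
}
```

The unordered output would be fine here, but I don't know how the global atomics perform under contention.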

Any ideas?

Allocate another array of size n, let’s name it c.

Inside your kernel, do this:

[codebox]

int idx = blockIdx.x*blockDim.x + threadIdx.x;

if (idx < n) c[idx] = a[permarray[idx]] == 0 ? -1 : idx;

[/codebox]

Now launch another kernel that compacts array c by removing the -1 entries, and you are done.

The “compacting” part is the problem here, are there any examples or documentation of such a kernel?

Hi, there is a library called CUDPP that provides such a compact function: you call it from your C code and pass in device pointers to memory you've already allocated, and it puts the compacted elements in a global-memory buffer of your choosing. The compact operation is implemented efficiently and does not use atomic memory operations. I use it in my application and the performance is very reasonable (compacting 1M elements down to 200k in about 1 ms).

There is a bunch of literature on how this operation (“stream compaction”) works and how you can do it efficiently in parallel, which I encourage you to check out, or you can jump straight to CUDPP if you just want to compact your buffer and be done with it :)
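As an aside (a sketch, not something I've benchmarked): NVIDIA's Thrust library exposes the same operation as thrust::copy_if, which you can point at your device buffers directly:

```cpp
// Sketch: compacting c (device pointer, length n) into b with Thrust,
// keeping everything that isn't -1.
#include <thrust/copy.h>
#include <thrust/execution_policy.h>

struct is_valid {
    __host__ __device__ bool operator()(int x) const { return x != -1; }
};

// ...after the marking kernel has filled c:
int *end = thrust::copy_if(thrust::device, c, c + n, b, is_valid());
int m = end - b;  // number of surviving elements
```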

cheers,

mike

just to add to what mike wrote…
you can find a good description of parallel stream compaction or the more general parallel prefix sum algorithm here.

cheers
Robert

Thanks to all for the info :)