threads writing to an array position dependent on a comparison result...

Hello,

I’d like to CUDAify the following serial code:

[codebox]m = 0;

for (i = 0; i < n; i++)
{
    if (a[permarray[i]] != 0)
    {
        b[m] = i;
        m++;
    }
}[/codebox]

The arrays a and permarray have n elements each; permarray holds a permutation, i.e. unique numbers from 0 to n-1 in some shuffled order.

As a result, array b should contain the m positions i for which a[permarray[i]] isn't equal to 0, and m is the number of valid entries in b.

I’ve considered incrementing m with an atomic operation (which would make sure that m is computed correctly):

[codebox]
__shared__ int m;

int idx = blockIdx.x*blockDim.x + threadIdx.x;

if (threadIdx.x == 0) m = 0;  // shared memory is per block, so each block zeroes its own copy
__syncthreads();

if (idx < n) {
        if (a[permarray[idx]] != 0) {
                atomicAdd(&m, 1);
                b[m] = idx;  // race: m may already have been incremented again by another thread
        }
}
__syncthreads();
[/codebox]

… but this doesn’t prevent the threads from writing into the same positions of the b array.

The order of the positions in b doesn't matter; all m of them just have to be there.
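One thing I noticed while writing this up: atomicAdd returns the value the counter held *before* the increment, so if the counter lived in global memory instead of per-block shared memory, that return value should give every thread a unique slot. A sketch of what I mean (not tested):

```cuda
// Sketch: counter *m in global memory, zeroed with cudaMemset before launch.
// The value atomicAdd returns is a slot no other thread can get.
__global__ void collect(const int *a, const int *permarray,
                        int *b, int *m, int n)
{
    int idx = blockIdx.x*blockDim.x + threadIdx.x;
    if (idx < n && a[permarray[idx]] != 0) {
        int pos = atomicAdd(m, 1);  // unique across all blocks
        b[pos] = idx;               // positions land in b unordered
    }
}
```

The unordered output would be fine here, but I don't know how the global atomics perform under contention.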

Any ideas?

Allocate another array of size n, let’s name it c.

Inside your kernel, do this:

[codebox]

int idx = blockIdx.x*blockDim.x + threadIdx.x;

if (idx < n) c[idx] = a[permarray[idx]] == 0 ? -1 : idx;

[/codebox]

Now launch another kernel that compacts array c by removing the -1 entries, and you are done.

The “compacting” part is the problem here, are there any examples or documentation of such a kernel?

Hi, there is a library called CUDPP that provides such a compact function: you call it from your C code and pass in device pointers to memory you've already allocated, and it puts the compacted elements in a global-memory buffer of your choosing. The compact operation is implemented efficiently and does not use atomic memory operations. I use it in my application and the performance is very reasonable (compacting 1M elements down to 200k in about 1 ms).

There is a bunch of literature on how this operation (“stream compaction”) works and how you can do it efficiently in parallel, which I encourage you to check out, or you can jump straight to CUDPP if you just want to compact your buffer and be done with it :)
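As an aside (a sketch, not something I've benchmarked): NVIDIA's Thrust library exposes the same operation as thrust::copy_if, which you can point at your device buffers directly:

```cpp
// Sketch: compacting c (device pointer, length n) into b with Thrust,
// keeping everything that isn't -1.
#include <thrust/copy.h>
#include <thrust/execution_policy.h>

struct is_valid {
    __host__ __device__ bool operator()(int x) const { return x != -1; }
};

// ...after the marking kernel has filled c:
int *end = thrust::copy_if(thrust::device, c, c + n, b, is_valid());
int m = end - b;  // number of surviving elements
```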

cheers,

mike

just to add to what mike wrote…
you can find a good description of parallel stream compaction or the more general parallel prefix sum algorithm here.

cheers
Robert

Thanks to all for the info :)