Sign determination in CUDA

Hi,

I have an array of 128 positive and negative ints in shared memory array and I would like CUDA to return as quickly as possible the sign of the leftmost non-negative number (+1, -1 or 0 if all numbers are zero).

Which would be the proper algorithm to use? I suspect it’s a modified reduction algorithm, but how would it have to be modified so it fits the problem description?

Christian

On compute capability 2.x you can probably come up with a fast implementation using [font=“Courier New”]__ballot()[/font] and [font=“Courier New”]__clz()[/font].

Hmm, I target compute 1.1 and later. My original data of 128 ints is in array A which must not be destroyed. So I use another array B of half the size to store intermediates. The idea is to successively compare two neighbor values and take the sign of the right if the left argument is 0. Repeat until a single value is left. Everything is unrolled like in the reduction CUDA samples. The volatile trick forces 2tid and 2tid+1 into registers, where I expect them to stay and get reused.

The 2-way bank conflicts are unfortunate. But I can’t do it like in the SDK samples because I always need to compare immediate neighbors. The EMUSYNC is for emulation mode only (CUDA SDK 2.3), just like in the SDK samples.

#define sign(a) ((a) > 0 ? +1 : ((a) < 0 ? -1 : 0))

#define GETSIGN(a,b) ((a) != 0 ? sign(a) : sign(b))

int tid = threadIdx.x;

    volatile int left, right;

    if (tid <  64)

    {

        left = 2*tid; right = 2*tid+1;

        B[tid] = GETSIGN(A,A);

    }

    __syncthreads();

#ifndef __DEVICE_EMULATION__

    if (tid < 32)

#endif

    {

        { B[tid] = GETSIGN(B,B); EMUSYNC; }

        { B[tid] = GETSIGN(B,B); EMUSYNC; }

        { B[tid] = GETSIGN(B,B); EMUSYNC; }

        { B[tid] = GETSIGN(B,B); EMUSYNC; }

        { B[tid] = GETSIGN(B,B); EMUSYNC; }

        { B[tid] = GETSIGN(B,B); EMUSYNC; }

    }

And to actually answer your question for the reduction operator:

red = left ? left : right;

I’d do the actual sign() operation just once on the final result. Other than that, I think your implementation is optimal.

To avoid the 2-way band conflict you could reorder the B array in shared memory. However __brev() is 2.x stuff as well, so I’d guess you won’t end up any faster.

thank you, I didn’t see that.