It might even be worth implementing cases 1 and 2 as reductions to avoid worst-case bank conflicts: use a shared array of 16 values and let each thread of a half-warp write to a different bank. After all threads have finished writing, use a lightweight reduction without __syncthreads() to determine the final result (see the sketch below).
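A minimal sketch of that idea, assuming a 16-bank device where a half-warp of 16 threads runs in lockstep and assuming only one half-warp of the block uses it; the name halfwarp_min() and the incoming value are made up for illustration:

__device__ int halfwarp_min(int value)
{
    // one slot per bank, so the 16 writes below are conflict-free
    __shared__ volatile int slot[16];

    int lane = threadIdx.x & 15;              // index within the half-warp
    slot[lane] = value;                       // each thread hits a different bank

    // lightweight tree reduction; no __syncthreads() because the half-warp
    // executes in lockstep (on newer hardware you would add __syncwarp())
    if (lane < 8) slot[lane] = min(slot[lane], slot[lane + 8]);
    if (lane < 4) slot[lane] = min(slot[lane], slot[lane + 4]);
    if (lane < 2) slot[lane] = min(slot[lane], slot[lane + 2]);
    if (lane < 1) slot[lane] = min(slot[lane], slot[lane + 1]);

    return slot[0];                           // minimum of the 16 values
}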
Yes, I would also do all of them as reductions (if you need to do it on the same data, that might even be faster).
I have, for example, made a kernel that calculates the mean, min, max and standard deviation of a couple of arrays of values. It is all basically a reduction and very fast.
__shared__ volatile int minima;

if (threadIdx.x == 0)
    minima = INT_MAX;                  // int sentinel (INT_MAX from <climits>); INFINITY is a float constant
__syncthreads();                       // make the initial value visible to all threads before the loop

int local_value = expression(threadIdx.x);   // expression() stands for whatever produces this thread's value

// N is the number of values (thread groups) you have to compare.
for (int i = 0; i < N; i++)
{
    if (local_value < minima)
        minima = local_value;          // racy write; repeating the loop lets the true minimum win
    else
        break;                         // another thread already holds a smaller value
}
__syncthreads();
At the end of the code “minima” holds what you want. But I don't think this will be efficient if your “thread groups” are as big as 512 (i.e. one value to compare per thread).
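For groups that large, a standard shared-memory tree reduction is usually the better choice. A minimal sketch for a 512-thread block, assuming blockDim.x is a power of two and reusing the hypothetical expression() from above:

__shared__ int values[512];            // one slot per thread of the block

values[threadIdx.x] = expression(threadIdx.x);
__syncthreads();

// halve the number of active threads each step: 512 -> 256 -> ... -> 1
for (int stride = blockDim.x / 2; stride > 0; stride /= 2)
{
    if (threadIdx.x < stride)
        values[threadIdx.x] = min(values[threadIdx.x], values[threadIdx.x + stride]);
    __syncthreads();
}

// values[0] now holds the minimum over all 512 thread values; the same loop
// with +, max or sum of squares gives the other statistics mentioned earlier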