"any"/"all" boolean operation between threads Efficient thread co-oporation

I need an efficient way to implement these three cases:

  1. does a group of threads agree (i.e., do all threads evaluate some boolean expression to true)

  2. does at least one of this group of threads evaluate some boolean expression to true

  3. get the minimum value of all the values held by a group of threads

1 and 2 could be implemented by a conditional write to shared memory, like this:

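__shared__ int any;

any = 0;

__syncthreads();   // the reset must be finished before any thread sets the flag

if (condition)     // condition stands for this thread's boolean expression
    any = 1;

__syncthreads();   // after this barrier every thread can read the result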


I know that the order in which the threads write to shared memory is undefined, but this should be safe, right?
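For reference, the flag approach above could be fleshed out roughly as follows. This is only a sketch: the kernel name `block_any_all`, the predicate `value > 0` and the host helper `run_any_all` are made up for illustration, and it assumes a single block of at most 1024 threads. Every writer stores the same constant, so in practice it does not matter which write lands last; the barriers are what make the flags safe to read.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel: one block votes on the predicate (value > 0).
// result[0] receives "any", result[1] receives "all".
__global__ void block_any_all(const int *values, int n, int *result)
{
    __shared__ int any_flag;
    __shared__ int all_flag;

    if (threadIdx.x == 0) {
        any_flag = 0;   // nobody has voted "true" yet
        all_flag = 1;   // nobody has voted "false" yet
    }
    __syncthreads();    // flags are initialised before anyone writes

    if (threadIdx.x < n) {
        bool pred = values[threadIdx.x] > 0;
        // All writers store the same constant, so the order of the
        // conflicting writes does not change the outcome.
        if (pred)  any_flag = 1;
        if (!pred) all_flag = 0;
    }
    __syncthreads();    // all votes are in before the flags are read

    if (threadIdx.x == 0) {
        result[0] = any_flag;
        result[1] = all_flag;
    }
}

// Hypothetical host helper: returns any | (all << 1) for 1 <= n <= 1024 values.
int run_any_all(const int *host_values, int n)
{
    int *d_values, *d_result, h_result[2];
    cudaMalloc(&d_values, n * sizeof(int));
    cudaMalloc(&d_result, 2 * sizeof(int));
    cudaMemcpy(d_values, host_values, n * sizeof(int), cudaMemcpyHostToDevice);
    block_any_all<<<1, n>>>(d_values, n, d_result);
    cudaMemcpy(h_result, d_result, sizeof(h_result), cudaMemcpyDeviceToHost);
    cudaFree(d_values);
    cudaFree(d_result);
    return h_result[0] | (h_result[1] << 1);
}
```

On devices of compute capability 2.0 or later, `__syncthreads_or(pred)` and `__syncthreads_and(pred)` return the same two answers to every thread with a single barrier and no shared flag.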

The only way to do 3 is a reduction, I suppose.

It might even be worth implementing cases 1 and 2 as reductions to avoid the worst-case bank conflict, where every thread writes to the same shared word: use an array of shared values any[16] and have each thread of a half-warp write to a different bank. After all threads have finished writing, use a lightweight reduction without __syncthreads() to determine the final result.
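A sketch of that idea on current hardware, with one change stated plainly: instead of sixteen hand-managed slots per half-warp it uses the warp vote intrinsic `__any_sync` (CUDA 9 or later), so each warp writes a single slot, and it keeps one barrier before the final combine. The names `block_any`, `MAX_WARPS` and `run_block_any` are hypothetical, and the block size is assumed to be a multiple of 32.

```cuda
#include <cuda_runtime.h>

#define MAX_WARPS 32   // at most 1024 threads per block

// Hypothetical kernel: result[0] = 1 if any thread's value is non-zero.
__global__ void block_any(const int *values, int *result)
{
    __shared__ int warp_any[MAX_WARPS];

    int tid  = threadIdx.x;
    int warp = tid / 32;
    int lane = tid % 32;

    // Step 1: vote inside the warp, without touching shared memory.
    int pred = values[tid] != 0;
    int any_in_warp = __any_sync(0xffffffffu, pred);

    // Step 2: one write per warp, each to its own slot.
    if (lane == 0) warp_any[warp] = any_in_warp;
    __syncthreads();

    // Step 3: the first warp combines the per-warp slots with a second vote.
    if (warp == 0) {
        int num_warps = blockDim.x / 32;
        int p = (lane < num_warps) ? warp_any[lane] : 0;
        int any_in_block = __any_sync(0xffffffffu, p);
        if (lane == 0) result[0] = (any_in_block != 0);
    }
}

// Hypothetical host helper: n must be a multiple of 32 and at most 1024.
int run_block_any(const int *host_values, int n)
{
    int *d_values, *d_result, h_result = 0;
    cudaMalloc(&d_values, n * sizeof(int));
    cudaMalloc(&d_result, sizeof(int));
    cudaMemcpy(d_values, host_values, n * sizeof(int), cudaMemcpyHostToDevice);
    block_any<<<1, n>>>(d_values, d_result);
    cudaMemcpy(&h_result, d_result, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_values);
    cudaFree(d_result);
    return h_result;
}
```

Swapping `__any_sync` for `__all_sync`, and padding unused lanes in step 3 with 1 instead of 0, would give the "all" case.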

Yes, I would also do all of them as reductions (if all three operate on the same data, a single combined reduction might even be faster).

For example, I have written a kernel that calculates the mean, min, max and standard deviation of a couple of arrays of values. It is all basically a reduction and very fast.

I dunno what reduction is.

But let me give it a try:

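__shared__ volatile int minima;

minima = INT_MAX;   // INFINITY is a float constant; INT_MAX (from <climits>) is the int equivalent

__syncthreads();

int local_value = expression(threadIdx.x);

// N is the number of values (threads in the group) you have to compare.
for (int i = 0; i < N; i++)
{
    if (local_value < minima)
        minima = local_value;

    __syncthreads();   // let this round's writes settle before the next comparison
}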

At the end of the loop “minima” holds the value you want. But I don't think this will be efficient if your thread group is as big as 512 (one value to compare per thread), since the loop then runs 512 times.

Sarnath wrote (Feb 13 2008, 01:08 PM): "I dunno what reduction is."

Check the reduction in the SDK :)

Basically it goes along these lines:

// launch with a single block of NUM_THREADS threads
__global__ void function(float *out, const float *in, int num_el)
{
    __shared__ float minimum[NUM_THREADS];

    int tid = threadIdx.x;

    // each thread first takes the minimum over its strided slice of the input
    // (assumes num_el >= NUM_THREADS)
    minimum[tid] = in[tid];

    for (int k = tid + blockDim.x; k < num_el; k += blockDim.x) {
        minimum[tid] = min(minimum[tid], in[k]);
    }
    __syncthreads();

    // tree reduction in shared memory: read the partner value from minimum[], not in[].
    // The barrier sits outside the tid guard because every thread of the block must reach it.
    if (NUM_THREADS >= 512) { if (tid < 256) { minimum[tid] = min(minimum[tid], minimum[tid + 256]); } __syncthreads(); }
    if (NUM_THREADS >= 256) { if (tid < 128) { minimum[tid] = min(minimum[tid], minimum[tid + 128]); } __syncthreads(); }
    if (NUM_THREADS >= 128) { if (tid <  64) { minimum[tid] = min(minimum[tid], minimum[tid +  64]); } __syncthreads(); }
    if (NUM_THREADS >=  64) { if (tid <  32) { minimum[tid] = min(minimum[tid], minimum[tid +  32]); } __syncthreads(); }
    if (NUM_THREADS >=  32) { if (tid <  16) { minimum[tid] = min(minimum[tid], minimum[tid +  16]); } __syncthreads(); }
    if (NUM_THREADS >=  16) { if (tid <   8) { minimum[tid] = min(minimum[tid], minimum[tid +   8]); } __syncthreads(); }
    if (NUM_THREADS >=   8) { if (tid <   4) { minimum[tid] = min(minimum[tid], minimum[tid +   4]); } __syncthreads(); }
    if (NUM_THREADS >=   4) { if (tid <   2) { minimum[tid] = min(minimum[tid], minimum[tid +   2]); } __syncthreads(); }
    if (NUM_THREADS >=   2) { if (tid <   1) { minimum[tid] = min(minimum[tid], minimum[tid +   1]); } __syncthreads(); }

    // write result for this block to global mem
    if (tid == 0) *out = minimum[0];
}


Thanks, that sounds like a smart idea

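Just NEVER forget volatile if you are doing reductions without __syncthreads()! Without it the compiler may keep a shared value in a register, so other threads never see the update. This cost me a few hours of debugging.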
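To show where the qualifier goes, here is a minimal sketch of a last-warp minimum without `__syncthreads()`. The names `warp_min`, `min_kernel` and `run_min` are hypothetical, and it assumes exactly 64 threads in the block. The `__syncwarp()` calls are an addition for CUDA 9 and later, where the threads of a warp are no longer guaranteed to run in lockstep; the code in this thread predates that.

```cuda
#include <cuda_runtime.h>

// Reduce sdata[0..63] to sdata[0] using only the first warp (tid < 32).
// volatile tells the compiler not to keep sdata[] values in registers.
__device__ void warp_min(volatile float *sdata, int tid)
{
    for (int offset = 32; offset > 0; offset /= 2) {
        if (tid < offset) {
            // writers touch [0, offset), partners are read from [offset, 2*offset)
            float other = sdata[tid + offset];
            if (other < sdata[tid]) sdata[tid] = other;
        }
        __syncwarp();   // warp-level barrier between reduction steps
    }
}

// Hypothetical kernel: minimum of 64 floats, launched as one block of 64 threads.
__global__ void min_kernel(const float *in, float *out)
{
    __shared__ float sdata[64];
    int tid = threadIdx.x;

    sdata[tid] = in[tid];
    __syncthreads();    // all 64 values are in shared memory

    if (tid < 32) warp_min(sdata, tid);
    if (tid == 0) *out = sdata[0];
}

// Hypothetical host helper: host_in must point to 64 floats.
float run_min(const float *host_in)
{
    float *d_in, *d_out, h_out = 0.0f;
    cudaMalloc(&d_in, 64 * sizeof(float));
    cudaMalloc(&d_out, sizeof(float));
    cudaMemcpy(d_in, host_in, 64 * sizeof(float), cudaMemcpyHostToDevice);
    min_kernel<<<1, 64>>>(d_in, d_out);
    cudaMemcpy(&h_out, d_out, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return h_out;
}
```

Only one block-wide barrier is needed here; every later step stays inside a single warp.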