Logical OR pattern

What will happen if I attempt to implement a logical OR operation in a single pass rather than using a reduction (i.e. I want to get the result in constant time instead of log(n) time)?

The idea is to make all of the threads execute the following (somewhat simplified) code:

[codebox]
__global__ void logical_or(int *v)
{
    // v[0] = OR(v)
    // if any element of v is nonzero, set v[0] to 1.
    const unsigned int tid = blockDim.x*blockIdx.x + threadIdx.x;

    if (v[tid]) {
        v[0] = 1;
    }
}
[/codebox]

The obvious issue is that a large number of writes of the same value to the same memory location happen simultaneously.

If it is the case that simultaneously writing the same value to the same location will reliably result in that value being written correctly, then this pattern will work.

My question, then, is whether (on all CUDA-compliant hardware) simultaneous writes of identical values will reliably produce the expected result.

  • Ken

The CUDA programming guide states that among all simultaneous writes within a warp, at least one (which one is undefined) will succeed. It mentions nothing about what occurs among all such writes in a grid. However, I have been using such a memory write pattern in a production app for years and have never noticed any strange results.

Cool! It seems like a very powerful pattern for a lot of situations. I gave only a simple example, but it appears that it could be used in many iterative algorithms where you want to do something conditionally when all elements achieve a certain condition.

Thanks for your reply.

  • Ken

Yes, that does indeed work.

However, it has a performance problem… it turns out that the colliding writes cause bank conflicts! So your writes become serialized and you’re certainly not getting any speed advantage from it… it will take N clock ticks if every thread writes, far slower than a reduction.

Perhaps a way to minimize the collisions is to do a two-stage test based on warps.

If you know your threads are not diverged, then something like this would work:

[codebox]
if (__any(v[tid])) {
    if ((tid & 31) == 0) v[0] = 1; // only one thread from the warp actually writes
}
__syncthreads();
[/codebox]

This will prevent all 32 threads of a warp from hammering the poor v[0]… you get just one single write, avoiding the serialization.
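For context, here is a minimal sketch of how that two-stage test might sit inside a full kernel (the kernel name is my own, v is assumed to hold one element per thread, and __any requires compute 1.2 or later):

[codebox]
// Sketch only: one element of v per thread, 1D grid of 1D blocks assumed.
__global__ void logical_or_warp(int *v)
{
    const unsigned int tid = blockDim.x*blockIdx.x + threadIdx.x;

    // Stage 1: one warp-wide vote instead of 32 separate tests.
    if (__any(v[tid])) {
        // Stage 2: only lane 0 of the warp performs the colliding write.
        if ((tid & 31) == 0) v[0] = 1;
    }
}
[/codebox]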

So for a thread count of, say, 256, how fast would each approach be in the best case and the worst case?

Best case is when no thread needs to write. Worst case is when every thread needs to write.

[list=1]

[*]Reduction (see the sketch after this list): (8+4+2+1+1+1+1+1)*2=38 clocks, in all cases

[*]All threads hammering v[0]: 8 clocks best case, 256 clocks worst case (due to serialization)

[*]Two-stage test: 8 clocks best case, 8+3*8=32 clocks worst case.

[/list]
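For reference, a sketch of what the reduction in item 1 could look like for a 256-thread block; the kernel name and the shared scratch array are my own additions, not from the thread:

[codebox]
// Sketch: standard shared-memory tree reduction of a logical OR for one block
// of up to 256 threads. Each block that finds a nonzero element sets v[0] = 1.
__global__ void logical_or_reduce(int *v)
{
    __shared__ int s[256];
    const unsigned int tid = threadIdx.x;

    s[tid] = (v[blockDim.x*blockIdx.x + tid] != 0);
    __syncthreads();

    // Halve the active range each step: 128, 64, ..., 1.
    for (unsigned int stride = blockDim.x/2; stride > 0; stride >>= 1) {
        if (tid < stride) s[tid] |= s[tid + stride];
        __syncthreads();
    }

    if (tid == 0 && s[0]) v[0] = 1;
}
[/codebox]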

Somehow I think my compute 1.0 and 1.1 devices won’t like it ;)

Yeah, true… but the need to minimize memory accesses remains…
So, instead, all threads could first write to a single shared memory variable, and then the result could be conditionally written to v[0] at the end… The same logic, but it would work on compute 1.0 as well.
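A minimal sketch of that idea, assuming a single shared flag per block (the flag and kernel names are mine, not from the post):

[codebox]
// Sketch: all threads set one shared flag, then a single thread writes v[0].
// No warp-vote intrinsics, so this runs on compute 1.0.
__global__ void logical_or_shared_flag(int *v)
{
    __shared__ int any_set;
    const unsigned int tid = blockDim.x*blockIdx.x + threadIdx.x;

    if (threadIdx.x == 0) any_set = 0;
    __syncthreads();

    // The colliding same-value writes now hit one shared word, not global memory.
    if (v[tid]) any_set = 1;
    __syncthreads();

    if (threadIdx.x == 0 && any_set) v[0] = 1;
}
[/codebox]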

Ok, then a different strategy works.

For compute 1.0, you could have every thread write to v[tid&31]. No bank conflicts. You then boil the 32 values down to 1 using a single warp doing 5 stages of reduction.

Best case and worst case both take the same number of clocks: 8+2*5=18.

In fact you could write to v[tid&15] instead and shave off one reduction step, reducing it to 16 clocks.

+400 to 600 clocks for global memory access :-)

Ah, I assumed v was in shared memory… so you need to use 16 words of shared as a scratch array.
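Putting that together, a sketch of the compute 1.0 variant with a 16-word shared scratch array; the zero-initialization step and all names here are my own assumptions:

[codebox]
// Sketch: each thread in a half-warp writes to a distinct scratch word
// (threadIdx.x & 15), so there are no shared memory bank conflicts. Collisions
// across warps all write the same value 1, which is harmless. A short
// reduction then folds the 16 words down to one.
__global__ void logical_or_scratch(int *v)
{
    __shared__ int scratch[16];
    const unsigned int tid = blockDim.x*blockIdx.x + threadIdx.x;

    if (threadIdx.x < 16) scratch[threadIdx.x] = 0;
    __syncthreads();

    if (v[tid]) scratch[threadIdx.x & 15] = 1;
    __syncthreads();

    // 4 steps: 16 -> 8 -> 4 -> 2 -> 1.
    for (unsigned int stride = 8; stride > 0; stride >>= 1) {
        if (threadIdx.x < stride)
            scratch[threadIdx.x] |= scratch[threadIdx.x + stride];
        __syncthreads();
    }

    if (threadIdx.x == 0 && scratch[0]) v[0] = 1;
}
[/codebox]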