Synchronization across multiple blocks - Is there any way to call syncthreads across multiple blocks?

Hello everyone,

In my code there are multiple blocks, and each block finds a local maximum.
Each block writes its local maximum to global memory; the code is like a reduction operation.
After all blocks have written their local maximums, I want one of the blocks to find the final maximum among them.
GPU hardware does not support communication across blocks, so how can I make sure that all blocks have finished writing their local maximums to global memory?
I tried searching the net, but it is not clear to me.
Please help me solve this.
I will be grateful to you for this help.
Thank you very much.

With love and regards
Praveen.

A specific example that covers almost exactly what you’re asking for is available in the CUDA SDK: the threadFenceReduction sample.

From its comments:

// This reduction kernel reduces an arbitrary size array in a single kernel invocation.
// It does so by keeping track of how many blocks have finished.  After each thread
// block completes the reduction of its own block of data, it "takes a ticket" by
// atomically incrementing a global counter.  If the ticket value is equal to the number
// of thread blocks, then the block holding the ticket knows that it is the last block
// to finish.  This last block is responsible for summing the results of all the other
// blocks.
//
// In order for this to work, we must be sure that before a block takes a ticket, all
// of its memory transactions have completed.  This is what __threadfence() does -- it
// blocks until the results of all outstanding memory transactions within the
// calling thread are visible to all other threads.
//
// For more details on the reduction algorithm (notably the multi-pass approach), see
// the "reduction" sample in the CUDA SDK.

But also note: if you’re just trying to accumulate a maximum over blocks, you can use a single atomicMax call per block. Atomics are normally expensive and should be used sparingly, but the overhead of one atomic per block is negligible.
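
For integer data, a minimal sketch might look like this (blockMaxAtomic and d_globalMax are illustrative names; d_globalMax must be initialised to INT_MIN before the launch, and blockDim.x is again assumed to be a power of two):

#include <cuda_runtime.h>
#include <limits.h>

__global__ void blockMaxAtomic(const int *in, int n, int *d_globalMax)
{
    extern __shared__ int smax[];

    // Each thread scans its grid-stride slice, then the block reduces in shared memory.
    int v = INT_MIN;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += gridDim.x * blockDim.x)
        v = max(v, in[i]);
    smax[threadIdx.x] = v;
    __syncthreads();

    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s)
            smax[threadIdx.x] = max(smax[threadIdx.x], smax[threadIdx.x + s]);
        __syncthreads();
    }

    // One atomic per block: negligible cost, and no second pass or ticket logic needed.
    if (threadIdx.x == 0)
        atomicMax(d_globalMax, smax[0]);
}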

The code then becomes very easy. There’s a slight complication if you’re finding the max of a float, though: atomic ops only work on integers, so you need a little bit-level transform to make the float values sortable as unsigned integers (a trick also used by the radix sort).
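
One common version of that transform looks like the sketch below. It assumes IEEE-754 data with no NaNs, and floatFlip/floatUnflip are illustrative names, not CUDA built-ins: positive floats get their sign bit set, negative floats have all bits inverted, so unsigned comparison then matches float ordering.

__device__ unsigned int floatFlip(float f)
{
    unsigned int u = (unsigned int)__float_as_int(f);
    // Negative floats: flip all bits. Positive floats (and +0): set the sign bit.
    return (u & 0x80000000u) ? ~u : (u | 0x80000000u);
}

__device__ float floatUnflip(unsigned int u)
{
    // Undo the mapping: top bit set means the original was positive.
    unsigned int v = (u & 0x80000000u) ? (u & 0x7FFFFFFFu) : ~u;
    return __int_as_float((int)v);
}

// Inside the kernel, with d_flippedMax an unsigned int* initialised to 0 on the host
// (0 sits below every flipped value of a real float):
//     if (threadIdx.x == 0)
//         atomicMax(d_flippedMax, floatFlip(blockLocalMax));

After the kernel you copy the unsigned int back and undo the transform, either in a tiny device kernel with floatUnflip or by repeating the same bit manipulation on the host (reinterpreting the bits with memcpy instead of __int_as_float).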