The specific example which covers almost exactly what you’re asking for is available in the CUDA SDK. It’s the threadFenceReduction example.
From its comments:
// This reduction kernel reduces an arbitrary size array in a single kernel invocation
// It does so by keeping track of how many blocks have finished. After each thread
// block completes the reduction of its own block of data, it "takes a ticket" by
// atomically incrementing a global counter. If the ticket value is equal to the number
// of thread blocks, then the block holding the ticket knows that it is the last block
// to finish. This last block is responsible for summing the results of all the other
// blocks.
// In order for this to work, we must be sure that before a block takes a ticket, all
// of its memory transactions have completed. This is what __threadfence() does -- it
// blocks until the results of all outstanding memory transactions within the
// calling thread are visible to all other threads.
// For more details on the reduction algorithm (notably the multi-pass approach), see
// the "reduction" sample in the CUDA SDK.
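The "last block" pattern those comments describe can be sketched roughly like this (illustrative only, not the SDK code; the per-block reduction itself is elided, and `ticket`/`partial` are names I've made up):

```cuda
__device__ unsigned int ticket = 0;   // global retirement counter

__global__ void reduceFinal(float *partial, float *out, unsigned int nBlocks)
{
    __shared__ bool amLast;

    // ... each block reduces its slice and writes the result to
    //     partial[blockIdx.x] here ...

    if (threadIdx.x == 0) {
        __threadfence();              // ensure our write to partial[] is
                                      // visible to every other block
        unsigned int t = atomicInc(&ticket, nBlocks);  // take a ticket
        amLast = (t == nBlocks - 1);  // are we the last block to finish?
    }
    __syncthreads();

    if (amLast && threadIdx.x == 0) { // last block sums all partial results
        float s = 0.0f;
        for (unsigned int i = 0; i < nBlocks; ++i)
            s += partial[i];
        *out = s;
        ticket = 0;                   // reset the counter for the next launch
    }
}
```

Note that the `__threadfence()` must come *before* taking the ticket; otherwise another block could see the incremented counter while our partial result is still in flight.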
But also note: if you're just trying to accumulate a maximum over blocks, you can use a one-line atomicMax. Normally atomics are expensive and should be used sparingly, but the overhead of a single atomic per block is negligible.
The code then becomes very easy. There's a slight complication if you're finding the max of floats, though: atomicMax only works on integers, so you need a small bit-level transform to make float bit patterns sortable as integers (the same trick used by radix sort).