The specific example which covers almost exactly what you’re asking for is available in the CUDA SDK. It’s the threadFenceReduction example.
From its comments:
// This reduction kernel reduces an arbitrary size array in a single kernel invocation
// It does so by keeping track of how many blocks have finished. After each thread
// block completes the reduction of its own block of data, it "takes a ticket" by
// atomically incrementing a global counter. If the ticket value is equal to the number
// of thread blocks, then the block holding the ticket knows that it is the last block
// to finish. This last block is responsible for summing the results of all the other
// blocks.
// In order for this to work, we must be sure that before a block takes a ticket, all
// of its memory transactions have completed. This is what __threadfence() does -- it
// blocks until the results of all outstanding memory transactions within the
// calling thread are visible to all other threads.
// For more details on the reduction algorithm (notably the multi-pass approach), see
// the "reduction" sample in the CUDA SDK.
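The "last block" pattern those comments describe can be sketched roughly like this (illustrative only, not the SDK code; the per-block reduction itself is elided, and `ticket`/`partial` are names I've made up):

```cuda
__device__ unsigned int ticket = 0;   // global retirement counter

__global__ void reduceFinal(float *partial, float *out, unsigned int nBlocks)
{
    __shared__ bool amLast;

    // ... each block reduces its slice and writes the result to
    //     partial[blockIdx.x] here ...

    if (threadIdx.x == 0) {
        __threadfence();              // ensure our write to partial[] is
                                      // visible to every other block
        unsigned int t = atomicInc(&ticket, nBlocks);  // take a ticket
        amLast = (t == nBlocks - 1);  // are we the last block to finish?
    }
    __syncthreads();

    if (amLast && threadIdx.x == 0) { // last block sums all partial results
        float s = 0.0f;
        for (unsigned int i = 0; i < nBlocks; ++i)
            s += partial[i];
        *out = s;
        ticket = 0;                   // reset the counter for the next launch
    }
}
```

Note that the `__threadfence()` must come *before* taking the ticket; otherwise another block could see the incremented counter while our partial result is still in flight.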
But also note: if you're just trying to accumulate a maximum over blocks, you can use a one-line atomicMax. Normally atomics are expensive and should be used sparingly, but the overhead of a single atomic per block is negligible.
The code then becomes very easy. There's a slight complication if you're finding the max of floats, though: atomicMax only works on integers, so you need a small bit-level transform to make float bit patterns sortable as integers (the same trick used by radix sort).