Synchronization among blocks

Hello all,

I tried to implement the scan primitive (prefix sum) for arrays of arbitrary size.
I implemented it in a single kernel: each block does a prefix sum over its allocated segment, then block 0 computes the block sums using the last element of each segment.

To synchronize blocks, I used “__syncthreads()”, but I realized that it does not work.
My intuition is that “__syncthreads()” only guarantees synchronization among the threads within a single block.
Am I right?

Then, is there any way to synchronize blocks other than launching another kernel?


No, there is not.
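The supported way to get a global barrier is the kernel boundary itself: end the kernel and launch the next pass as a separate kernel. A minimal sketch of the two-pass scan along those lines (the kernel names, the fixed block size, and the naive Hillis–Steele block scan below are illustrative assumptions, not anyone's production code):

```cuda
#define BLOCK 256

// Pass 1: each block scans its own segment in shared memory and writes
// its segment total to blockSums[blockIdx.x].
__global__ void scanSegments(const int *in, int *out, int *blockSums, int n)
{
    __shared__ int s[BLOCK];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    s[threadIdx.x] = (i < n) ? in[i] : 0;
    __syncthreads();

    // Naive inclusive scan within the block (Hillis-Steele).
    for (int off = 1; off < blockDim.x; off *= 2) {
        int v = (threadIdx.x >= off) ? s[threadIdx.x - off] : 0;
        __syncthreads();
        s[threadIdx.x] += v;
        __syncthreads();
    }
    if (i < n) out[i] = s[threadIdx.x];
    if (threadIdx.x == blockDim.x - 1)
        blockSums[blockIdx.x] = s[threadIdx.x];
}

// Pass 2: launched only after pass 1 has finished, so every blockSums
// entry is visible -- the kernel boundary is the inter-block barrier.
__global__ void addBlockOffsets(int *out, const int *scannedBlockSums, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && blockIdx.x > 0)
        out[i] += scannedBlockSums[blockIdx.x - 1];
}
```

Between the two launches the blockSums array itself needs to be scanned (e.g. by a single-block launch of the same routine); the host-side ordering of the launches provides exactly the synchronization that no in-kernel construct can.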

If one can make sure that “all” the blocks run together (i.e. active blocks == total blocks), can’t one make use of the “atomic primitives” to implement a barrier or something? It may be possible… but is it supported by NV?

That’s a bold assumption, which is usually not true. But if you do satisfy that assumption, I believe you can write some inter-block barrier.

There is a paper, “Inter-Block GPU Communication via Fast Barrier Synchronization”, that discusses this type of implementation.
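Under that (unsupported) all-blocks-resident assumption, an atomic inter-block barrier can be sketched roughly as below. This follows the general shape of the barrier in that paper, but the variable names and details here are our own illustration; it deadlocks the moment even one block is not co-scheduled:

```cuda
// Sketch only: valid ONLY if every block of the grid is resident on the
// GPU at once. g_arrived counts blocks at the barrier; g_release is a
// generation counter so the barrier can be reused.
__device__ volatile int g_arrived = 0;
__device__ volatile int g_release = 0;

__device__ void globalBarrier(int numBlocks)
{
    __syncthreads();                 // all threads of this block arrive
    if (threadIdx.x == 0) {
        __threadfence();             // make this block's writes visible
        int gen = g_release;
        // Last block to arrive resets the count and releases everyone.
        if (atomicAdd((int *)&g_arrived, 1) == numBlocks - 1) {
            g_arrived = 0;
            g_release = gen + 1;
        } else {
            while (g_release == gen) // spin: hangs forever if some block
                ;                    // has not been scheduled yet
        }
    }
    __syncthreads();                 // rest of the block waits on thread 0
}
```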

Yeah… That’s bold… after all, I am a very bold man ;-)

I am interested to know whether NVIDIA will support it in the future… This type of coding assumes “a thing or two” about the way the hardware schedules these blocks… So, I was wondering what NV says about it…

From the CUDA programming manual:

“Thread blocks are required to execute independently: It must be possible to execute them in any order, in parallel or in series. This independence requirement allows thread blocks to be scheduled in any order across any number of cores as illustrated by Figure 1-4, enabling programmers to write code that scales with the number of cores.”

In other words, every block must be able to operate independently at any point during execution–that is, your kernel must remain valid for every scheduling even if all blocks run concurrently until some arbitrary point during execution, at which point all blocks are serialized in an undefined order. For example, if you write a kernel where there exists one point where block 1 must run before block 2 to prevent deadlock (say when block 1 acquires a lock that block 2 also needs), that kernel is not guaranteed to work because we make no guarantees whatsoever that block 1 will run before block 2 at any arbitrary point. We may context switch to block 2 and just stay there. This doesn’t mean that we do schedule blocks this way, just that it is not guaranteed that we don’t.
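Concretely, the problematic cross-block dependence reads something like this minimal sketch (our illustration, not anyone's real code):

```cuda
__device__ volatile int g_flag = 0;

__global__ void badKernel(void)
{
    if (blockIdx.x == 0) {
        // ... produce a result ...
        if (threadIdx.x == 0)
            g_flag = 1;              // signal the other blocks
    } else {
        // Hangs forever if the scheduler parks us here while block 0
        // has not yet run -- nothing guarantees it ever will.
        if (threadIdx.x == 0)
            while (g_flag == 0)
                ;
        __syncthreads();
        // ... consume block 0's result ...
    }
}
```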

If you release code like this, it will break in weird and unexpected ways.


Thanks! That makes a lotta sense… So, global block barriers et al. don’t make any “real” sense… Thanks,

So, can we say that efforts like the paper above are not really for “production environments”?

An excerpt:
5. Proposed GPU Synchronization

Since in the CUDA programming model the execution of a thread block is non-preemptive, care must be taken to avoid deadlocks in GPU synchronization design. Consider a scenario where multiple thread blocks are mapped to one SM and the active block is waiting for the completion of a global barrier. A deadlock will occur in this case because the unscheduled thread blocks will not be able to reach the barrier without preemption. Our solution to this problem is to have a one-to-one mapping between thread blocks and SMs. In other words, for a GPU with ‘Y’ SMs, we ensure that at most ‘Y’ blocks are used in the kernel. In addition, we allocate all available shared memory on an SM to each block so that no two blocks can be scheduled to the same SM because of the memory constraint.

In the following discussion, we will present three alternative GPU synchronization designs: GPU simple synchronization, GPU tree-based synchronization, and GPU lock-free synchronization. The first two are lock-based designs that make use of mutex variables and CUDA atomic operations. The third design uses a lock-free algorithm that avoids the use of expensive CUDA atomic operations.
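Reading the excerpt, the “lock-free” design can be sketched roughly like this (the names and polling scheme are our paraphrase of the paper's description, not its actual code; it still requires every block to be resident, one flag per block, no atomics):

```cuda
// Block i announces arrival in arrayIn[i]; block 0 polls all arrival
// flags, then releases every block through arrayOut[i]. `goal` should
// change on each barrier invocation so stale flags are never mistaken
// for fresh ones.
__device__ void lockFreeBarrier(volatile int *arrayIn,
                                volatile int *arrayOut,
                                int goal)
{
    int bid = blockIdx.x;
    __syncthreads();
    if (threadIdx.x == 0) {
        __threadfence();             // publish this block's prior writes
        arrayIn[bid] = goal;         // announce arrival
    }
    if (bid == 0) {
        // Block 0's threads poll the arrival flags of all blocks.
        for (int i = threadIdx.x; i < gridDim.x; i += blockDim.x)
            while (arrayIn[i] != goal)
                ;
        __syncthreads();
        for (int i = threadIdx.x; i < gridDim.x; i += blockDim.x)
            arrayOut[i] = goal;      // release
    }
    if (threadIdx.x == 0)
        while (arrayOut[bid] != goal)
            ;                        // wait for release by block 0
    __syncthreads();
}
```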

Correct. While I admire them for their ingenuity, I wouldn’t deploy such a thing.


Thanks a lot!

I received a review comment on my paper that “blocks can be synchronized”… That’s why I wanted to know what the official stand is… Thanks a lot for getting back,

Best Regards,