I tried to implement the scan primitive (prefix sum) for arrays of arbitrary size.
I implemented it in a single kernel: each block computes a prefix sum over its assigned segment, then block 0 scans the block sums using the last element of each segment.
To synchronize the blocks, I used “__syncthreads()”, but I realized that it does not work.
My intuition is that “__syncthreads()” only guarantees synchronization among the threads within a block.
Am I right?
If so, is there any way to synchronize blocks other than launching separate kernels?
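For reference, the usual workaround is exactly that: use separate kernel launches as the global barrier. A rough host-side sketch of the three-pass scan described above (the kernel names `scan_blocks`, `scan_sums`, and `add_offsets` are illustrative placeholders, not from any post here):

```cuda
// Multi-kernel scan over an array of arbitrary size.
//   scan_blocks: each block scans its own segment and writes its segment total
//   scan_sums:   a single block scans the array of segment totals
//   add_offsets: each block adds its segment's scanned offset to every element
int numBlocks = (n + BLOCK_SIZE - 1) / BLOCK_SIZE;
scan_blocks<<<numBlocks, BLOCK_SIZE>>>(d_in, d_out, d_sums, n);
scan_sums  <<<1,         BLOCK_SIZE>>>(d_sums, numBlocks);
add_offsets<<<numBlocks, BLOCK_SIZE>>>(d_out, d_sums, n);
```

Kernels launched on the same stream execute in order, so each launch boundary acts as a device-wide barrier: every block of `scan_blocks` has finished before `scan_sums` begins.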
Tim,
If one can make sure that “all” the blocks run together (i.e. active blocks == total blocks), couldn’t one make use of atomic primitives to implement a barrier or something? It may be possible… but is it supported by NVIDIA?
Thanks,
Yeah… That’s bold… after all, I am a very bold man ;-)
I am interested to know whether NVIDIA will support it in the future… This type of coding assumes “a thing or two” about the way the hardware schedules blocks… So, I was wondering what NVIDIA says about it…
“Thread blocks are required to execute independently: It must be possible to execute them in any order, in parallel or in series. This independence requirement allows thread blocks to be scheduled in any order across any number of cores as illustrated by Figure 1-4, enabling programmers to write code that scales with the number of cores.”
In other words, every block must be able to operate independently at any point during execution. Your kernel must remain valid under every possible schedule, including one where all blocks run concurrently up to some arbitrary point and are then serialized in an undefined order. For example, suppose you write a kernel with one point where block 1 must run before block 2 to prevent deadlock (say, block 1 holds a lock that block 2 also needs). That kernel is not guaranteed to work, because we make no guarantee whatsoever that block 1 will run before block 2 at any given point; we may context switch to block 2 and just stay there. This doesn’t mean that we do schedule blocks this way, just that it is not guaranteed that we don’t.
If you release code like this, it will break in weird and unexpected ways.
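To make the failure mode concrete, here is a sketch of the kind of naive inter-block barrier being warned against (the names `g_arrived` and `naive_barrier` are illustrative, not from any post in this thread):

```cuda
// Naive global barrier: every block increments a counter and spins until
// all blocks have arrived. If more blocks are launched than can be resident
// at once, the resident blocks spin forever waiting for blocks the
// scheduler will never start, because blocks are non-preemptive: deadlock.
__device__ volatile int g_arrived = 0;

__device__ void naive_barrier(int numBlocks)
{
    __syncthreads();
    if (threadIdx.x == 0)
        atomicAdd((int *)&g_arrived, 1);
    // Unsafe spin-wait: relies on all blocks being co-resident,
    // which CUDA does not guarantee.
    while (g_arrived < numBlocks)
        ;
    __syncthreads();
}
```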
An excerpt:
"
5. Proposed GPU Synchronization
Since in the CUDA programming model the execution of a thread block is non-preemptive, care must be taken to avoid deadlocks in GPU synchronization design. Consider a scenario where multiple thread blocks are mapped to one SM and the active block is waiting for the completion of a global barrier. A deadlock will occur in this case because the unscheduled thread blocks will not be able to reach the barrier without preemption. Our solution to this problem is to have a one-to-one mapping between thread blocks and SMs. In other words, for a GPU with ‘Y’ SMs, we ensure that at most ‘Y’ blocks are used in the kernel. In addition, we allocate all available shared memory on an SM to each block so that no two blocks can be scheduled to the same SM because of the memory constraint.
In the following discussion, we will present three alternative GPU synchronization designs: GPU simple synchronization, GPU tree-based synchronization, and GPU lock-free synchronization. The first two are lock-based designs that make use of mutex variables and CUDA atomic operations. The third design uses a lock-free algorithm that avoids the use of expensive CUDA atomic operations.
"
I received a review comment on my paper saying that “blocks can be synchronized”… That’s why I wanted to know what the official stance is… Thanks a lot for getting back,