Use explicit synchronization if you want your code to be safe (i.e. portable across future HW and
future driver releases).
At the programming model level, you must use explicit synchronization to be safe.
Programs that perform inter-thread communication without explicit synchronization have
undefined behavior because they contain data races.
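As a concrete sketch of what such a race looks like (hypothetical kernel and variable names, for illustration only):

```cuda
// Sketch of a racy kernel: thread 0 tries to publish a value through
// shared memory and thread 1 spin-waits on a flag, with no explicit
// synchronization. This is a data race, so the behavior is undefined:
// the compiler or hardware may reorder the two stores, cache `flag`
// in a register, or never schedule the writer before the reader spins.
__global__ void racy_kernel(int *out)
{
    __shared__ int flag;
    __shared__ int data;

    if (threadIdx.x == 0) {
        data = 42;   // intended to be published...
        flag = 1;    // ...before this store becomes visible, but
                     // nothing orders the two writes for other threads
    }

    if (threadIdx.x == 1) {
        while (flag != 1) { }   // may spin forever, or observe stale `data`
        *out = data;
    }
}
```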
CUDA defines a multi-threaded programming model where programming model threads are
mapped onto the SM datapaths by a combination of hardware, compiler, and system software
schedulers. A goal of these schedulers is to maintain convergence (because it generally improves
performance), but in general, there are multiple possible schedules with different performance
tradeoffs, and in some cases different schedules will result in different convergence behavior.
It is impossible to describe the exact situations under which convergence will occur without describing
the implementation of the hardware, compiler, and system software schedulers. Furthermore, the
compiler and system software schedulers are subject to change between driver releases. Clearly,
the hardware schedulers are subject to change between architectures.
Currently in CUDA, you have access to __syncthreads(), which gives you barrier synchronization
(and the implied program and memory ordering) among threads in a thread block, but no
finer-grained synchronization primitive.
It is true that __syncthreads isn’t really the best tool to perform fine-grained synchronization
(e.g. among a group of 32 threads in a large thread block), and this
becomes more apparent in parallel algorithms that require hierarchical synchronization (e.g.
reductions or prefix sums).
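The reduction case illustrates the mismatch. In a sketch like the following (assuming a 256-thread block; names are illustrative), every stage of the tree must pay for a block-wide barrier, even the final stages where only a handful of threads are still doing useful work:

```cuda
// Block-wide tree reduction in shared memory. Every stage, including the
// final warp-sized stages, must use the block-wide __syncthreads() barrier,
// because no finer-grained explicit barrier is available: all 256 threads
// synchronize even when only 32 or fewer are still contributing.
__global__ void block_reduce(const float *in, float *out)
{
    __shared__ float sdata[256];
    unsigned tid = threadIdx.x;

    sdata[tid] = in[blockIdx.x * blockDim.x + tid];
    __syncthreads();

    for (unsigned s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s)
            sdata[tid] += sdata[tid + s];
        __syncthreads();   // block-wide barrier, even for the last 32 threads
    }

    if (tid == 0)
        out[blockIdx.x] = sdata[0];
}
```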
It is also the case that the convergence optimizations are often effective, giving the appearance
of an implicit barrier among the threads in a warp between successive instructions on some GPUs.
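This appearance is what so-called "warp-synchronous" code relies on. A sketch of the pattern (illustrative, and deliberately labeled unsafe):

```cuda
// "Warp-synchronous" final reduction stages that omit the barrier,
// relying on the *appearance* of lockstep execution within a warp.
// This may happen to work on some GPUs with some compilers, but nothing
// in the programming model guarantees it: the code has a data race, and
// a schedule that breaks warp convergence breaks the result.
__device__ void unsafe_warp_reduce(volatile float *sdata, unsigned tid)
{
    if (tid < 32) {
        sdata[tid] += sdata[tid + 32];  // no barrier between these steps:
        sdata[tid] += sdata[tid + 16];  // correctness depends entirely on
        sdata[tid] += sdata[tid +  8];  // the schedulers keeping the warp
        sdata[tid] += sdata[tid +  4];  // converged, which is not promised
        sdata[tid] += sdata[tid +  2];
        sdata[tid] += sdata[tid +  1];
    }
}
```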
The reason __syncthreads works is that the programmer asserts that all threads will reach it, and
the hardware, compiler, and runtime cooperate to enforce the barrier. Without that assertion, the system
cannot know that all threads will reach a given location in the program (see the Halting Problem), and therefore
it can only enforce convergence opportunistically. Furthermore, without explicit synchronization the system
does not know to enforce the synchronization order (control and memory ordering) among a group of threads.
There are numerous optimizations that break this order, and the system needs to know where it is safe to
apply them. The alternative is to turn them off all the time and penalize all applications.
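For contrast, here is a sketch of the safe pattern (illustrative names), where the explicit barrier gives the system exactly the assertion it needs:

```cuda
// The safe pattern: communicate through shared memory with an explicit
// barrier. Because the programmer asserts that every thread in the block
// reaches __syncthreads(), the system can enforce both the control
// ordering (no thread reads before all writes have happened) and the
// memory ordering (the writes are visible after the barrier).
__global__ void staged_shift(const int *in, int *out)
{
    __shared__ int stage[256];
    unsigned tid = threadIdx.x;

    stage[tid] = in[tid];   // every thread writes its own slot
    __syncthreads();        // asserted: all threads reach this point

    // Safe: the neighbor's write is ordered before this read.
    out[tid] = stage[(tid + 1) % blockDim.x];
}
```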
Finer-grained synchronization than __syncthreads would require a similar (explicit) mechanism to be safe.
If this is an important use-case for your application, I would encourage you to file an RFE on the
registered developer site for explicit fine-grained synchronization operations.