Is synchronization implied within warps?

Hello,

For simplicity, assume that the intended program/algorithm fits within a warp, such that the desired thread synchronization is mostly confined to a single warp.

If I am not mistaken, the SM schedules instructions per warp (SIMT).
So, if the code for the warp branches, such that threads diverge and later reconverge, does SIMT imply thread synchronization automatically, or must one still use explicit synchronization calls such as __syncthreads()?

A sum scan of an array of 32 elements (the warp size) over the warp would be the perfect example: all threads must finish to produce the final result, and they will do so at different times.
Should one place __syncthreads() calls in the code, or not?
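
To make the scenario concrete, here is a minimal sketch of what I have in mind (kernel and variable names are just illustrative), written conservatively with __syncthreads() around every step:

__global__ void scan32(const float *d_in, float *d_out)
{
    __shared__ float buf[32];
    int tid = threadIdx.x;

    buf[tid] = d_in[tid];
    __syncthreads();

    // Hillis-Steele inclusive sum scan: log2(32) = 5 steps
    for (int offset = 1; offset < 32; offset *= 2)
    {
        float val = buf[tid];
        if (tid >= offset)
            val += buf[tid - offset];
        __syncthreads();   // all reads of this step finish before any write
        buf[tid] = val;
        __syncthreads();   // all writes visible before the next step
    }

    d_out[tid] = buf[tid];
}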

Short answer:

Use explicit synchronization if you want your code to be safe (i.e. portable across future HW and
future driver releases).

Long answer:

At the programming model level, you must use explicit synchronization to be safe.
Programs with inter-thread communication without explicit synchronization have
undefined behavior because they have data races.
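
As a trivial illustration (names are made up): in the kernel below, thread 0 publishes a value in shared memory and the other threads read it with no barrier in between, so the reads race with the write.

__global__ void racy(int *d_out)
{
    __shared__ int value;

    if (threadIdx.x == 0)
        value = 42;

    // Missing __syncthreads() here: the reads below race with the write above.
    d_out[threadIdx.x] = value;   // undefined; a __syncthreads() before this line fixes it
}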

CUDA defines a multi-threaded programming model where programming model threads are
mapped onto the SM datapaths by a combination of hardware, compiler, and system software
schedulers. A goal of these schedulers is to maintain convergence (because it generally improves
performance), but in general, there are multiple possible schedules with different performance
tradeoffs, and in some cases different schedules will result in different convergence behavior.

It is impossible to describe the exact situations under which convergence will occur without describing
the implementation of the hardware, compiler, and system software schedulers. Furthermore, the
compiler and system software schedulers are subject to change between driver releases. Clearly,
the hardware schedulers are subject to change between architectures.

Currently in CUDA, you have access to __syncthreads(), which gives you barrier synchronization
(and the implied program and memory ordering) among threads in a thread block, but no finer
grained synchronization.

It is true that __syncthreads isn’t really the best tool to perform fine-grained synchronization
(e.g. among a group of 32 threads in a large thread block), and this
becomes more apparent in parallel algorithms that require hierarchical synchronization (e.g.
reductions or prefix sums).
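
For example, here is a sketch (illustrative names, not from any particular codebase) of a block-wide sum reduction written safely: every step of the tree is separated by a block-wide __syncthreads(), even the final steps in which only a single warp is doing useful work.

#define BLOCK_SIZE 256

__global__ void blockReduce(const float *d_in, float *d_out)
{
    __shared__ float s[BLOCK_SIZE];
    int tid = threadIdx.x;

    s[tid] = d_in[blockIdx.x * BLOCK_SIZE + tid];
    __syncthreads();

    // Tree reduction: each step halves the number of active threads,
    // but the barrier is always block-wide.
    for (int stride = BLOCK_SIZE / 2; stride > 0; stride >>= 1)
    {
        if (tid < stride)
            s[tid] += s[tid + stride];
        __syncthreads();
    }

    if (tid == 0)
        d_out[blockIdx.x] = s[0];
}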

It is also the case that the convergence optimizations are often effective, giving the appearance
of an implicit barrier among threads in a warp between successive instructions on some GPUs.
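
The classic pattern that relies on this appearance is the "warp-synchronous" unrolling of the last warp of a reduction, with volatile accesses and no barriers, sketched below. It has often worked in practice, but per the above it is not something the programming model guarantees, so treat it as a sketch of the risky pattern rather than a recommendation.

// Called with the first warp (tid < 32) on a shared array of at least 64 elements.
__device__ void warpReduceUnsafe(volatile float *s, int tid)
{
    // No __syncthreads() between steps: correctness depends entirely on
    // the 32 threads of the warp appearing to execute in lockstep.
    s[tid] += s[tid + 32];
    s[tid] += s[tid + 16];
    s[tid] += s[tid +  8];
    s[tid] += s[tid +  4];
    s[tid] += s[tid +  2];
    s[tid] += s[tid +  1];
}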

The reason __syncthreads works is that the programmer asserts that all threads will reach it, and
the hardware, compiler, and runtime cooperate to enforce the barrier. Without the assertion, the system
cannot know that all threads will reach a given location in the program (see the Halting Problem), and therefore
it can only enforce convergence opportunistically. Furthermore, without explicit synchronization the system
does not know to enforce the synchronization order (control and memory ordering) among a group of threads.
There are numerous optimizations that break this order and the system needs to know where it is safe to
apply them. The alternative is to turn them off all the time and penalize all applications.
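
To illustrate the assertion: __syncthreads() is only valid when every thread of the block reaches it. The first kernel below is fine; the second places the barrier inside a divergent branch, so the assertion is violated and the behavior is undefined (in practice it may hang). Kernel names are made up.

__global__ void barrierOk(float *data)
{
    if (threadIdx.x < 16)
        data[threadIdx.x] *= 2.0f;
    __syncthreads();       // every thread reaches this: OK
}

__global__ void barrierBroken(float *data)
{
    if (threadIdx.x < 16)
    {
        data[threadIdx.x] *= 2.0f;
        __syncthreads();   // only some threads reach this: undefined behavior
    }
}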

Finer-grained synchronization than __syncthreads would require a similar (explicit) mechanism to be safe.

If this is an important use-case for your application, I would encourage you to file an RFE on the
registered developer site for explicit fine-grained synchronization operations.

Noted, thanks

For now, I shall make do with __syncthreads()

Algorithm implementation is determined as much by the input data as by anything else.
In this particular variant, I am faced with an algorithm that computes for a relatively long time, but on a relatively small number of array elements; I am divided on whether to deem this unconventional or conventional (from a graphics point of view, likely unconventional; from a pure HPC point of view…?)
In such a case, to fully utilize the GPU, one might consider running multiple instances of the algorithm in parallel, rather than having as many threads as possible work on a single instance: the former can achieve far greater occupancy than the latter, and I believe this case generalizes to HPC overall (roughly as sketched below).
I suspect future compute capabilities would see an increase in the number of blocks allowed per SM, which would help
If only there were a __syncwarp()…; now that would truly open doors…
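
Roughly what I mean by running instances side by side, with made-up names: one warp-sized block per independent instance of the algorithm, so the GPU is filled by many instances rather than by throwing more threads at a single instance.

__global__ void manyInstances(const float *d_in, float *d_out, int numInstances)
{
    int instance = blockIdx.x;              // one instance per 32-thread block
    if (instance >= numInstances)
        return;

    const float *in  = d_in  + instance * 32;
    float       *out = d_out + instance * 32;

    // ... per-instance work on 32 elements goes here ...
    out[threadIdx.x] = in[threadIdx.x];     // placeholder body
}

// Host side: as many 32-thread blocks as there are instances
// manyInstances<<<numInstances, 32>>>(d_in, d_out, numInstances);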

Gregory:

I find this is easier said than done
Whenever the array's element count is slightly smaller than the warp size, I either have to pad the array to synchronize at the sub-function level, or can only synchronize at the function level.
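
The padding option, roughly (the helper name is made up): round the element count up to a multiple of the warp size and fill the tail with the identity of the operation (0 for a sum scan), so every thread has a valid element and the sub-function level code needs no special cases.

static int roundUpToWarp(int n)
{
    return (n + 31) & ~31;                  // next multiple of 32
}

// Host side, illustrative:
// int padded = roundUpToWarp(n);
// cudaMalloc(&d_in, padded * sizeof(float));
// cudaMemset(d_in, 0, padded * sizeof(float));               // identity in the tail
// cudaMemcpy(d_in, h_in, n * sizeof(float), cudaMemcpyHostToDevice);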

I have inferred from your reply that __syncthreads() is relied on by the compiler as much as by anything else
I tried to keep the functions as small as possible, and have intermediate buffers in functions requiring synchronization - like functions implementing scans - to further aid the compiler

This seems to work
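
For what it's worth, the shape I ended up with looks roughly like this (names are illustrative): a small __device__ helper that takes its intermediate buffer in shared memory explicitly and synchronizes internally, so it composes with other small functions.

// scratch must point to blockDim.x floats of shared memory
__device__ float inclusiveScan(float x, float *scratch)
{
    int tid = threadIdx.x;

    scratch[tid] = x;
    __syncthreads();

    for (int offset = 1; offset < blockDim.x; offset *= 2)
    {
        float val = scratch[tid];
        if (tid >= offset)
            val += scratch[tid - offset];
        __syncthreads();
        scratch[tid] = val;
        __syncthreads();
    }

    return scratch[tid];
}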