Is syncthreads required within a warp?

I understand that syncthreads is required when sharing data within a block via shared memory. However, is it still required if I make sure that the sharing is limited to within a warp? If I want to implement an equivalent of warp shuffle for older architectures, using shared memory, do I need to use syncthreads?

There are two parts to my question. One is when the same thread writes to and reads from shared memory. The other is when the writer and reader are different threads but belong to the same warp.

Any clarity regarding this will be appreciated.

If there is no divergence in the WARP, all threads of the WARP will execute the same instruction at the same time, so you don’t need to synchronize at WARP level.

If two or more threads are writing to the same memory address, then you’ll get undefined behavior, no matter whether or not they belong to the same WARP.

Be careful with that, because it is quite possible that WARPSIZE != 32 in future generations of Nvidia graphics cards.
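One way to hedge against that is to use the built-in warpSize device variable instead of hardcoding 32. A minimal sketch (the helper names are just for illustration):

```cuda
// Illustrative helpers (names are made up); warpSize is the built-in
// device variable, so nothing here assumes a warp of exactly 32 threads.
__device__ int lane_id() { return threadIdx.x % warpSize; }
__device__ int warp_id() { return threadIdx.x / warpSize; }
```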

Thanks a lot for the answer. To confirm my understanding, you are saying that if the writer to shared memory and the subsequent reader both belong to the same warp, there is no need for syncthreads after the write. So it is possible to design an equivalent of warp shuffle without the use of syncthreads.

Exactly. Be sure each warp of the block is writing to its own portion of shared memory ;)
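As a rough sketch of what I mean (all names here are illustrative, and it assumes a 256-thread block), each warp reads and writes only its own 32-element slice of the buffer, so no __syncthreads() appears anywhere:

```cuda
// Sketch of a shared-memory stand-in for __shfl_up() on older GPUs.
// Each warp touches only its own 32-element slice; the volatile
// qualifier keeps the compiler from caching the slice in registers.
#define WARPS_PER_BLOCK 8   // assumes a 256-thread block

__device__ int shfl_up_shared(volatile int *slice, int value, int delta, int lane)
{
    slice[lane] = value;                    // every lane publishes its value
    int src = lane - delta;
    return (src >= 0) ? slice[src] : value; // read from the lower lane, if any
}

__global__ void shuffle_demo(const int *in, int *out)
{
    __shared__ int scratch[WARPS_PER_BLOCK * 32];
    int lane = threadIdx.x & 31;
    int warp = threadIdx.x >> 5;
    int gid  = blockIdx.x * blockDim.x + threadIdx.x;

    out[gid] = shfl_up_shared(&scratch[warp * 32], in[gid], 1, lane);
}
```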

In general, you should be very careful about relying on warp synchronous semantics of existing GPUs and omitting synchronization (__syncthreads).

Programs that communicate between threads without synchronization (whether they are in the same warp or not) have race conditions. The way that threads are currently scheduled onto warps and the way that memory operations from the same warp are executed is somewhat deterministic and much of the time you won’t see any issues, but the process is affected by multiple factors including driver and compiler optimizations.

You should not assume that applications with races that happen to work on one GPU will continue working between different architectures or different driver versions. I have also seen cases where seemingly minor changes to unrelated parts of code will lead to different compiler optimizations being applied that expose the race conditions.

I’m curious about why you are considering doing this. Is syncthreads too slow? Do you want to perform synchronization on a warp granularity, and it is difficult to do that with syncthreads?

Gregory,

Like you guessed, I am looking for synchronization at warp-level granularity. Basically, an equivalent to warp shuffle (which also has warp-level scope) for older architectures. Since syncthreads synchronizes across the entire block, it is a needless overhead, especially when occurring inside a loop that runs hundreds of times.
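To give a concrete picture, this is roughly the kind of warp-level helper I have in mind (the names and the scan pattern are only illustrative); the surrounding loop would call it hundreds of times with no __syncthreads() in the loop body:

```cuda
// Warp-level inclusive scan through a volatile shared-memory slice.
// slice points at this warp's own 32-element region; lane is the
// thread's index within the warp. No block-wide barrier is used.
__device__ int warp_inclusive_scan(volatile int *slice, int value, int lane)
{
    slice[lane] = value;
    for (int offset = 1; offset < 32; offset <<= 1) {
        if (lane >= offset)
            value += slice[lane - offset];  // read the lower lane's partial sum
        slice[lane] = value;                // publish the updated partial sum
    }
    return value;
}
```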

A CUDA thread fence applied to shared memory has the same effect, except that it does not do the sync. This is a safe option, and the overhead is maybe not so large when it is done on shared memory.
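A sketch of what I mean, assuming the fence in question is __threadfence_block() and a 256-thread block; the fence only orders this thread's writes as seen by the rest of the block, it does not make any thread wait, so the exchange still relies on the warp running in lockstep:

```cuda
__global__ void fence_exchange(const int *in, int *out)
{
    __shared__ int buf[256];
    int gid = blockIdx.x * blockDim.x + threadIdx.x;

    buf[threadIdx.x] = in[gid];        // publish my value
    __threadfence_block();             // order the store; no waiting happens here
    out[gid] = buf[threadIdx.x ^ 1];   // read the neighbouring lane's value
}
```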

Implementing a warp shuffle equivalent in shared memory works perfectly for all current architectures. I use it all the time.

Warp synchronous programming is also safe across all current architectures.

Recent documentation hints that the warp synchronous guarantee might disappear in the future.

When using warp-synchronous techniques, keep in mind that there is no notion of “warp” at the CUDA source code level. Warp is a runtime concept, while the compiler deals strictly with compile-time constructs. The compiler’s view of the world is thus single-thread, except for those cases where it can prove that something is uniform across an entire thread block.

Ignoring the above background is a frequent source of broken warp-synchronous reduction code, for example. I have lost count of how many times I have commented on such malfunctioning reductions in the past. I also spent two weeks once tracking down such an issue in a large third-party project.

Without __syncthreads() and in the absence of direct or potential data dependencies (in a single-thread world view!), the compiler is free to re-order loads from shared memory and keep load data in registers. The desire to schedule loads early in an aggressive fashion to help cover load latency in more recent GPU architectures is often what breaks incorrectly constructed warp-synchronous reductions that “worked” on older GPU architectures.
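For illustration, this is the kind of construct I am referring to (a sketch only; sdata is assumed to hold at least 64 partial sums and the function to be called by the first warp only). Without the volatile qualifier, the compiler may hoist these loads ahead of the neighbouring lanes' stores or satisfy them from registers, silently breaking the reduction:

```cuda
// Warp-synchronous tail of a block reduction, WITHOUT volatile.
// Called as: if (tid < 32) warp_reduce_broken(sdata, tid);
__device__ void warp_reduce_broken(int *sdata, int tid)   // note: no volatile
{
    sdata[tid] += sdata[tid + 32];   // these loads may be scheduled early...
    sdata[tid] += sdata[tid + 16];   // ...or satisfied from stale registers
    sdata[tid] += sdata[tid +  8];
    sdata[tid] += sdata[tid +  4];
    sdata[tid] += sdata[tid +  2];
    sdata[tid] += sdata[tid +  1];
}
```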

I will re-iterate my standard advice about warp-synchronous programming (with apologies to whoever coined the idiom with respect to optimizations; Knuth?):

(1) Don’t do it
(2) [Experts only] Don’t do it yet

Respectfully, if warp-synchronous programming weren’t safe then I think you would find that a large number of kernels would be failing today.

I suspect many CUDA developers aren’t even aware they’re depending on lane coherency. They have dutifully followed the advice found in the various CUDA programming guides and qualified their shared memory with the volatile keyword and things “just work”.
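For example, the classic volatile-qualified warp reduction, a sketch along the lines of that guidance (it assumes sdata holds at least 64 partial sums and that only the first warp calls it):

```cuda
// Warp-synchronous tail of a block reduction, WITH volatile.
// Called as: if (tid < 32) warp_reduce(sdata, tid);
// volatile forces every load and store to be issued, so each lane
// sees its neighbours' freshly written partial sums.
__device__ void warp_reduce(volatile int *sdata, int tid)
{
    sdata[tid] += sdata[tid + 32];
    sdata[tid] += sdata[tid + 16];
    sdata[tid] += sdata[tid +  8];
    sdata[tid] += sdata[tid +  4];
    sdata[tid] += sdata[tid +  2];
    sdata[tid] += sdata[tid +  1];
}
```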

I see no benefit in declining to declare exactly which parts of a programming model are concrete and which are idiomatic. Promoting a vague programming model (“it’s dangerous”) doesn’t benefit strong reasoning about programs, or debugging those programs.

Generations of CUDA docs have been very clear in declaring that shared loads can be optimized by the compiler… and how use of the volatile qualifier halts shared load/store optimizations.

The docs are also clear that warp-synchronous programming is a valid approach. I take the updated warnings in the Kepler Tuning Guide as hints that should be internalized and to get ready to stop using this part of the sm_10-sm_35 programming model.

You have no argument from me that warp-synchronous programming can be subtle but it’s a building block that we probably have all been relying on whether we’re ninjas or tyros.

Can’t wait to see that next generation architecture…

Thanks for the feedback. Do you want synchronization that is specifically at the “warp” granularity, or do you just want some level of finer granularity than a CTA?