Respectfully, if warp-synchronous programming weren’t safe then I think you would find that a large number of kernels would be failing today.
I suspect many CUDA developers aren’t even aware they’re depending on lane coherency. They have dutifully followed the advice found in the various CUDA programming guides and qualified their shared memory with the volatile keyword and things “just work”.
I see no benefit in refusing to state exactly which parts of a programming model are concrete and which are idiomatic. Promoting a vague programming model (“it’s dangerous”) doesn’t help anyone reason rigorously about programs, or debug them.
Generations of CUDA docs have been very clear that the compiler may optimize shared memory loads… and that the volatile qualifier halts those shared load/store optimizations.
The docs are also clear that warp-synchronous programming is a valid approach. I take the updated warnings in the Kepler Tuning Guide as hints to internalize, and as a signal to get ready to stop using this part of the sm_10-sm_35 programming model.
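For anyone who hasn't seen the idiom spelled out, here is a minimal sketch of the kind of code I mean: a single warp summing 32 values through volatile shared memory with no __syncthreads() anywhere. The names warp_sum_legacy and run_warp_sum are my own, purely for illustration. It leans on the implicit lockstep execution of the sm_10-sm_35 generation, so treat it as a picture of the legacy pattern rather than something to copy into new code; newer architectures and toolkits expect explicit warp synchronization (e.g. __syncwarp()) instead.

```cuda
#include <cuda_runtime.h>

// Legacy warp-synchronous sum, launched with exactly one warp (32 threads).
// ASSUMPTION: all 32 lanes execute in lockstep, as on sm_10-sm_35 hardware.
// There is deliberately no __syncthreads() between the steps.
__global__ void warp_sum_legacy(const int *in, int *out)
{
    // volatile makes every access below a real shared load/store, so the
    // compiler cannot keep partial sums in registers between steps.
    __shared__ volatile int smem[64];

    const unsigned lane = threadIdx.x;   // 0..31
    smem[lane]      = in[lane];
    smem[lane + 32] = 0;                 // padding keeps lane + offset in bounds

    // Each step folds the upper half onto the lower half. Only smem[0] is
    // meaningful at the end; the higher lanes hold leftover partial values.
    for (unsigned offset = 16; offset > 0; offset >>= 1)
        smem[lane] += smem[lane + offset];

    if (lane == 0)
        *out = smem[0];
}

// Hypothetical host helper: sums 0..31 on the device.
// Returns the result, or -1 if a device allocation fails.
int run_warp_sum()
{
    int h_in[32];
    int h_out = -1;
    for (int i = 0; i < 32; ++i)
        h_in[i] = i;                     // 0 + 1 + ... + 31 = 496

    int *d_in = nullptr;
    int *d_out = nullptr;
    if (cudaMalloc(&d_in, sizeof(h_in)) != cudaSuccess)
        return -1;
    if (cudaMalloc(&d_out, sizeof(int)) != cudaSuccess) {
        cudaFree(d_in);
        return -1;
    }

    cudaMemcpy(d_in, h_in, sizeof(h_in), cudaMemcpyHostToDevice);
    warp_sum_legacy<<<1, 32>>>(d_in, d_out);
    // This blocking copy also waits for the kernel to finish.
    cudaMemcpy(&h_out, d_out, sizeof(int), cudaMemcpyDeviceToHost);

    cudaFree(d_in);
    cudaFree(d_out);
    return h_out;
}
```

Drop the volatile and the compiler is free to cache smem[lane] in a register across iterations, which is exactly the shared load/store optimization the docs warn about. Calling run_warp_sum() from a host main() should give the sum of 0..31 when the lockstep assumption holds.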
You’ll get no argument from me that warp-synchronous programming can be subtle, but it’s a building block that we have probably all been relying on, whether we’re ninjas or tyros.
Can’t wait to see that next generation architecture…