Suppose thread A writes something to shared memory, and thread B wants to read it. If threads A and B are in different warps, a __syncthreads() is obviously needed, since there is no telling where in the code each warp will be relative to the other. The question, then, assumes that A and B are in the same warp: can threads within a single warp communicate through shared memory without needing __syncthreads()?
Let’s make it even simpler and assume there is no possibility of divergence between the write and the read, which avoids any question about the order in which the two sides of an if/else block get executed. What about in this case?
Basically, I want each warp to share data internally without dragging the entire block through a sync.
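To make the pattern concrete, here is a minimal sketch of the kind of kernel I have in mind. The kernel name, the sizes, and the rotate-by-one access are all made up for illustration. It assumes CUDA 9 or later and puts a warp-level __syncwarp() between the write and the read, at the spot where the block-wide __syncthreads() would otherwise go; whether even that barrier is required is the open question.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

constexpr int kWarpSize  = 32;   // assumed warp size (true on current NVIDIA GPUs)
constexpr int kBlockSize = 128;  // hypothetical block size: 4 warps

// Hypothetical kernel: each thread writes one value to shared memory,
// then reads the value written by the next lane of its own warp.
__global__ void rotateWithinWarp(const int* in, int* out)
{
    __shared__ int buf[kBlockSize];

    const int tid      = threadIdx.x;
    const int lane     = tid % kWarpSize;
    const int warpBase = tid - lane;             // first thread of this warp
    const int gid      = blockIdx.x * blockDim.x + tid;

    buf[tid] = in[gid];                          // thread "A" writes

    // Warp-level barrier (CUDA 9+): synchronizes only the threads of this
    // warp and orders their shared memory accesses. No block-wide sync.
    __syncwarp();

    // Thread "B" reads what a different lane of the same warp wrote.
    out[gid] = buf[warpBase + (lane + 1) % kWarpSize];
}

int main()
{
    int hIn[kBlockSize], hOut[kBlockSize];
    for (int i = 0; i < kBlockSize; ++i) hIn[i] = i;

    int *dIn = nullptr, *dOut = nullptr;
    cudaMalloc(&dIn,  sizeof(hIn));
    cudaMalloc(&dOut, sizeof(hOut));
    cudaMemcpy(dIn, hIn, sizeof(hIn), cudaMemcpyHostToDevice);

    rotateWithinWarp<<<1, kBlockSize>>>(dIn, dOut);
    cudaMemcpy(hOut, dOut, sizeof(hOut), cudaMemcpyDeviceToHost);

    // Each output should equal the input of the next lane in the same warp.
    bool ok = (cudaGetLastError() == cudaSuccess);
    for (int i = 0; i < kBlockSize && ok; ++i) {
        const int lane = i % kWarpSize;
        ok = (hOut[i] == hIn[i - lane + (lane + 1) % kWarpSize]);
    }
    printf("%s\n", ok ? "ok" : "mismatch");

    cudaFree(dIn);
    cudaFree(dOut);
    return ok ? 0 : 1;
}
```

No warp ever touches another warp's slice of buf, which is why a block-wide barrier seems like overkill here.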