A question about a book code example - Kogge-Stone scan

Here’s a code snippet from “Programming Massively Parallel Processors…”:

for (unsigned int stride = 1; stride < blockDim.x; stride *= 2) {
  if (threadIdx.x >= stride)
    XY[threadIdx.x] += XY[threadIdx.x - stride];
}

My question: the XY array is updated in place, so what guarantees that every thread reads XY[threadIdx.x - stride] before another thread overwrites that element in the same iteration?

My first guess was that a warp executes in SIMD fashion, so within a warp all reads would complete before any write.
But what about across warps?
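
To make the concern concrete, here is a sketch of a version in which the ordering is enforced explicitly instead of being assumed. It is not the book's code: the kernel name kogge_stone_scan, the SECTION_SIZE constant, the temp variable and the host driver are all my own assumptions for illustration. Each iteration reads into a register, waits at a barrier, and only then writes back, so no thread in the block can overwrite an element that another thread still has to read.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define SECTION_SIZE 256  // hypothetical block size; the launch must use exactly this many threads

// Sketch of an inclusive Kogge-Stone scan over one block (assumed name and signature).
__global__ void kogge_stone_scan(const float* X, float* Y, unsigned int N) {
    __shared__ float XY[SECTION_SIZE];
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    XY[threadIdx.x] = (i < N) ? X[i] : 0.0f;

    for (unsigned int stride = 1; stride < blockDim.x; stride *= 2) {
        __syncthreads();  // writes from the load or the previous iteration are visible
        float temp = 0.0f;
        if (threadIdx.x >= stride)
            temp = XY[threadIdx.x] + XY[threadIdx.x - stride];  // read phase
        __syncthreads();  // every thread in the block has finished reading
        if (threadIdx.x >= stride)
            XY[threadIdx.x] = temp;                             // write phase
    }
    // Each thread reads back only the element it wrote itself, so no extra barrier is needed.
    if (i < N) Y[i] = XY[threadIdx.x];
}

int main() {
    const unsigned int N = SECTION_SIZE;
    float hX[N], hY[N];
    for (unsigned int k = 0; k < N; ++k) hX[k] = 1.0f;  // scan of all ones gives 1, 2, 3, ...

    float *dX = nullptr, *dY = nullptr;
    cudaMalloc(&dX, N * sizeof(float));
    cudaMalloc(&dY, N * sizeof(float));
    cudaMemcpy(dX, hX, N * sizeof(float), cudaMemcpyHostToDevice);

    kogge_stone_scan<<<1, SECTION_SIZE>>>(dX, dY, N);
    cudaMemcpy(hY, dY, N * sizeof(float), cudaMemcpyDeviceToHost);

    printf("Y[0]=%.0f Y[%u]=%.0f\n", hY[0], N - 1, hY[N - 1]);  // prints "Y[0]=1 Y[255]=256"

    cudaFree(dX);
    cudaFree(dY);
    return 0;
}
```

Both barriers sit outside the if, so every thread in the block reaches them. As far as I understand, the same read/write split would also matter within a single warp on GPUs with independent thread scheduling (Volta and later), where lockstep execution is not guaranteed.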