Why is __syncwarp() necessary in an undivergent warp reduction?

I have read a technical blog post by an NVIDIA employee, linked below:
using-cuda-warp-level-primitives

I don’t understand why __syncwarp() is necessary in listing 8.

The article said:
“The CUDA programming model does not guarantee that all the reads will be performed before all the writes.”

I came up with a situation where warp-level sync is necessary, listed below:

//Assume only 32 threads in a block.
volatile __shared__ int data[32];
if (threadIdx.x % 2 == 0) {
      //Do something
} else {
      //Do something
}

data[threadIdx.x] = threadIdx.x;
int value = data[(threadIdx.x + 1) % 32];

//Write value to global memory

In the code above, since the hardware may not re-converge after the else branch, the threads in the same warp may not be executing the same instruction at the same time, so the final result is undefined.
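
As far as I understand, a __syncwarp() between the write and the read imposes the order that is missing; a minimal sketch of the same fragment, wrapped in a kernel just for illustration (the kernel name and the out parameter are mine, not from the blog):

//Assume only 32 threads in a block.
__global__ void exchange(int *out) {
      volatile __shared__ int data[32];
      if (threadIdx.x % 2 == 0) {
            //Do something
      } else {
            //Do something
      }

      data[threadIdx.x] = threadIdx.x;
      __syncwarp();   //all 32 threads must arrive here before any proceeds,
                      //so every write above completes before any read below
      int value = data[(threadIdx.x + 1) % 32];

      out[threadIdx.x] = value;   //write value to global memory
}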

But what if there is no warp-level divergence? Is it still necessary to add __syncwarp()? Assume the shared memory array is declared volatile.

//Assume only 32 threads in a block.
__global__ void kernel(int *out, int *in) {
      volatile __shared__ int data[64];
      data[threadIdx.x] = in[threadIdx.x];
      data[threadIdx.x + 32] = 0;   //zero the upper half so the +16 read below is defined
      //Do I need __syncwarp()?
      data[threadIdx.x] += data[threadIdx.x + 16];
      data[threadIdx.x] += data[threadIdx.x + 8];
      data[threadIdx.x] += data[threadIdx.x + 4];
      data[threadIdx.x] += data[threadIdx.x + 2];
      data[threadIdx.x] += data[threadIdx.x + 1];

      //data[0] should be the sum of in[0..31]
      out[threadIdx.x] = data[0];
}

If it’s still necessary, please tell me why some threads may not read data written by other threads.

How do you know that?

The Volta execution model is allowed to move threads forward in any order (even within a warp, even in the absence of conditional code), unless you impose order.

Could you give an example where warp-level divergence exists but there is no conditional code?

No, I can't. You're welcome to code as you wish, of course. But I can imagine a case where there is diverged warp activity on Volta where you may not expect it:

if (threadIdx.x < 1) statementA;
statementB;

Is the warp diverged at the point of statementB? I don’t know, on Volta.

The argument behind modifying the code from listing 7 to listing 8 is predicated on the idea that the Volta execution model allows threads in a warp to race ahead. There are no statements or guarantees in the Volta execution model as to when this will not occur (AFAIK). You seem to want to stipulate or postulate a case in which the Volta execution model, which allows threads to proceed at varying rates even within a warp, does not apply.

At the risk of further confusing things, your listing is really comparable to listing 7, not listing 8, although you asked about listing 8.
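
For illustration, a listing-8-style version of your kernel stages the partial sum in a register and places __syncwarp() between the shared-memory reads and writes. A rough sketch of that pattern as I understand it (my paraphrase, not the blog's exact code; kernel_sync is just a name I picked, and the zero-padding of the upper half mirrors your load):

//Assume only 32 threads in a block.
__global__ void kernel_sync(int *out, int *in) {
      volatile __shared__ int data[64];
      unsigned tid = threadIdx.x;

      data[tid] = in[tid];
      data[tid + 32] = 0;
      __syncwarp();                       //loads complete before the first read below

      int v = data[tid];
      v += data[tid + 16]; __syncwarp();  //all reads at this offset finish ...
      data[tid] = v;       __syncwarp();  //... before any write, and vice versa
      v += data[tid + 8];  __syncwarp();
      data[tid] = v;       __syncwarp();
      v += data[tid + 4];  __syncwarp();
      data[tid] = v;       __syncwarp();
      v += data[tid + 2];  __syncwarp();
      data[tid] = v;       __syncwarp();
      v += data[tid + 1];  __syncwarp();
      data[tid] = v;       __syncwarp();  //final data[0] is visible to all lanes

      out[tid] = data[0];                 //data[0] now holds the sum of in[0..31]
}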

I understand why there may be diverged warp activity at statementB. It's caused by the new feature introduced in Volta called independent thread scheduling: there is an if statement on the first line, and the warp is not guaranteed to re-converge on the second line.
But my question is: if there is no code divergence before exchanging data through shared memory, is it still necessary to use __syncwarp()?

The blog's listings 7 and 8 have no conditional code, yet they indicate __syncwarp() is necessary. I don't think that was a mistake. The only plausible reason for it is that the Volta execution model makes exactly the (lack of) guarantee that I stated.
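
If you want to poke at this empirically on a Volta-class GPU, you can compare both variants against a known sum. Here is a rough test-harness sketch (my own, untested; it assumes your kernel and the kernel_sync sketch above live in the same .cu file). Keep in mind that a correct printout proves nothing, since the hazard may simply not show up on a given run or GPU, but an incorrect one would settle the question:

#include <cstdio>

int main() {
      int h_in[32], h_out[32];
      for (int i = 0; i < 32; ++i) h_in[i] = i + 1;   //expected sum = 528

      int *d_in, *d_out;
      cudaMalloc(&d_in,  32 * sizeof(int));
      cudaMalloc(&d_out, 32 * sizeof(int));
      cudaMemcpy(d_in, h_in, 32 * sizeof(int), cudaMemcpyHostToDevice);

      kernel<<<1, 32>>>(d_out, d_in);                 //no __syncwarp
      cudaMemcpy(h_out, d_out, 32 * sizeof(int), cudaMemcpyDeviceToHost);
      printf("without __syncwarp: out[0] = %d (expected 528)\n", h_out[0]);

      kernel_sync<<<1, 32>>>(d_out, d_in);            //listing-8 style
      cudaMemcpy(h_out, d_out, 32 * sizeof(int), cudaMemcpyDeviceToHost);
      printf("with __syncwarp:    out[0] = %d (expected 528)\n", h_out[0]);

      cudaFree(d_in);
      cudaFree(d_out);
      return 0;
}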

The documentation has these excerpts:

However, this can lead to a rather different set of threads participating in the executed code than intended if the developer made assumptions about warp-synchronicity of previous hardware architectures.

You are wishing to make an assumption about warp-synchronicity. The Volta model breaks those assumptions.

This code is invalid because CUDA does not guarantee that the warp will diverge ONLY at the loop condition.

The Volta model does not give the programmer the ability to determine where a warp may diverge.

Again, you are welcome to code as you wish.

Thanks a lot.