Why is __syncwarp() necessary in an undivergent warp reduction?

I have read a technical blog post by an NVIDIA employee, linked below:
using-cuda-warp-level-primitives

I don’t understand why __syncwarp() is necessary in listing 8.

The article said:
“The CUDA programming model does not guarantee that all the reads will be performed before all the writes.”

I came up with a situation where warp-level sync is necessary, listed below:

//Assume only 32 threads in a block.
volatile __shared__ int data[32];
if (threadIdx.x % 2 == 0) {
      //Do something
} else {
      //Do something
}

data[threadIdx.x] = threadIdx.x;
int value = data[(threadIdx.x + 1) % 32];

//Write value to global memory

In the code above, since the hardware may not re-converge after the else branch, the threads in the same warp may not be executing the same instruction at the same time, so the final result is undefined.
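
As far as I understand, a __syncwarp() between the write and the read imposes the order that is missing; a minimal sketch of the same fragment, wrapped in a kernel just for illustration (the kernel name and the out parameter are mine, not from the blog):

//Assume only 32 threads in a block.
__global__ void exchange(int *out) {
      volatile __shared__ int data[32];
      if (threadIdx.x % 2 == 0) {
            //Do something
      } else {
            //Do something
      }

      data[threadIdx.x] = threadIdx.x;
      __syncwarp();   //all 32 threads must arrive here before any proceeds,
                      //so every write above completes before any read below
      int value = data[(threadIdx.x + 1) % 32];

      out[threadIdx.x] = value;   //write value to global memory
}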

But what if there is no warp-level divergence? Is it still necessary to add __syncwarp()? Assume the shared memory array is declared volatile.

//Assume only 32 threads in a block.
__global__ void kernel(int *out, int *in) {
      volatile __shared__ int data[64];
      data[threadIdx.x] = in[threadIdx.x];
      data[threadIdx.x + 32] = 0;   //zero the upper half so the +16 read below is defined
      //Do I need __syncwarp()?
      data[threadIdx.x] += data[threadIdx.x + 16];
      data[threadIdx.x] += data[threadIdx.x + 8];
      data[threadIdx.x] += data[threadIdx.x + 4];
      data[threadIdx.x] += data[threadIdx.x + 2];
      data[threadIdx.x] += data[threadIdx.x + 1];

      //data[0] should be the sum of in[0..31]
      out[threadIdx.x] = data[0];
}

If it’s still necessary, please tell me why some threads may not read data written by other threads.

How do you know that?

The Volta execution model is allowed to move threads forward in any order (even within a warp, even in the absence of conditional code), unless you impose order.

Could you give an example where warp-level divergence exists but there is no conditional code?

No, I can't. You're welcome to code as you wish, of course. But I can imagine a case where there is diverged warp activity on Volta where you may not expect it:

if (threadIdx.x < 1) statementA;
statementB;

Is the warp diverged at the point of statementB? I don’t know, on Volta.

The argument behind modifying the code from listing 7 to listing 8 is predicated on the idea that the Volta execution model allows threads in a warp to race ahead. There are no statements or guarantees in the Volta execution model as to when this will not occur (AFAIK). You seem to want to stipulate or postulate a case in which the Volta execution model, which allows threads to proceed at varying rates even within a warp, does not apply.

At the risk of further confusing things, your listing is really comparable to listing 7, not listing 8, although you asked about listing 8.
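
For illustration, a listing-8-style version of your kernel stages the partial sum in a register and places __syncwarp() between the shared-memory reads and writes. A rough sketch of that pattern as I understand it (my paraphrase, not the blog's exact code; kernel_sync is just a name I picked, and the zero-padding of the upper half mirrors your load):

//Assume only 32 threads in a block.
__global__ void kernel_sync(int *out, int *in) {
      volatile __shared__ int data[64];
      unsigned tid = threadIdx.x;

      data[tid] = in[tid];
      data[tid + 32] = 0;
      __syncwarp();                       //loads complete before the first read below

      int v = data[tid];
      v += data[tid + 16]; __syncwarp();  //all reads at this offset finish ...
      data[tid] = v;       __syncwarp();  //... before any write, and vice versa
      v += data[tid + 8];  __syncwarp();
      data[tid] = v;       __syncwarp();
      v += data[tid + 4];  __syncwarp();
      data[tid] = v;       __syncwarp();
      v += data[tid + 2];  __syncwarp();
      data[tid] = v;       __syncwarp();
      v += data[tid + 1];  __syncwarp();
      data[tid] = v;       __syncwarp();  //final data[0] is visible to all lanes

      out[tid] = data[0];                 //data[0] now holds the sum of in[0..31]
}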

I understand why there may be diverged warp activity at statementB. It's caused by the new feature introduced in Volta called independent thread scheduling: there is an if statement on the first line, and the warp is not guaranteed to re-converge on the second line.
But my question is: if there is no code divergence before exchanging data through shared memory, is it still necessary to use __syncwarp()?

The blog's listings 7 and 8 have no conditional code, yet they indicate __syncwarp() is necessary. I don't think that was a mistake. The only plausible reason for it is that the Volta execution model makes exactly the (lack of) guarantee that I stated.
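
If you want to poke at this empirically on a Volta-class GPU, you can compare both variants against a known sum. Here is a rough test-harness sketch (my own, untested; it assumes your kernel and the kernel_sync sketch above live in the same .cu file). Keep in mind that a correct printout proves nothing, since the hazard may simply not show up on a given run or GPU, but an incorrect one would settle the question:

#include <cstdio>

int main() {
      int h_in[32], h_out[32];
      for (int i = 0; i < 32; ++i) h_in[i] = i + 1;   //expected sum = 528

      int *d_in, *d_out;
      cudaMalloc(&d_in,  32 * sizeof(int));
      cudaMalloc(&d_out, 32 * sizeof(int));
      cudaMemcpy(d_in, h_in, 32 * sizeof(int), cudaMemcpyHostToDevice);

      kernel<<<1, 32>>>(d_out, d_in);                 //no __syncwarp
      cudaMemcpy(h_out, d_out, 32 * sizeof(int), cudaMemcpyDeviceToHost);
      printf("without __syncwarp: out[0] = %d (expected 528)\n", h_out[0]);

      kernel_sync<<<1, 32>>>(d_out, d_in);            //listing-8 style
      cudaMemcpy(h_out, d_out, 32 * sizeof(int), cudaMemcpyDeviceToHost);
      printf("with __syncwarp:    out[0] = %d (expected 528)\n", h_out[0]);

      cudaFree(d_in);
      cudaFree(d_out);
      return 0;
}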

The documentation has these excerpts:

However, this can lead to a rather different set of threads participating in the executed code than intended if the developer made assumptions about warp-synchronicity of previous hardware architectures.

You are wishing to make an assumption about warp-synchronicity. The Volta model breaks those assumptions.

This code is invalid because CUDA does not guarantee that the warp will diverge ONLY at the loop condition.

The Volta model does not give the programmer the ability to determine where a warp may diverge.

Again, you are welcome to code as you wish.

Thanks a lot.