I am new to CUDA optimization, and I am learning how to optimize a parallel reduction in CUDA; the reduction project in the CUDA sample code is a good example for me. In the first kernel version, reduce0, there should theoretically be two code segments that cause warp divergence: one is the if condition inside the for loop, and the other is the last if condition, which copies the sum result to the output pointer. I have pasted the kernel function below.

But there seems to be no warp divergence in the for loop, and this doesn't conform to my mental model. For example, at s = 1, thread 0 evaluates (tid % (2*s)) == 0, while thread 1 evaluates (tid % (2*s)) == 1, and thread 0 and thread 1 are in the same warp, so this should cause warp divergence. Am I right? If not, which part is wrong? Thank you in advance.
The kernel function reduce0 is here:
template <class T>
__global__ void
reduce0(T *g_idata, T *g_odata, unsigned int n)
{
    T *sdata = SharedMemory<T>();

    // load shared mem
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x*blockDim.x + threadIdx.x;

    sdata[tid] = (i < n) ? g_idata[i] : 0;

    __syncthreads();

    // do reduction in shared mem
    for (unsigned int s=1; s < blockDim.x; s *= 2)
    {
        // modulo arithmetic is slow!
        if ((tid % (2*s)) == 0)   // is this warp divergence?
        {
            sdata[tid] += sdata[tid + s];
        }

        __syncthreads();
    }

    // write result for this block to global mem
    if (tid == 0) g_odata[blockIdx.x] = sdata[0];   // divergence?
}
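
To make my mental model concrete, here is a small host-side sketch I put together (my own illustration, not part of the CUDA sample; the 32 is just the warp size) that prints which lanes of a single 32-thread warp would pass the (tid % (2*s)) == 0 test at each reduction step:

#include <cstdio>

int main()
{
    // Simulate the branch predicate of reduce0 for the 32 lanes
    // of one warp, across the reduction steps s = 1, 2, 4, ..., 16.
    for (unsigned int s = 1; s < 32; s *= 2)
    {
        printf("s=%2u active lanes:", s);

        for (unsigned int tid = 0; tid < 32; ++tid)
        {
            // Same condition as in the kernel's for loop.
            if ((tid % (2*s)) == 0)
                printf(" %2u", tid);
        }

        printf("\n");
    }

    return 0;
}

If I understand correctly, at every step only a subset of the 32 lanes is active (16, then 8, then 4, and so on), so the lanes of one warp take different paths through the if.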