Is there warp divergence in reduce0 kernel which is implemented in the CUDA sample Reduction?

I am new to CUDA optimization, and I am learning how to optimize parallel reduction in CUDA; the reduction project in the CUDA samples is a good example for me. But in the first kernel version, reduce0, there are theoretically two code segments with warp divergence. One is the if condition in the for loop. The other is the last if condition, which copies the sum result to the output pointer. I pasted the kernel function below. But in practice there is no warp divergence in the for loop, which doesn't conform to my mental model. For example, in thread 0, (tid % (2*s)) == 0, but in thread 1, (tid % (2*s)) == 1, and thread 0 and thread 1 are in the same warp, so this should cause warp divergence. Am I right? If not, which part is wrong? Thank you in advance.
The kernel function reduce0 is here.

template <class T>
__global__ void
reduce0(T *g_idata, T *g_odata, unsigned int n)
{
    T *sdata = SharedMemory<T>();

    // load shared mem
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x*blockDim.x + threadIdx.x;

    sdata[tid] = (i < n) ? g_idata[i] : 0;

    __syncthreads();

    // do reduction in shared mem
    for (unsigned int s = 1; s < blockDim.x; s *= 2)
    {
        // modulo arithmetic is slow!
        if ((tid % (2*s)) == 0)   // is this warp divergence?
        {
            sdata[tid] += sdata[tid + s];
        }

        __syncthreads();
    }

    // write result for this block to global mem
    if (tid == 0) g_odata[blockIdx.x] = sdata[0];  // divergence?
}


Yes, there is warp divergence in the for loop.

Some threads in a warp execute the body of the if statement, and some do not.

Since the if statement is within the body of the for loop, there is (generally speaking) warp divergence during execution of the for loop.

Thanks, Robert, for the rapid and detailed answer. Here is a second question about warp divergence in this example. I ran this code in the NVIDIA Visual Profiler, but there is no warp divergence in the for loop. My GPU is a Quadro M1000M. I attached a screenshot of the profiler here.
Thanks a lot.

I was using “warp divergence” in a rather casual way, to suggest that different threads in the warp would take different paths. The idea is that, at the C++ source code level, there is an “if” path and an “else” path, and some threads will take the “if” path and some will not. That’s what I had meant by warp divergence.

However, a more accurate definition, and the one used by the profiler, is whether the warp actually follows different SASS-level code paths, depending on the thread in the warp. The first definition above is easy to identify at the C++ source code level. This definition is not, due to the possibility that the compiler will use predication to handle simple conditional behavior.

Predication is not a C++ concept; it is something that is defined at the SASS or PTX level in CUDA. Predication uses special predicate registers which hold only a boolean value (true/false). These registers are set using a boolean test of some sort (greater than, less than, etc.).

At the SASS (or PTX) level, an instruction can be conditionally executed based on a per-thread predicate register value. Since the value of the per-thread predicate register may vary across the warp (just like any other register) we may observe “different threads doing different things” in the presence of conditional code.
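As an illustration only (not actual compiler output for this kernel), a guarded update at the PTX level might look like the following, where the `@%p1` guard means the instruction is issued by every thread but only takes effect where the predicate is true:

```
setp.eq.u32  %p1, %r1, 0;        // %p1 = (some boolean test)
@%p1 ld.shared.f32 %f1, [%rd1];  // executed only where %p1 is true
@%p1 st.shared.f32 [%rd2], %f2;  // no branch is taken either way
```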

That means that all threads will get to that instruction, in lock-step, with the same program counter value, but not all threads will apply the result of the instruction, depending on the value of their predicate register.

Simple conditional activity is handled this way at the SASS level (according to the compiler’s choosing), and it does not involve warp divergence (according to this more accurate description). So the profiler does not report any warp divergence.

You can read more about predication in the PTX manual, and there are plenty of questions about predication on various forums.

So with that preamble, I retract my previous statement and replace it with:

“There is evidently no warp divergence in that particular case.”

We could also say:

“There is a potential for warp divergence there.”

Thanks a lot.