I am new to CUDA optimization, and I am learning how to optimize a parallel reduction in CUDA; the reduction project in the CUDA sample code is a good example for me. In the first kernel version, reduce0, there should theoretically be two code segments that cause warp divergence: one is the if condition inside the for loop, and the other is the last if condition, which copies the sum result to the output pointer. I have pasted the kernel function below.

But there seems to be no warp divergence in the for loop, and this doesn't conform to my mental model. For example, at s = 1, thread 0 evaluates (tid % (2*s)) == 0, while thread 1 evaluates (tid % (2*s)) == 1, and thread 0 and thread 1 are in the same warp, so this should cause warp divergence. Am I right? If not, which part is wrong? Thank you in advance.
The kernel function reduce0 is here:
template <class T>
__global__ void
reduce0(T *g_idata, T *g_odata, unsigned int n)
{
    T *sdata = SharedMemory<T>();

    // load shared mem
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x*blockDim.x + threadIdx.x;

    sdata[tid] = (i < n) ? g_idata[i] : 0;

    __syncthreads();

    // do reduction in shared mem
    for (unsigned int s=1; s < blockDim.x; s *= 2)
    {
        // modulo arithmetic is slow!
        if ((tid % (2*s)) == 0)   // is this warp divergence?
        {
            sdata[tid] += sdata[tid + s];
        }

        __syncthreads();
    }

    // write result for this block to global mem
    if (tid == 0) g_odata[blockIdx.x] = sdata[0];   // divergence?
}
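
To make my mental model concrete, here is a small host-side sketch I put together (my own illustration, not part of the CUDA sample; the 32 is just the warp size) that prints which lanes of a single 32-thread warp would pass the (tid % (2*s)) == 0 test at each reduction step:

#include <cstdio>

int main()
{
    // Simulate the branch predicate of reduce0 for the 32 lanes
    // of one warp, across the reduction steps s = 1, 2, 4, ..., 16.
    for (unsigned int s = 1; s < 32; s *= 2)
    {
        printf("s=%2u active lanes:", s);

        for (unsigned int tid = 0; tid < 32; ++tid)
        {
            // Same condition as in the kernel's for loop.
            if ((tid % (2*s)) == 0)
                printf(" %2u", tid);
        }

        printf("\n");
    }

    return 0;
}

If I understand correctly, at every step only a subset of the 32 lanes is active (16, then 8, then 4, and so on), so the lanes of one warp take different paths through the if.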