Unrolling warps

Hello,
Im reading “Optimizing parallel reduction in cuda” by Mark Harris
http://developer.dow…c/reduction.pdf
and im tring to understand why “Unroll last warp saves useless work in ALL warps,not jus the last one”, (Reduction #5)
How is it saves usless work for the other warps?
can you please advise?

The quote is regarding the next code:

device void warpReduce(volatile int* sdata, int tid){
sdata[tid]+=sdata[tid+32];
sdata[tid]+=sdata[tid+16];
sdata[tid]+=sdata[tid+8];
sdata[tid]+=sdata[tid+4];
sdata[tid]+=sdata[tid+2];
sdata[tid]+=sdata[tid+1];
}

for(int s=blockDim.x/2;s>32;s>>=1){
if(tid<s)
sdata[tid]+=sdata[tid+s];
__syncthreades();
}
if(tid<32)warpReduce(sdata,tid);

Thanks!

If you look at the loop before the optimization you will notice that the loop runs until “s>0” and if you look after the optimization it runs to “s>32”. This results in 5 less iteration in the loop portion. And since all the warps execute the loop portion, this saves work because all the warps would have 5 fewer iterations. I think this is what Mark Harris is referring to.

Before…

for(unsigned int s=blockDim.x/2; s>0; s>>=1) {

  if (tid < s)

    sdata[tid] += sdata[tid + s];

  __syncthreads();

}

After…

for(unsigned int s=blockDim.x/2; s>32; s>>=1) {

  if (tid < s)

    sdata[tid] += sdata[tid + s];

  __syncthreads();

}

if (tid < 32)warpReduce(sdata, tid);

why is that ‘volatile’ necessary?

Because from the perspective of each single thread the contents of [font=“Courier New”]sdata[/font] can change outside of that thread’s control, and no barriers ([font=“Courier New”]__syncthreads()[/font]) are used.