Hello,
Im reading “Optimizing parallel reduction in cuda” by Mark Harris http://developer.dow…c/reduction.pdf
and im tring to understand why “Unroll last warp saves useless work in ALL warps,not jus the last one”, (Reduction #5)
How is it saves usless work for the other warps?
can you please advise?
If you look at the loop before the optimization you will notice that the loop runs until “s>0” and if you look after the optimization it runs to “s>32”. This results in 5 less iteration in the loop portion. And since all the warps execute the loop portion, this saves work because all the warps would have 5 fewer iterations. I think this is what Mark Harris is referring to.
Before…
for(unsigned int s=blockDim.x/2; s>0; s>>=1) {
if (tid < s)
sdata[tid] += sdata[tid + s];
__syncthreads();
}
After…
for(unsigned int s=blockDim.x/2; s>32; s>>=1) {
if (tid < s)
sdata[tid] += sdata[tid + s];
__syncthreads();
}
if (tid < 32)warpReduce(sdata, tid);
Because from the perspective of each single thread the contents of [font=“Courier New”]sdata[/font] can change outside of that thread’s control, and no barriers ([font=“Courier New”]__syncthreads()[/font]) are used.