Unrolling warps

zmha · October 30, 2011, 10:49am

Hello,
I'm reading “Optimizing Parallel Reduction in CUDA” by Mark Harris
http://developer.dow…c/reduction.pdf
and I'm trying to understand why “Unroll last warp saves useless work in ALL warps, not just the last one” (Reduction #5).
How does it save useless work for the other warps?
Can you please advise?

The quote refers to the following code:

__device__ void warpReduce(volatile int* sdata, int tid) {
  sdata[tid] += sdata[tid + 32];
  sdata[tid] += sdata[tid + 16];
  sdata[tid] += sdata[tid + 8];
  sdata[tid] += sdata[tid + 4];
  sdata[tid] += sdata[tid + 2];
  sdata[tid] += sdata[tid + 1];
}

for (int s = blockDim.x / 2; s > 32; s >>= 1) {
  if (tid < s)
    sdata[tid] += sdata[tid + s];
  __syncthreads();
}
if (tid < 32) warpReduce(sdata, tid);

Thanks!

sunsetquest · October 30, 2011, 9:45pm

If you look at the loop before the optimization, you will notice that it runs while “s > 0”; after the optimization it runs while “s > 32”. That removes 6 iterations from the loop (s = 32, 16, 8, 4, 2, 1). Since every warp in the block executes the loop, including its __syncthreads() barrier, every warp saves those 6 iterations, not just the last one. I think this is what Mark Harris is referring to.

Before…

for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
  if (tid < s)
    sdata[tid] += sdata[tid + s];
  __syncthreads();
}

After…

for (unsigned int s = blockDim.x / 2; s > 32; s >>= 1) {
  if (tid < s)
    sdata[tid] += sdata[tid + s];
  __syncthreads();
}
if (tid < 32) warpReduce(sdata, tid);
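
To make the iteration count concrete, here is a minimal sketch that is not taken from the slides or from this thread. Every thread counts its own trips around the loop, so the saving can be checked for all warps at once. The kernel name countLoopTrips, the stop parameter, and the block size of 256 are assumptions chosen purely for illustration.

```cuda
#include <cassert>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: each thread counts how many times it goes around the
// reduction loop, and so how many __syncthreads() barriers it waits at.
// stop = 0 mimics the "before" loop (s > 0), stop = 32 the "after" loop (s > 32).
__global__ void countLoopTrips(unsigned int stop, int* trips) {
    int n = 0;
    for (unsigned int s = blockDim.x / 2; s > stop; s >>= 1) {
        ++n;
        __syncthreads();
    }
    trips[threadIdx.x] = n;
}

int main() {
    const int threads = 256;  // assumed block size: 8 warps of 32 threads
    int host[threads];
    int* dev = nullptr;
    cudaError_t err = cudaMalloc(&dev, sizeof(host));
    assert(err == cudaSuccess);

    const unsigned int stops[2] = {0u, 32u};
    int perThread[2];
    for (int k = 0; k < 2; ++k) {
        countLoopTrips<<<1, threads>>>(stops[k], dev);
        err = cudaMemcpy(host, dev, sizeof(host), cudaMemcpyDeviceToHost);
        assert(err == cudaSuccess);
        // Every thread, and therefore every warp, makes the same number of trips.
        for (int i = 0; i < threads; ++i) assert(host[i] == host[0]);
        perThread[k] = host[0];
    }

    // With 256 threads this prints: before: 8, after: 2, saved per thread: 6
    printf("before: %d, after: %d, saved per thread: %d\n",
           perThread[0], perThread[1], perThread[0] - perThread[1]);
    cudaFree(dev);
    return 0;
}
```

The inner check that all entries of host are equal is the point of the exercise: threads in warps 1 to 7 skip the same 6 trips (and barriers) as the threads in warp 0.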

dgetrf · December 11, 2011, 2:11am

Why is the ‘volatile’ necessary?

tera · December 11, 2011, 1:17pm

Because, from the perspective of each single thread, the contents of sdata can change outside of that thread’s control, and no barriers (__syncthreads()) are used.
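
Put differently, without volatile the compiler is free to keep sdata[tid] in a register between the six additions, so one thread might never see what its neighbours wrote. As a hedged illustration of the same idea, the sketch below swaps volatile for explicit warp-level barriers. This is a different technique from the one in the slides, and it assumes CUDA 9 or later, where __syncwarp() is available. The names warpReduceSync and reduceBlock, and the fixed block size of 256, are hypothetical.

```cuda
#include <cassert>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical variant of warpReduce that uses explicit warp barriers
// instead of volatile. Must be called by all 32 threads of the warp.
__device__ void warpReduceSync(int* sdata, unsigned int tid) {
    int v = sdata[tid];
    for (unsigned int offset = 32; offset > 0; offset >>= 1) {
        v += sdata[tid + offset];  // read the partner element
        __syncwarp();              // every lane has read before any lane writes
        sdata[tid] = v;
        __syncwarp();              // every write is visible before the next read
    }
}

// Sums one block of 256 ints; assumes it is launched with blockDim.x == 256.
__global__ void reduceBlock(const int* in, int* out) {
    __shared__ int sdata[256];
    unsigned int tid = threadIdx.x;
    sdata[tid] = in[tid];
    __syncthreads();

    // Same loop as in the thread: stop once a single warp is left.
    for (unsigned int s = blockDim.x / 2; s > 32; s >>= 1) {
        if (tid < s) sdata[tid] += sdata[tid + s];
        __syncthreads();
    }
    if (tid < 32) warpReduceSync(sdata, tid);
    if (tid == 0) *out = sdata[0];
}

int main() {
    const int n = 256;
    int h[n];
    for (int i = 0; i < n; ++i) h[i] = i + 1;  // 1 + 2 + ... + 256 = 32896

    int* dIn = nullptr;
    int* dOut = nullptr;
    cudaMalloc(&dIn, sizeof(h));
    cudaMalloc(&dOut, sizeof(int));
    cudaMemcpy(dIn, h, sizeof(h), cudaMemcpyHostToDevice);

    reduceBlock<<<1, n>>>(dIn, dOut);

    int sum = 0;
    cudaError_t err = cudaMemcpy(&sum, dOut, sizeof(int), cudaMemcpyDeviceToHost);
    assert(err == cudaSuccess);
    assert(sum == 32896);
    printf("sum = %d\n", sum);

    cudaFree(dIn);
    cudaFree(dOut);
    return 0;
}
```

The two barriers per step separate the reads from the writes, which the original code relies on the warp executing in lockstep to achieve. Either way the goal is the one described above: each thread must observe the values other threads in the warp have stored to sdata.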
