ask help about the SDK demo: reduction

I am learning the cuda project demo reduction, and I have a question about the following code:
in the function:
template <class T, unsigned int blockSize>
global void
FUNC(reduce4)(T *g_idata, T *g_odata, unsigned int n)
{
SharedMemory smem;
T *sdata = smem.getPointer();

// perform first level of reduction,
// reading from global memory, writing to shared memory
unsigned int tid = threadIdx.x;
unsigned int i = blockIdx.x*(blockDim.x*2) + threadIdx.x;

sdata[tid] = (i < n) ? g_idata[i] : 0;
if (i + blockSize < n) 
    sdata[tid] += g_idata[i+blockSize];

__syncthreads();

// do reduction in shared mem
for(unsigned int s=blockDim.x/2; s>32; s>>=1)
{
    if (tid < s)
    {
        sdata[tid] += sdata[tid + s];
    }
    __syncthreads();
}

#ifndef DEVICE_EMULATION
if (tid < 32)
#endif
{
if (blockSize >= 64) { sdata[tid] += sdata[tid + 32]; EMUSYNC; }
if (blockSize >= 32) { sdata[tid] += sdata[tid + 16]; EMUSYNC; }
if (blockSize >= 16) { sdata[tid] += sdata[tid + 8]; EMUSYNC; }
if (blockSize >= 8) { sdata[tid] += sdata[tid + 4]; EMUSYNC; }
if (blockSize >= 4) { sdata[tid] += sdata[tid + 2]; EMUSYNC; }
if (blockSize >= 2) { sdata[tid] += sdata[tid + 1]; EMUSYNC; }
}

// write result for this block to global mem
if (tid == 0) g_odata[blockIdx.x] = sdata[0];

}

When unrolling the last 6 iterations, the last few lines I think should be:
if (tid<32) { sdata[tid] += sdata[tid + 32]; }
if (tid<16) { sdata[tid] += sdata[tid + 16]; }
if (tid<8 ) { sdata[tid] += sdata[tid + 8]; }
if (tid<4 ) { sdata[tid] += sdata[tid + 4]; }
if (tid<2 ) { sdata[tid] += sdata[tid + 2]; }
if (tid<1 ) { sdata[tid] += sdata[tid + 1]; }

I really don’t unstand why the programmer do not need to check “if(tid<s)”. Can anyone tell me?
Thanks.

If I understand this correctly, if you remove ‘if (tid < s)’, then each thread will modify shared mem with +=. That means that threads from other warps may read the modified result. In this particular case that ‘if’ disables writes of tid >= s threads which may screw up the input data for += operator for tid < s threads.

Edit:

Sorry - I didn’t quite answer your question. But because that ‘if’ is used to synchronize memory basically, you do not need to do that for threads inside the warp, since all threads inside a warp they are synchronized automagically on every instruction.

Sergey.

More on that, you can still rewrite that loop in a way where you don’t have that ‘if’ and still have one __syncthreads per iteration.
The idea is the following - you accumulate the summ not in shared memory, but in the registers using shared memory for data transfer only. This way you’ll also end up using twice less shared memory.

The actual code you could try to compose as an exercise ;).

Sergey.

Thanks for your explanation, but I think maybe I should make my question clearly.

When unrolling the last 6 iterations, I think the following code:

if (tid < 32)

{

{ sdata[tid] += sdata[tid + 32];}

{ sdata[tid] += sdata[tid + 16];}

{ sdata[tid] += sdata[tid + 8]; }

{ sdata[tid] += sdata[tid + 4]; }

{ sdata[tid] += sdata[tid + 2]; }

{ sdata[tid] += sdata[tid + 1]; }

}

should be written as the following:

if (tid<32) { sdata[tid] += sdata[tid + 32]; }

if (tid<16) { sdata[tid] += sdata[tid + 16]; }

if (tid<8 ) { sdata[tid] += sdata[tid + 8]; }

if (tid<4 ) { sdata[tid] += sdata[tid + 4]; }

if (tid<2 ) { sdata[tid] += sdata[tid + 2]; }

if (tid<1 ) { sdata[tid] += sdata[tid + 1]; }

When there is only 32 threads, the code “if (tid<32) { sdata[tid] += sdata[tid + 32]; }” works. After this step, we only want to reduce the data for thread idx (tid <16), thus we should write explicitly with " if (tid<16) { sdata[tid] += sdata[tid + 16]; } ", not like the origional code “{ sdata[tid] += sdata[tid + 16];}”. Similarly, we need “if (tid<8 ) { sdata[tid] += sdata[tid + 8]; }” to ensure the reduction only for thread idx (tid <8), etc.

I don’t understand why the origional code in the demos do not check the thread index, like "if (tid<16) ", “if (tid<8 )”, “if (tid<4 )”, etc. Actually, after “if (tid<32) { sdata[tid] += sdata[tid + 32]; }”, there are only 32 elements left. Thus we need thread 0~ 15 do the reduction, so I write “if (tid<16) { sdata[tid] += sdata[tid + 16]; }”. But in the demo, it writes “{ sdata[tid] += sdata[tid + 16];}”. In that case, all the threads (thread 0~31) would do the same thing, which I think is wrong. This really confused me! :blink:

I think I have made my question clearly enough this time, can you tell me once more? Thank you.

Because threads 0-31 belong to the same warp, they are implicitly synchronized, and the order of execution of those statements should be guaranteed. The fact that every thread in that first warp is executing each statement has no effect on the arithmetic result of the reduction, because the “valid” parts of the reduction sums are only being carried in the first powers of two part of the shared memory array at each stage (ie. all 32 hold valid partial sums after the first statement, then only the first 16 after the second, the first 8 after the third, etc).

I got it! Thanks for your explanation, avidday. I don’t need “if(tid<16)” to do “{ sdata[tid] += sdata[tid + 16]; }” because the result that thread 16~31 execute are thrown away. Actually we do not need that result.

Thank you again. I have been confused by that code for several days, and now I get it. :haha: