Explanation of Shared Memory Bank Conflicts for Reduction Example?

Hi,

I’m a little confused about how sequential addressing avoids shared memory bank conflicts in Mark Harris’ reduction example from the NVIDIA SDK.

I understand that shared memory is split into 32-bit banks and that each bank can be accessed by only one thread per instruction cycle. If I were to write four bytes of shared data per thread, I would have to arrange the data so that each thread’s four bytes fall into a different bank, i.e. space the data out so that no two threads touch the same bank.
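For example (my own toy illustration, not code from the SDK), I would expect a strided pattern like the first read below to conflict, while consecutive words should not:

    __shared__ float sdata[512];

    // Hypothetical strided read: with 16 banks, every thread of a half-warp
    // reads a word index that is a multiple of 16, so they all hit bank 0.
    float bad  = sdata[threadIdx.x * 16];

    // Consecutive words map to consecutive banks, so this read is conflict-free.
    float good = sdata[threadIdx.x];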

In the reduction example, however, sequential addressing is used to avoid shared memory bank conflicts, which seems to contradict this.

Can someone explain this concept to me?

Thanks.

Any thoughts?

It is all explained in the CUDA programming guide. With 16 banks (compute capability 1.x), 32-bit words 0, 16, 32, … belong to bank 0, words 1, 17, 33, … belong to bank 1, and so on. Successive 32-bit words fall into successive banks, so when consecutive threads access consecutive words, which is exactly what sequential addressing gives you, there are no conflicts.
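Concretely, the difference between the strided (interleaved) loop and the sequential loop in the reduction example looks roughly like this (paraphrased from memory, so treat the exact names as approximate):

    // Interleaved addressing with a strided index: consecutive active threads
    // access words 2*s apart, so several threads of a half-warp can land in
    // the same bank (2-way conflicts at s = 1, growing as s increases).
    for (unsigned int s = 1; s < blockDim.x; s *= 2)
    {
        int index = 2 * s * tid;
        if (index < blockDim.x)
            sdata[index] += sdata[index + s];
        __syncthreads();
    }

    // Sequential addressing: thread tid reads sdata[tid] and sdata[tid + s],
    // so consecutive threads touch consecutive 32-bit words, which fall into
    // consecutive banks, giving no conflicts for any value of s.
    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1)
    {
        if (tid < s)
            sdata[tid] += sdata[tid + s];
        __syncthreads();
    }

The point is that sequential addressing does not space the data out; it arranges the threads so that neighbouring threads read neighbouring words, which is exactly the conflict-free case.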

A related question. The example unrolls the last warp as follows:

if (tid < 32)
{
    sdata[tid] += sdata[tid + 32];
    sdata[tid] += sdata[tid + 16];
    sdata[tid] += sdata[tid + 8];
    sdata[tid] += sdata[tid + 4];
    sdata[tid] += sdata[tid + 2];
    sdata[tid] += sdata[tid + 1];
}

When I tried this with a parallel sum, however, I needed another call to __syncthreads() to get correct results:

if (tid < 32)
{
    sdata[tid] += sdata[tid + 32];
    __syncthreads();
    sdata[tid] += sdata[tid + 16];
    sdata[tid] += sdata[tid + 8];
    sdata[tid] += sdata[tid + 4];
    sdata[tid] += sdata[tid + 2];
    sdata[tid] += sdata[tid + 1];
}

Why is this the case?