CUDA Reduction

Hello everybody,

I've just started programming with CUDA and there are some things I don't understand, so I'll ask you guys. There is a popular reduction algorithm that looks like this:

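for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
    if (tid < s)
        sdata[tid] += sdata[tid + s];
    __syncthreads();  // all partial sums of this step must be written before the next step reads them
}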

I don't really understand why it has no bank conflicts. Let's say I have an array of 64 ints copied into sdata. Dividing by 2 gives 32 threads working on it in the first step, which is exactly one full warp, because a warp has 32 threads. But elements 0 and 16 of this array belong to the same bank (bank 0), so shouldn't there be a bank conflict when threads 0 and 16 read from that bank? I guess there is something wrong in my way of thinking, so I'm asking for an explanation.

Good question.

The answer is easy: memory accesses to both shared and global memory are handled per HALF-warp (16 threads at a time).
For shared memory, you get a bank conflict only when two threads in the same half-warp access different words in the same bank.
In your example, thread 0 and thread 16 do indeed read from the same bank, but they’re in different half-warps, so you’re OK.
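
If you want to check this by hand, here is a minimal host-side C++ sketch that models the bank mapping for the reduction loop above. It is only a model, not a measurement: it assumes 16 banks of 32-bit words (which is what makes elements 0 and 16 share bank 0) and conflicts resolved per half-warp of 16 threads. The names kBanks, kHalfWarp, bankOf, maxConflictDegree and worstDegreeForReduction are made up for illustration.

```cpp
#include <algorithm>
#include <vector>

// Assumptions (hypothetical model, not queried from real hardware):
// 16 banks of 32-bit words, conflicts resolved per half-warp of 16 threads.
constexpr int kBanks = 16;
constexpr int kHalfWarp = 16;

// Bank that holds the 32-bit word at index `word` of the shared array.
int bankOf(int word) { return word % kBanks; }

// Largest number of active threads (tid < s) of one half-warp that hit the
// same bank when each reads sdata[tid + offset]. 1 means conflict-free,
// 0 means the half-warp has no active threads in this step.
int maxConflictDegree(int halfWarp, int s, int offset) {
    std::vector<int> hits(kBanks, 0);
    for (int lane = 0; lane < kHalfWarp; ++lane) {
        int tid = halfWarp * kHalfWarp + lane;
        if (tid < s) ++hits[bankOf(tid + offset)];
    }
    return *std::max_element(hits.begin(), hits.end());
}

// Worst degree over every step of the loop, every half-warp, and both
// reads of the statement sdata[tid] += sdata[tid + s].
int worstDegreeForReduction(int blockDim) {
    int worst = 0;
    for (int s = blockDim / 2; s > 0; s >>= 1) {
        for (int hw = 0; hw * kHalfWarp < blockDim; ++hw) {
            worst = std::max(worst, maxConflictDegree(hw, s, 0));  // sdata[tid]
            worst = std::max(worst, maxConflictDegree(hw, s, s));  // sdata[tid + s]
        }
    }
    return worst;
}
```

Under these assumptions, worstDegreeForReduction(64) comes out as 1: bankOf(0) and bankOf(16) are equal, but threads 0 and 16 are never counted in the same half-warp, and within one half-warp the 16 consecutive indices always land in 16 different banks.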

OK, now I understand.
Thanks for the help.