Hello everybody,
I've just started programming with CUDA and don't understand some things, so I'll ask you guys. There is a popular reduction algorithm that looks like this:
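// s is the distance between the two elements each active thread adds;
// it is halved on every pass.
for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1)
{
    if (tid < s)
    {
        // Thread tid folds the element s positions away into its own slot.
        sdata[tid] += sdata[tid + s];
    }
    // Every thread in the block must reach this barrier before the next pass.
    __syncthreads();
}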
And I don't really understand why it has no bank conflicts. Let's say I have an array of 64 int values copied into sdata. After dividing by 2, I have 32 threads working on it. That makes exactly one full warp, because there are 32 threads in a warp. But elements 0 and 16 of this array belong to the same bank (bank 0), so shouldn't there be a bank conflict when threads 0 and 16 read data from the same bank? I guess there is something wrong in my way of thinking, so I'm asking for an explanation.
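One way to test this reasoning is to compute the bank of every shared-memory word a warp touches in one pass of the loop. The sketch below is plain host-side C++, not device code, and it rests on assumptions that are not in the post: 32 banks of 4-byte words and conflicts counted per 32-thread warp, which matches compute capability 2.x and later. On 1.x devices there are 16 banks, so elements 0 and 16 do share a bank, but conflicts are counted per half-warp of 16 threads, so threads 0 and 16 never compete. The helper names (conflictDegree, stepConflictDegree, stridedConflictDegree) are hypothetical. The sketch also treats sdata[tid] and sdata[tid + s] as two separate accesses, since each is issued as its own load or store.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <set>
#include <vector>

// Assumptions (not from the original post): 32 banks, 4-byte words,
// 32 threads per warp, conflicts evaluated per warp and per access.
constexpr int kNumBanks = 32;
constexpr int kWarpSize = 32;

// Hypothetical helper: for the word indices touched by one warp in a single
// load or store, return the largest number of distinct words that fall into
// the same bank (1 = conflict-free, 0 = no thread was active).
int conflictDegree(const std::vector<int>& words) {
    std::vector<std::set<int>> perBank(kNumBanks);
    for (int w : words) perBank[w % kNumBanks].insert(w);
    std::size_t worst = 0;
    for (const auto& bank : perBank) worst = std::max(worst, bank.size());
    return static_cast<int>(worst);
}

// Worst conflict degree of one pass of the reduction loop with stride s.
// The accesses to sdata[tid + s] and sdata[tid] are checked separately.
int stepConflictDegree(int blockDim, int s) {
    assert(blockDim > 0 && s > 0);
    int worst = 0;
    for (int warpStart = 0; warpStart < blockDim; warpStart += kWarpSize) {
        const int warpEnd = std::min(warpStart + kWarpSize, blockDim);
        std::vector<int> upper, lower;
        for (int tid = warpStart; tid < warpEnd; ++tid) {
            if (tid < s) {                  // same guard as in the kernel
                upper.push_back(tid + s);   // sdata[tid + s]
                lower.push_back(tid);       // sdata[tid]
            }
        }
        worst = std::max({worst, conflictDegree(upper), conflictDegree(lower)});
    }
    return worst;
}

// For contrast only: every thread of one warp reads sdata[tid * stride].
int stridedConflictDegree(int stride) {
    std::vector<int> words;
    for (int tid = 0; tid < kWarpSize; ++tid) words.push_back(tid * stride);
    return conflictDegree(words);
}
```

Under these assumptions, stepConflictDegree(64, s) comes out as 1 for s = 32, 16, ..., 1: neighbouring threads always touch neighbouring words, so each bank is hit at most once per access. A strided pattern such as stridedConflictDegree(2) gives 2, which is the kind of access that does produce a conflict.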