Please explain the vector reduction

light86 · December 26, 2012, 11:03am

Hi all, I’m new to the CUDA world.
In this code:

__global__ void dot( float *a, float *b, float *c ) {
    __shared__ float cache[threadsPerBlock];
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    int cacheIndex = threadIdx.x;
    float temp = 0;
    while (tid < N) {
        temp += a[tid] * b[tid];
        tid += blockDim.x * gridDim.x;
    }
    // set the cache values
    cache[cacheIndex] = temp;
    // synchronize threads in this block
    __syncthreads();
    // for reductions, threadsPerBlock must be a power of 2
    // because of the following code
    int i = blockDim.x/2;
    while (i != 0) {
        if (cacheIndex < i)
            cache[cacheIndex] += cache[cacheIndex + i];
        __syncthreads();
        i /= 2;
    }
    if (cacheIndex == 0)
        c[blockIdx.x] = cache[0];
}

I don’t understand the purpose of int i = blockDim.x/2;
(the reduction happens within a single block). Please explain it to me.

Thank you

pasoleatis · December 26, 2012, 9:04pm

Let’s assume we have tpb = blockDim.x threads in the block. At each iteration the reduction halves the amount of data. In the first iteration, threads 0 to tpb/2 - 1 each add the element that sits tpb/2 positions further along, so tpb values become tpb/2 partial sums; that is why i starts at blockDim.x/2. The next iteration reduces the data to tpb/4, and so on until a single element, cache[0], holds the sum of everything.
Starting at tpb/2 and halving the stride also means that the active threads access consecutive shared-memory addresses, which avoids bank conflicts.
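To make the halving concrete, here is a small host-side sketch in plain C++ (no GPU needed) that mimics the reduction loop from the kernel. It is only an illustration: the function name simulate_block_reduction is made up for this post, the vector stands in for the shared cache array, and the inner for loop stands in for the threads with cacheIndex < i, which on the GPU would all run at the same time.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of the original kernel): simulates the
// in-block reduction loop on the CPU. `cache` plays the role of the
// __shared__ array, and its size plays the role of blockDim.x.
// Returns the stride i used at each step; the total ends up in cache[0].
std::vector<std::size_t> simulate_block_reduction(std::vector<float>& cache) {
    const std::size_t n = cache.size();
    // Same requirement as in the kernel: the size must be a power of 2.
    assert(n != 0 && (n & (n - 1)) == 0);

    std::vector<std::size_t> strides;
    std::size_t i = n / 2;                 // same role as blockDim.x / 2
    while (i != 0) {
        strides.push_back(i);
        // "Threads" 0..i-1 each add the element i positions further along,
        // folding the upper half of the live data onto the lower half.
        for (std::size_t t = 0; t < i; ++t)
            cache[t] += cache[t + i];
        // In the kernel, __syncthreads() sits here so that every thread
        // sees the finished partial sums before the next step.
        i /= 2;
    }
    return strides;
}
```

For eight values 1, 2, ..., 8 the strides are 4, 2, 1: the array starts as 6, 8, 10, 12 in its first four slots after the first step, then 16, 20 in the first two, and finally cache[0] holds 36.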
Check this image: