# Vector reduction: finding the sum of all components of a vector

Please do not consider this a cross-post: I unintentionally posted it in the CUDA-Vista forum because my OS is 32-bit Vista.

As the subject says, I have a vector and want to implement a reduction to find the sum of its components. I looked at the CUDA SDK examples, but they appear to be optimized, and as a novice I want something simpler. So I picked a tutorial on reduction and adapted its code as follows:

``````
float *a_h, *a_d, *b_h, *b_d;
float *result;

// compute the dot product of a_h and b_h on the device

// allocate host and device vectors
a_h = (float *)malloc(4 * sizeof(float));   // a_h and b_h are 1x4 vectors (4 elements only, for testing)
b_h = (float *)malloc(4 * sizeof(float));
cudaMalloc((void **) &a_d, 4 * sizeof(float));
cudaMalloc((void **) &b_d, 4 * sizeof(float));

// initialize a_h and b_h
...
// let's say a_h = <0,1,2,3> = b_h

// copy the vectors to the device
cudaMemcpy(a_d, a_h, sizeof(float)*4, cudaMemcpyHostToDevice);
cudaMemcpy(b_d, b_h, sizeof(float)*4, cudaMemcpyHostToDevice);

// do component-wise multiplication
dotProduct<<<nBlocks, nBlockSize>>>(a_d, b_d, 4);  // nBlocks = 1 and nBlockSize = 4 (threads)
.....
``````

After dotProduct(…) runs, I get the expected result <0,1,4,9>.

I want to reduce this vector to get the final sum, 14.

This is my reduction function.

``````
__global__ void reductionOnDevice(float *pVector, unsigned int N, unsigned int BlockStride)
{
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int Stride = 2 * BlockStride;
    unsigned int j = blockDim.x;

    while (j > 0)
    {
        if (Stride * i < N)
            pVector[Stride*i] += pVector[Stride*i + (Stride * 2)];

        Stride *= 2;
        j /= 2;
    } // end while
} // end of reductionOnDevice()
``````

The following is called inside main() right after I get the result from dotProduct():

``````
// call the reduction function in CUDA to compute the final sum
while (nBlocks > 0)
{
    reductionOnDevice<<<nBlocks, blockSize/2>>>(c_d, N, Stride);

    Stride *= blockSize;
    N /= 2;
    nBlocks /= blockSize;
}
``````

After this I do:

``````
cudaMemcpy(c_h, c_d, sizeof(float)*4, cudaMemcpyDeviceToHost);

// check results
printf("\n");
for (i = 0; i < 4; i++)
    printf("c_h[%d]=%f\n", i, c_h[i]);
.....
``````

The result I get is still <0,1,4,9>, i.e. no change.

Do you know where the problem lies? Thank you.