Vector reduction: finding the sum of all components of a vector

Please do not treat this as a cross-post. I unintentionally posted it in the CUDA-Vista forum first because my OS is Vista 32.

As the subject says, I have a vector and want to implement a reduction that sums its components. I looked at the CUDA SDK reduction example, but it seems heavily optimized, and as a novice I want something simpler. So I picked a tutorial on reduction and adapted its code as follows:

float *a_h, *a_d, *b_h, *b_d;
float *result;

//compute dot product between a_h and b_h inside CUDA.

//allocate a_h and b_h on the host, a_d and b_d on the device
a_h = (float *)malloc(4 * sizeof(float));   //a_h and b_h are 1*4 vectors. 4 elements only -- for testing
b_h = (float *)malloc(4 * sizeof(float));
cudaMalloc((void **) &a_d, 4 * sizeof(float));
cudaMalloc((void **) &b_d, 4 * sizeof(float));

//initialize a_h and b_h
...

Let's say a_h = b_h = <0,1,2,3>.

//copy stuff
cudaMemcpy(a_d, a_h, sizeof(float)*4, cudaMemcpyHostToDevice);
cudaMemcpy(b_d, b_h, sizeof(float)*4, cudaMemcpyHostToDevice);

//Do component wise multiplication
dotProduct<<<nBlocks , nBlockSize >>>(a_d, b_d, 4);  //nBlocks = 1 and nBlockSize = 4 (...threads)
.....

After dotProduct(…) runs, I get the expected result <0,1,4,9>.

I want to reduce this vector to get the final sum, 14.
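For reference, the result I am after can be sketched on the host. This is only a minimal CPU emulation of an in-place pairwise (tree) reduction, not the SDK's method; `tree_reduce_host` is a hypothetical name, and each pass of the outer loop stands in for one parallel step on the GPU.

```c
#include <assert.h>
#include <stddef.h>

/* Sum v[0..n-1] in place with a pairwise (tree) reduction.
   The total ends up in v[0], which is also returned.
   Each pass of the outer loop corresponds to one parallel step. */
float tree_reduce_host(float *v, size_t n)
{
    assert(v != NULL && n > 0);
    for (size_t stride = 1; stride < n; stride *= 2) {
        /* Element i absorbs its partner exactly `stride` positions away. */
        for (size_t i = 0; i + stride < n; i += 2 * stride) {
            v[i] += v[i + stride];
        }
    }
    return v[0];
}
```

Called on <0,1,4,9> with n = 4, the first pass gives <1,1,13,9> and the second leaves 14 in v[0].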

This is my reduction function:

__global__ void reductionOnDevice(float *pVector, unsigned int N, unsigned int BlockStride)
{
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int Stride = 2 * BlockStride;
    unsigned int j = blockDim.x;

    while(j>0)
    {
        if(Stride * i < N)
            pVector[Stride*i] += pVector[Stride*i + (Stride * 2)];
        Stride *= 2;
        j /= 2;
        __syncthreads();
    }//end while
}//end of reductionOnDevice()
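One thing I can check on the host is which element the update line reads. The sketch below only reproduces the index arithmetic of the first pass of the while loop; `first_read_index` is a hypothetical helper, and it assumes BlockStride is at least 1, since the initial value of Stride in main() is not shown here.

```c
#include <assert.h>

/* Index read by the right-hand side of
       pVector[Stride*i] += pVector[Stride*i + (Stride * 2)];
   on the first pass of the while loop, where Stride = 2 * BlockStride. */
unsigned int first_read_index(unsigned int i, unsigned int block_stride)
{
    /* Assumption: BlockStride >= 1 (its initial value is not shown in the post). */
    assert(block_stride >= 1);
    unsigned int stride = 2 * block_stride;
    return stride * i + stride * 2;
}
```

For thread i = 0 with BlockStride = 1 this evaluates to 4, which is one past the last valid index (3) of a 4-element vector, so the kernel may be reading outside the array.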

The following is called inside main() right after I get the result from dotProduct():

//call reduction function in CUDA to compute final sum
while(nBlocks > 0)
{
    reductionOnDevice<<<nBlocks, blockSize/2 >>>(c_d, N , Stride);
    Stride *= blockSize;
    N /= 2;
    nBlocks /= blockSize;
}

After this I do:

cudaMemcpy(c_h, c_d, sizeof(float)*4, cudaMemcpyDeviceToHost);

// check results
printf("\n");
for (i=0; i<4; i++)
    printf("c_h[%d]=%f\n", i , c_h[i]);
.....

The result I get is still <0,1,4,9>, i.e. no change.

Do you know where the problem lies? Thank you.