Please do not treat this as a cross-post; I unintentionally posted it in the CUDA Vista forum first because my OS is Vista 32-bit.

As the subject says, I have a vector and want to implement a reduction to find the sum of its components. I followed the CUDA SDK examples, but as a novice I wanted something simpler, since the SDK example appears to be heavily optimized. So I picked a tutorial on reduction and adapted its code as follows:

```
float *a_h, *a_d, *b_h, *b_d;
float *result;
// Compute the dot product of a_h and b_h on the device.
// Allocate host and device buffers.
a_h = (float *)malloc(4 * sizeof(float)); // a_h and b_h are 1x4 vectors; 4 elements only, for testing
b_h = (float *)malloc(4 * sizeof(float));
cudaMalloc((void **)&a_d, 4 * sizeof(float));
cudaMalloc((void **)&b_d, 4 * sizeof(float));
// Initialize a_h and b_h.
...
// Let's say a_h = <0,1,2,3> = b_h.
// Copy the inputs to the device.
cudaMemcpy(a_d, a_h, sizeof(float) * 4, cudaMemcpyHostToDevice);
cudaMemcpy(b_d, b_h, sizeof(float) * 4, cudaMemcpyHostToDevice);
// Do component-wise multiplication.
dotProduct<<<nBlocks, nBlockSize>>>(a_d, b_d, 4); // nBlocks = 1 and nBlockSize = 4 (threads)
.....
```

After dotProduct(…) runs, I get the expected result, <0, 1, 4, 9>.

I want to reduce this vector to get the final sum, 14.
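For reference, here is a minimal host-side sketch of the value I expect, assuming the test inputs above. `dot_reference` is a hypothetical helper name of my own, not something from the SDK or the tutorial:

```c
/* Hypothetical CPU reference: multiply component-wise, then sum.
   Intended only as a check against the GPU result. */
static float dot_reference(const float *a, const float *b, unsigned int n)
{
    float sum = 0.0f;
    for (unsigned int i = 0; i < n; ++i)
        sum += a[i] * b[i];   /* same products the dotProduct kernel should produce */
    return sum;
}
```

Called with a = b = <0, 1, 2, 3> and n = 4, this returns 0 + 1 + 4 + 9 = 14, which is what the device reduction should leave in the first element.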

This is my reduction function.

```
__global__ void reductionOnDevice(float *pVector, unsigned int N, unsigned int BlockStride)
{
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int Stride = 2 * BlockStride;
    unsigned int j = blockDim.x;
    while (j > 0)
    {
        if (Stride * i < N)
            pVector[Stride * i] += pVector[Stride * i + (Stride * 2)];
        Stride *= 2;
        j /= 2;
        __syncthreads();
    } // end while
} // end of reductionOnDevice()
```
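To sanity-check the indexing, the same arithmetic can be traced on the host. This is only a rough sketch under my own assumptions: `max_index_read` is a hypothetical helper, it models a single block, and it takes `BlockStride` as 1 for the first launch.

```c
/* Hypothetical host-side trace of the index arithmetic in reductionOnDevice.
   Models one block and returns the largest index the kernel would read. */
static unsigned int max_index_read(unsigned int threads, unsigned int n,
                                   unsigned int block_stride)
{
    unsigned int max_read = 0;
    for (unsigned int i = 0; i < threads; ++i) {        /* i stands in for threadIdx.x */
        unsigned int stride = 2 * block_stride;
        for (unsigned int j = threads; j > 0; j /= 2) { /* j starts at blockDim.x */
            if (stride * i < n) {
                unsigned int src = stride * i + stride * 2; /* right-hand index in the kernel */
                if (src > max_read)
                    max_read = src;
            }
            stride *= 2;
        }
    }
    return max_read;
}
```

With 2 threads, N = 4 and BlockStride = 1, the largest index read is 8, while the vector only has valid indices 0 to 3. So the `Stride * 2` offset on the right-hand side looks suspect to me, though I am not sure that alone explains the unchanged output.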

The following is called inside main() right after I get the result from dotProduct():

```
// Call the reduction kernel to compute the final sum.
while (nBlocks > 0)
{
    reductionOnDevice<<<nBlocks, blockSize / 2>>>(c_d, N, Stride);
    Stride *= blockSize;
    N /= 2;
    nBlocks /= blockSize;
}
```

After this I do:

```
cudaMemcpy(c_h, c_d, sizeof(float) * 4, cudaMemcpyDeviceToHost);
// Check results.
printf("\n");
for (i = 0; i < 4; i++)
    printf("c_h[%d]=%f\n", i, c_h[i]);
.....
```

The result I get is still <0, 1, 4, 9>, i.e. no change.

Do you know where the problem lies? Thank you.