As specified in subject, I am referring to reduction codes given in the whitepaper on Reductions in NVIDIA’s CUDA SDK Example Set.

The first version on reduction at shared memory using Interleaved addressing, works well if number of elements are all contained in one block but fails if there are more than one block.

To illustrate and make it clear:

I took a vector containing 4 elements <0…4> and blockSize =4 => numBlocks = 1 => Worked fine …expected result is 10.

When I take vector of length 5 and blockSize as 8 it still works fine.

But when I take vector size as 9 or 10 then kernel gives me result <28,17> as the final result => Intermediate result.There is still another step required

to complete the sum as 28+17.

```
__global__ void reduce0(float *g_idata, float *g_odata, unsigned int n)
{
extern __shared__ float sdata[];
// each thread loads one element from global to shared mem
unsigned int tid = threadIdx.x;
unsigned int i = blockIdx.x*blockDim.x + threadIdx.x;
sdata[tid] = g_idata[i];
__syncthreads();
// do reduction in shared mem
for(unsigned int s=1; s < blockDim.x; s *= 2)
{
if (tid % (2*s) == 0)
{
sdata[tid] += sdata[tid + s];
}
__syncthreads();
}
// write result for this block to global mem
if (tid == 0) g_odata[blockIdx.x] = sdata[0];
}
```

and in my main function I am calling it as :–

```
void main()
{
...
while(nBlocks > 0)
{
reduce0<<<nBlocks, blockSize>>>(a_d,result,N); //call function
N /= 2;
nBlocks /= 2; //num of blocks reduced by half.
}
..
}
```

Being new to CUDA, I know there must be some trivial error that I may be making. Please help me correct.Thanks.