Hello everyone!

I need to implement a reduction in my program.

So I’ve read the NVIDIA paper about it (a good paper, by the way), and now I’m trying to do the same thing.

And obviously, it doesn’t work!

I’m doing exactly the same thing as the first example in the SDK, so I assume my mistake is in the threadsPerBlock and/or the dimGrid I’ve chosen.

I’ve tried lots of different values, but the result is the same.

Here is my code:

```
#include <iostream>
#include <cuda_runtime.h>
using namespace std;

// Kernel that executes on the CUDA device
__global__ void reduction(float *result_in, float *result_out)
{
    extern __shared__ float sdata[];
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;

    // Each thread loads one element into shared memory
    sdata[tid] = result_in[i];
    __syncthreads();

    // Interleaved-addressing tree reduction (first SDK example)
    for (unsigned int s = 1; s < blockDim.x; s *= 2)
    {
        if (tid % (2 * s) == 0)
        {
            sdata[tid] += sdata[tid + s];
        }
        __syncthreads();
    }

    // Thread 0 writes this block's partial sum
    if (tid == 0) result_out[blockIdx.x] = sdata[0];
}

int main()
{
    const int W = 16;
    float *d_result;
    float *d_result1;
    float h_result[W];
    float h_result1[W];

    for (int i = 0; i < W; i++) h_result[i] = i;

    cudaMalloc((void **)&d_result, W * sizeof(float));
    cudaMalloc((void **)&d_result1, W * sizeof(float));
    cudaMemcpy(d_result, h_result, W * sizeof(float), cudaMemcpyHostToDevice);

    dim3 threadPerBlock(16, 2);
    dim3 dimGrid(W / threadPerBlock.x, 1);
    reduction<<<dimGrid, threadPerBlock>>>(d_result, d_result1);

    cudaMemcpy(h_result1, d_result1, W * sizeof(float), cudaMemcpyDeviceToHost);

    // Reference sum computed on the host for comparison
    float res = 0;
    for (int k = 0; k < W; k++) res += h_result[k];
    cout << "Res : " << res << endl;
    cout << "Res2 : " << h_result1[0] << endl;

    cudaFree(d_result);
    cudaFree(d_result1);
    return 0;
}
```

I’ve chosen a power of two, so it should work! And yet I get 28 instead of 120. Can somebody explain what newbie mistake I’m making?

For information: