Hello everybody

I am trying to get familiar with OpenCL. To do so, I tried to write a kernel that is supposed to calculate the sum of the elements of a vector. I used the “Parallel reduction without shared memory bank conflicts” example from the “OpenCL Programming for the CUDA Architecture” document provided by NVIDIA.

My GPU: NVIDIA Quadro NVS 140M

The Kernel

```
__kernel void sum(__global const float *A, __global float *C, uint size, __local float *L) {
    float sum = 0;
    for (int i = get_global_id(0); i < size; i += get_local_size(0))
        sum += A[i];
    L[get_local_id(0)] = sum;
    for (uint c = get_local_size(0) / 2; c > 0; c /= 2)
    {
        barrier(CLK_LOCAL_MEM_FENCE);
        if (c < get_local_id(0))
            L[get_local_id(0)] += L[get_local_id(0) + c];
    }
    if (get_local_id(0) == 0)
        C[0] = L[0];
    barrier(CLK_LOCAL_MEM_FENCE);
}
```

The local work size is 256, and the global work size is the vector size rounded up to the next multiple of 256.
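To make the work sizes concrete, here is a minimal host-side sketch of that rounding in C. The helper name `round_up_to_multiple` is only illustrative and is not taken from the actual host program, which is not shown in this post.

```c
#include <stddef.h>

/* Hypothetical helper: round n up to the nearest multiple of local_size.
   Assumes local_size > 0. */
static size_t round_up_to_multiple(size_t n, size_t local_size)
{
    size_t remainder = n % local_size;
    return remainder == 0 ? n : n + (local_size - remainder);
}
```

With 1000 elements and a local work size of 256, this gives a global work size of 1024, which means the kernel runs as 4 work-groups of 256 work-items each.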

If I run my program with a vector containing 1000 elements that are all 1, the kernel returns 4 (instead of 1000, of course). I assume it has something to do with the barrier.

Additionally, the kernel is at least one order of magnitude slower than the CPU.
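For reference, the result the kernel is expected to match is a plain sequential sum. The sketch below assumes the CPU baseline is a simple loop of this kind; the function name `sum_sequential` is hypothetical, since the actual CPU code is not shown here.

```c
#include <stddef.h>

/* Hypothetical CPU reference: add up n floats one after another. */
static float sum_sequential(const float *a, size_t n)
{
    float total = 0.0f;
    for (size_t i = 0; i < n; ++i)
        total += a[i];
    return total;
}
```

For a vector of 1000 elements that are all 1, this yields 1000, which is the value the kernel should write to `C[0]`.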

I would be grateful for any advice on what the problem could be.