Local Reduction Sum Error in OpenCL

Hi, I have a problem with a buffer sum reduction. This is the kernel code:
__kernel void evaluate(__global const uint *pFirstBuffer,
__global const double *pSecondBuffer,
const uint N,
__global uint *pPartialSumBuffer,
__global uint *pOutTestBuffer)
{
uint gid = get_global_id(0);
uint lid = get_local_id(0);
uint grp_id = get_group_id(0);

pPartialSumBuffer[grp_id] = 0;

__local uint lcl_acc[LOCAL_WORK_SIZE]; //I call this kernel with a local work size of 256
lcl_acc[lid] = 0; //and a global size of 220928 (multiple of 256)
barrier(CLK_LOCAL_MEM_FENCE);

if (gid <= N)
{
double output = … //Here i do some simple computation using pSecondBuffer

if (output <= 0)
  pOutTestBuffer[gid] = 1;
else 
  pOutTestBuffer[gid] = 0;

if (output <= 0)
{
  lcl_acc[lid] = 1;
} 

}

uint priv_acc = lcl_acc[lid];
barrier(CLK_LOCAL_MEM_FENCE);

uint dist = LOCAL_WORK_SIZE;
while ( dist > 1 )
{
dist >>= 1;
if (gid <= N && lid < dist)
{
priv_acc += lcl_acc[lid + dist];
lcl_acc[lid] = priv_acc;
}
barrier(CLK_LOCAL_MEM_FENCE);
}

if (gid <= N && lid == 0)
{
pPartialSumBuffer[grp_id] = priv_acc;
}
}

Now, after kernel execution, if I check the number of ‘1’ values reducted in pPartialSumBuffer I obtain 2249, while if I check the number of ‘1’ in pOutTestBuffer I obtain 2248.
Why this difference? Something wrong in the way I reduce the local buffer in pPartialSumBuffer?
Tks to all.

The only thing that I can see that looks slightly suspicious is that all threads write to pPartialSum buffer at the start, which can conflict with the final write by thread 0 (there are barriers in between, but they don’t specify CLK_GLOBAL_MEM_FENCE). It would probably be safer (and also more efficient) to remove the initial write and to do the final write regardless of whether gid <= N (the initialization of lcl_acc should be enough to ensure that you will write a zero in this case anyway).

Other than that, have you tried checking which group is getting the wrong sum? If it is the last group then you should probably look at how the edge conditions are handled, but if it’s a group somewhere in the middle then it’s more likely a race condition like this.

Hi, tks for your reply, but I tried to avoid the initialization of pPartialSumbuffer and I write in it at the last instruction only if lid == 0 (I don’t check anymore gid <= N).
But It doesn’t work the same.
That’s so strange, how can I check whick work group gave me this error?

I tried to check which work group fails, it should be the last one, that counts 1, while the sum on the last part of pOutTestBuffer is 0. What do you mean when you say ‘edge conditions are handled’?
Tks a lot.