I have the following parallel kernel reduction written on OpenCL. I just want to sum all the values from the BlockSum array. While using the work_group_reduce_add(BlockSum[GetIndex]); it works perfectly right, using the optimized code below (in comments) I read from https://www.fz-juelich.de/SharedDocs/Downloads/IAS/JSC/EN/slides/opencl/opencl-05-reduction.pdf?__blob=publicationFile (Slide 11) does not work correctly. What seems to be the error here? The global_work_size is set to {16,16} as well as the local_work_size (meaning 256 threads or cores in total for each workgroup or multiprocessor). In the case of the work_group_reduce_add I get 255 which is correct but with the optimized code, I get 0s.
__kernel void Reduction()
{
unsigned char GetThreadX = get_local_id(0); //it takes values from 0..15
unsigned char GetThreadY = get_local_id(1); //it takes values from 0..15
unsigned char GetGroup = get_local_size(0); //16
unsigned short BlockSum[256];
int SumOfAll= 0;
unsigned short GetIndex = GetThreadX + (GetGroup * GetThreadY); // takes values 0..255, group=16
BlockSum[GetIndex] = 1;
barrier(CLK_LOCAL_MEM_FENCE);
SumOfAll= work_group_reduce_add(BlockSum[GetIndex]); //works great
// --> BUT THE OPTIMIZED CODE BELOW DOES NOT SUM CORRECTLY
/*
for(unsigned short stride=128; stride>1; stride >>= 1) {
if(GetIndex < stride)
BlockSum[GetIndex] += BlockSum[GetIndex + stride];
barrier(CLK_LOCAL_MEM_FENCE);
}
if(GetIndex==0)
SumOfAll = BlockSum[0] + BlockSum[1];
barrier(CLK_LOCAL_MEM_FENCE);
*/
printf("SumOfAll=%d\n",SumOfAll);
}
That is private memory, which means it is separate for each work-item. Your commented out code expects/intends to use local memory, not private memory. The OpenCL primitive, on the other hand, is only using one element of private memory per work-item, and is doing its actual reduction work using some other method (e.g. its own internal declaration of local memory.)
I thought -by mistake of course- that the array is already declared as local, but after your helpful reply I can clearly say that was a mistake since it was declared as private.
Tomorrow as soon as I go back to my office, I will try to change the declaration of the array, and I will give it a try.
Dear Robert, as I already said in my previous post you had absolutely right. I was not very cautious about the memory address space, and I am pretty sure that I would not be thinking about that, without your precious help.
You saved my time!!! Thank you for that.
So the working kernel should be like this.
__kernel void Reduction()
{
unsigned char GetThreadX = get_local_id(0); //it takes values from 0..15
unsigned char GetThreadY = get_local_id(1); //it takes values from 0..15
unsigned char GetGroup = get_local_size(0); //16
//*********************************************************
//below was the offending code and the root of the problem
//**********************************************************
__local unsigned short BlockSum[256];
int SumOfAll;
//**********************************************************
unsigned short GetIndex = GetThreadX + (GetGroup * GetThreadY); // takes values 0..255, group=16
BlockSum[GetIndex] = 1;
barrier(CLK_LOCAL_MEM_FENCE);
//SumOfAll = work_group_reduce_add(BlockSum[GetIndex]);
// OPTIMIZED CODE BELOW NOW SUM UP CORRECTLY
for(unsigned short stride=128; stride>1; stride >>= 1) {
if(GetIndex < stride)
BlockSum[GetIndex] += BlockSum[GetIndex + stride];
barrier(CLK_LOCAL_MEM_FENCE);
}
if(GetIndex==0)
SumOfAll = BlockSum[0] + BlockSum[1];
barrier(CLK_LOCAL_MEM_FENCE);
printf("SumOfAll=%d\n",SumOfAll);
}