Parallel Reduction does not work correctly

g_papaioannou · November 11, 2021, 8:23am

I have the following parallel kernel reduction written on OpenCL. I just want to sum all the values from the BlockSum array. While using the work_group_reduce_add(BlockSum[GetIndex]); it works perfectly right, using the optimized code below (in comments) I read from https://www.fz-juelich.de/SharedDocs/Downloads/IAS/JSC/EN/slides/opencl/opencl-05-reduction.pdf?__blob=publicationFile (Slide 11) does not work correctly. What seems to be the error here? The global_work_size is set to {16,16} as well as the local_work_size (meaning 256 threads or cores in total for each workgroup or multiprocessor). In the case of the work_group_reduce_add I get 255 which is correct but with the optimized code, I get 0s.

__kernel void Reduction()
{
        unsigned char GetThreadX = get_local_id(0); //it takes values from 0..15
        unsigned char GetThreadY = get_local_id(1); //it takes values from 0..15
        unsigned char GetGroup   = get_local_size(0); //16
        unsigned short  BlockSum[256];      
        int SumOfAll= 0;            
        
        unsigned short GetIndex = GetThreadX + (GetGroup * GetThreadY); // takes values 0..255, group=16        
        
        BlockSum[GetIndex] = 1;             
        barrier(CLK_LOCAL_MEM_FENCE);       
        
        SumOfAll= work_group_reduce_add(BlockSum[GetIndex]); //works great  
        
        // --> BUT THE OPTIMIZED CODE BELOW DOES NOT SUM CORRECTLY
        /*
        for(unsigned short stride=128; stride>1; stride >>= 1) {
            
            if(GetIndex < stride)
                BlockSum[GetIndex] += BlockSum[GetIndex + stride];          
            barrier(CLK_LOCAL_MEM_FENCE);           
        }               
        if(GetIndex==0)             
            SumOfAll = BlockSum[0] + BlockSum[1];       
        barrier(CLK_LOCAL_MEM_FENCE);
        */
        printf("SumOfAll=%d\n",SumOfAll);
}

Robert_Crovella · November 11, 2021, 2:08pm

I suggest you learn how to write a reduction using OpenCL local memory. This is not how you declare local memory:

That is private memory, which means it is separate for each work-item. Your commented out code expects/intends to use local memory, not private memory. The OpenCL primitive, on the other hand, is only using one element of private memory per work-item, and is doing its actual reduction work using some other method (e.g. its own internal declaration of local memory.)

g_papaioannou · November 11, 2021, 5:21pm

Thank you. I think you have right!!!

I thought -by mistake of course- that the array is already declared as local, but after your helpful reply I can clearly say that was a mistake since it was declared as private.

Tomorrow as soon as I go back to my office, I will try to change the declaration of the array, and I will give it a try.

Anyway,I will let you know about the results.

Best regards,

g_papaioannou · November 12, 2021, 8:49am

Dear Robert, as I already said in my previous post you had absolutely right. I was not very cautious about the memory address space, and I am pretty sure that I would not be thinking about that, without your precious help.

You saved my time!!! Thank you for that.

So the working kernel should be like this.

__kernel void Reduction()
{
        unsigned char GetThreadX = get_local_id(0); //it takes values from 0..15
        unsigned char GetThreadY = get_local_id(1); //it takes values from 0..15
        unsigned char GetGroup   = get_local_size(0); //16

        //*********************************************************
        //below was the offending code and the root of the problem 
        //**********************************************************
        __local unsigned short  BlockSum[256];      
        int SumOfAll;           
        //**********************************************************
        
        unsigned short GetIndex = GetThreadX + (GetGroup * GetThreadY); // takes values 0..255, group=16        
        
        BlockSum[GetIndex] = 1;             
        barrier(CLK_LOCAL_MEM_FENCE);       
        
        //SumOfAll = work_group_reduce_add(BlockSum[GetIndex]); 
        
        // OPTIMIZED CODE BELOW NOW SUM UP CORRECTLY
        
        for(unsigned short stride=128; stride>1; stride >>= 1) {
            
            if(GetIndex < stride)
                BlockSum[GetIndex] += BlockSum[GetIndex + stride];          
            barrier(CLK_LOCAL_MEM_FENCE);           
        }               
        if(GetIndex==0)             
            SumOfAll = BlockSum[0] + BlockSum[1];       
        barrier(CLK_LOCAL_MEM_FENCE);
        
        printf("SumOfAll=%d\n",SumOfAll);
        


        
    
}

Topic		Replies	Views
Parallel Sum Reduction Problem CUDA Programming and Performance	5	4596	July 29, 2011
Local Reduction Sum Error in OpenCL CUDA Programming and Performance	3	1695	February 6, 2013
Simple Inefficient Parallel Addition CUDA Programming and Performance	5	3276	April 10, 2009
porting opencl pseudocode to CUDA CUDA Programming and Performance	11	3276	June 5, 2011
Parallel reduction problem CUDA Programming and Performance	1	5146	November 29, 2010
Need help on Paralle Reduction CUDA Programming and Performance	0	1110	March 8, 2010
Reduction questions(newbie-ish) CUDA Programming and Performance	7	1910	January 14, 2009
2D Parallel Reduction CUDA Programming and Performance	1	6634	February 3, 2010
Simple (honest!) change to parallel reduction example yields bizarre result? CUDA Programming and Performance	1	2502	December 26, 2011
I want to ask parallel reduction.... CUDA Programming and Performance	0	516	January 30, 2019

Parallel Reduction does not work correctly

Related topics