Hello everyone!

I am a beginner with OpenCL. I have some questions about local memory and work-items in two dimensions.

I have written a kernel that computes the maximum of every 2x2 block of a matrix of float data (size: 1024x1024).

In general, to accelerate the computation, we need to transfer a submatrix from global memory into local memory.

For example, on the host side I have configured the work sizes as follows:

global[0] = 1024;

global[1] = 1024;

local[0] = 8;

local[1] = 8;

Next, I want to compute the max value of every four elements (in the red circle) in local memory, as shown in figure 1.

Figure 1

In addition, the work-items are laid out over the local memory as shown in figure 2.

Figure 2

I think that when I compute the max value of every four elements, I only use the first work-item of each group of four. For example, work-items [0,0], [0,2], [0,4], [0,6], ... are the ones I use to compute the max value of every four elements. The other work-items are idle during the parallel computation.

I want to know whether there is another way to assign the work-items to the local memory, as in figure 3.

Figure 3

Is it possible? If so, how can figure 3 be realized in an OpenCL kernel?

In this situation, to process an 8x8 submatrix in local memory, should I change the local size to

local[0] = 4;

local[1] = 4;

That would mean that, with the maximum number of work-items in one work-group (512), I can process four times as much data at the same time as with the method illustrated in figure 2.

Is that right? Maybe I have made some mistakes.
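To make the figure 3 idea concrete, here is a minimal sketch in plain C rather than OpenCL C, so it can be compiled without an OpenCL runtime. It assumes an 8x8 tile and a 4x4 work-group in which every work-item owns exactly one 2x2 block. The names `TILE`, `LOCAL`, `work_item_max22` and `tile_max22` are made up for illustration, and the two loops only stand in for work-items that would run in parallel on the device. Unlike the kernel in this post, the sketch writes one result per 2x2 block instead of copying the maximum back to all four positions.

```c
#include <assert.h>

#define TILE  8           /* edge of the tile held in local memory (8x8 in the post) */
#define LOCAL (TILE / 2)  /* work-items per edge: 4x4, one per 2x2 block (assumption) */

static float max_of_4(float a, float b, float c, float d)
{
    float ab = a > b ? a : b;
    float cd = c > d ? c : d;
    return ab > cd ? ab : cd;
}

/* Body of one emulated work-item. In a real kernel lx and ly would come
   from get_local_id(0) and get_local_id(1). */
static float work_item_max22(const float *tile, int lx, int ly)
{
    assert(lx >= 0 && lx < LOCAL && ly >= 0 && ly < LOCAL);
    int x = 2 * lx;                        /* left column of this 2x2 block */
    int y = 2 * ly;                        /* top row of this 2x2 block */
    const float *p = tile + y * TILE + x;  /* top-left element of the block */
    return max_of_4(p[0], p[1], p[TILE], p[TILE + 1]);
}

/* Emulates one LOCAL x LOCAL work-group: every (lx, ly) pair is one
   work-item, and no work-item is idle. */
void tile_max22(const float *tile, float *out)
{
    for (int ly = 0; ly < LOCAL; ++ly)
        for (int lx = 0; lx < LOCAL; ++lx)
            out[ly * LOCAL + lx] = work_item_max22(tile, lx, ly);
}
```

With this mapping, 16 work-items cover the 64 elements of one tile, so each work-item handles four elements and none is idle, which is the "four times" figure mentioned above. In a kernel, each work-item would also have to load its four elements into local memory, not just one, before the barrier.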

Also, when there is only one data source, how can I process it at several block sizes at the same time (max 2x2, max 4x4, max 8x8 and so on)?
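One possible way to get several block sizes from a single tile is to build them level by level: the 4x4 maxima can be computed from the 2x2 maxima, and the 8x8 maximum from the 4x4 maxima, so the source data is read only once. The sketch below is plain C under that assumption, not a tested OpenCL kernel, and the names `max22_level` and `tile_max_pyramid` are hypothetical. On a device, each level would be separated from the next by a `barrier(CLK_LOCAL_MEM_FENCE)`.

```c
#include <assert.h>

#define PYR_TILE 8  /* tile edge, assumed to be a power of two */

static float max_of_2(float a, float b) { return a > b ? a : b; }

/* Reduces an n x n grid (row-major, n even) to an (n/2) x (n/2) grid by
   taking the maximum of every 2x2 block. */
static void max22_level(const float *in, int n, float *out)
{
    assert(n >= 2 && n % 2 == 0);
    int h = n / 2;
    for (int y = 0; y < h; ++y)
        for (int x = 0; x < h; ++x) {
            const float *p = in + (2 * y) * n + 2 * x;  /* top-left of block */
            out[y * h + x] = max_of_2(max_of_2(p[0], p[1]),
                                      max_of_2(p[n], p[n + 1]));
        }
}

/* Computes all three levels for one 8x8 tile:
   max2 holds 4x4 values (maxima of 2x2 blocks),
   max4 holds 2x2 values (maxima of 4x4 blocks),
   max8 holds 1 value   (maximum of the whole tile). */
void tile_max_pyramid(const float *tile, float *max2, float *max4, float *max8)
{
    max22_level(tile, PYR_TILE, max2);
    max22_level(max2, PYR_TILE / 2, max4);
    max22_level(max4, PYR_TILE / 4, max8);
}
```

Each level reuses the previous one, so the 4x4 and 8x8 results cost only a few extra comparisons on top of the 2x2 pass. Note that fewer and fewer work-items have work to do at the higher levels.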

Please give me some advice. Thank you!

This is the kernel code that implements the method illustrated in figure 2.

```
// Tile edge; must match the local work size set on the host (8x8 above)
#define BLOCKSIZE 8

__kernel void MatrixMax22refine1(const __global float* Matrix,
                                 uint width, uint height,
                                 __global float* Matrix_dst)
{
    uint lx = get_local_id(0);
    uint ly = get_local_id(1);
    uint gx = get_group_id(0);
    uint gy = get_group_id(1);

    // top-left corner of the 2x2 block this work-item would process
    uint x = 2 * lx;
    uint y = 2 * ly;

    // calculate the starting index of the global array for this sub matrix
    uint iSubA = BLOCKSIZE * gy * width;   // row offset of the tile
    uint iSubB = BLOCKSIZE * gx;           // column offset of the tile
    uint idx   = iSubA + iSubB + ly * width + lx;

    // declare local memory for the sub matrix (padded by one)
    __local float tA[BLOCKSIZE+1][BLOCKSIZE+1];

    // copy one element per work-item into the sub matrix
    tA[ly][lx] = Matrix[idx];

    // wait for all work-items in the group to finish copying
    barrier(CLK_LOCAL_MEM_FENCE);

    // compute the max22 value; only work-items with lx, ly < BLOCKSIZE/2
    // do any work here (figure 2), the others are idle
    if (x < BLOCKSIZE && y < BLOCKSIZE)
    {
        // private variables: __local ones would be shared by the whole group
        float a = tA[y][x];
        float b = tA[y][x+1];
        float c = tA[y+1][x];
        float d = tA[y+1][x+1];
        float m = max(max(a, b), max(c, d));
        tA[y][x]     = m;
        tA[y][x+1]   = m;
        tA[y+1][x]   = m;
        tA[y+1][x+1] = m;
    }

    // wait until every 2x2 block has been written before reading it back
    barrier(CLK_LOCAL_MEM_FENCE);

    Matrix_dst[idx] = tA[ly][lx];
}
```