Trouble with local memory and work-items

Hello everyone!
I am a beginner with OpenCL. I have some questions about local memory and work-items in two dimensions.
I have written a kernel that computes the 2x2 max value over an original matrix of float data (size: 1024x1024).
Generally, in order to accelerate the computation, we need to transfer a submatrix from global memory into local memory.
For example, in the configuration step, I have set the sizes as follows:
global [0]=1024;
global [1]=1024;
local [0]=8;
local [1]=8;
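
A small C sketch of the arithmetic behind such a launch configuration. It assumes OpenCL 1.x rules, where each global size has to be a multiple of the matching local size; `count_work_groups` and `tile_bytes` are hypothetical helper names, not OpenCL calls.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: number of work-groups along one dimension, or 0 when
   the global size is not a multiple of the local size (a launch that
   OpenCL 1.x would reject). */
size_t count_work_groups(size_t global_size, size_t local_size)
{
    assert(local_size > 0);
    if (global_size % local_size != 0)
        return 0;
    return global_size / local_size;
}

/* Hypothetical helper: bytes of local memory taken by one tile of floats. */
size_t tile_bytes(size_t tile_w, size_t tile_h)
{
    return tile_w * tile_h * sizeof(float);
}
```

For a global size of 1024 and a local size of 8 this gives 128 work-groups per dimension, so 128 x 128 work-groups of 64 work-items each.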
As the next step, I want to compute the max value of every four elements (in the red circle) in local memory, as in figure 1.

Figure 1
In addition, the work-items are laid out over the local memory as in figure 2.

Figure 2
I think that when I compute the max value of every four elements, I only use the first work-item of each group of four elements. For example, work-items [0,0], [0,2], [0,4], [0,6] … are the ones I use to compute the max value of every four elements. The other work-items are ignored in the parallel computation.
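
To put a number on the idle work-items, here is a minimal C sketch that emulates the figure 2 layout on the host. It assumes one work-item per element and that only the work-item on the top-left corner of each 2x2 block does the comparison; the function names are made up for illustration.

```c
#include <assert.h>

/* Emulates the test a kernel would make on get_local_id(0) and
   get_local_id(1): work-item (lx, ly) is active only when it sits on the
   top-left corner of a 2x2 block. */
int is_active_fig2(int lx, int ly)
{
    return (lx % 2 == 0) && (ly % 2 == 0);
}

/* Counts the active work-items in one local_w x local_h work-group. */
int count_active_fig2(int local_w, int local_h)
{
    assert(local_w % 2 == 0 && local_h % 2 == 0);
    int active = 0;
    for (int ly = 0; ly < local_h; ++ly)
        for (int lx = 0; lx < local_w; ++lx)
            active += is_active_fig2(lx, ly);
    return active;
}
```

For an 8x8 work-group this counts 16 active work-items out of 64, so three quarters of the group only take part in the copy.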

I want to know if there is another way to allocate the work-items in local memory, as in figure 3.


Figure 3
Is it possible? In addition, how can I realize figure 3 in an OpenCL kernel?
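
One possible reading of figure 3 is that every work-item owns a whole 2x2 block, so no work-item is idle. The C sketch below emulates that indexing on the host under this assumption; `bx` and `by` stand in for `get_global_id(0)` and `get_global_id(1)`, and all function names are hypothetical.

```c
#include <assert.h>

static float max4(float a, float b, float c, float d)
{
    float m = a;
    if (b > m) m = b;
    if (c > m) m = c;
    if (d > m) m = d;
    return m;
}

/* Work done by ONE work-item under the assumed figure 3 layout.
   (bx, by) is the block index; in a kernel it would come from
   get_global_id(0) and get_global_id(1). */
void max2x2_work_item(const float *src, float *dst, int width, int bx, int by)
{
    int i00 = (2 * by) * width + 2 * bx;  /* top-left element of the block */
    int i10 = i00 + width;                /* element directly below it */
    float m = max4(src[i00], src[i00 + 1], src[i10], src[i10 + 1]);
    /* write the block maximum to all four positions */
    dst[i00] = m;
    dst[i00 + 1] = m;
    dst[i10] = m;
    dst[i10 + 1] = m;
}

/* Emulates the NDRange: the global size would be (width / 2, height / 2). */
void max2x2_all(const float *src, float *dst, int width, int height)
{
    assert(width % 2 == 0 && height % 2 == 0);
    for (int by = 0; by < height / 2; ++by)
        for (int bx = 0; bx < width / 2; ++bx)
            max2x2_work_item(src, dst, width, bx, by);
}
```

With this layout a 4x4 work-group covers an 8x8 tile. The sketch reads straight from the source array because non-overlapping 2x2 blocks share no input elements; whether a local-memory copy still helps on a given device is something to measure.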
In this situation, to process an 8x8 submatrix in local memory, should I change the local size into

That would mean that, with the maximum number of work-items (512) in one work-group, I could process four times as much data at the same time as with the method illustrated in figure 2.
Is that right? Maybe I have made some mistakes.
Also, when there is only one data source, how can I process it in several ways at the same time (like max2x2, max4x4, max8x8 and so on)?
Please give me some advice, thank you!
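
For the max2x2, max4x4, max8x8 case, one approach is to reuse results: a 4x4 maximum is the maximum of four 2x2 maxima, and so on. The C sketch below shows this on the host for a single 8x8 tile. It assumes a reduced output (one value per block) rather than a same-size output, and the function names are invented for this example. In a kernel, the loop bodies would be spread over work-items, with a barrier between levels if the levels live in local memory.

```c
#include <assert.h>

/* One reduction level: dst(y, x) = max of the 2x2 block of src at (2y, 2x).
   dst must hold (src_w / 2) * (src_h / 2) floats. */
void max_pool_2x2(const float *src, float *dst, int src_w, int src_h)
{
    assert(src_w % 2 == 0 && src_h % 2 == 0);
    int dst_w = src_w / 2;
    for (int y = 0; y < src_h / 2; ++y) {
        for (int x = 0; x < dst_w; ++x) {
            const float *p = src + (2 * y) * src_w + 2 * x;
            float top = p[0] > p[1] ? p[0] : p[1];
            float bot = p[src_w] > p[src_w + 1] ? p[src_w] : p[src_w + 1];
            dst[y * dst_w + x] = top > bot ? top : bot;
        }
    }
}

/* Builds all three levels for one 8x8 tile from a single read of the data:
   m2 holds the 16 2x2 maxima, m4 the 4 4x4 maxima, m8 the 8x8 maximum. */
void max_pyramid_8x8(const float *tile, float *m2, float *m4, float *m8)
{
    max_pool_2x2(tile, m2, 8, 8);  /* max2x2 */
    max_pool_2x2(m2, m4, 4, 4);    /* max4x4, built from the max2x2 results */
    max_pool_2x2(m4, m8, 2, 2);    /* max8x8, built from the max4x4 results */
}
```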

This is the kernel code that implements the method illustrated in figure 2.

#define BLOCKSIZE 16

// Assumes the local work size is BLOCKSIZE x BLOCKSIZE and that width and
// height are multiples of BLOCKSIZE.
__kernel void MatrixMax22refine1(const __global float* Matrix,
                                 uint width, uint height,
                                 __global float* Matrix_dst)
{
    uint lx = get_local_id(0);
    uint ly = get_local_id(1);
    uint gx = get_group_id(0);
    uint gy = get_group_id(1);

    // top-left corner of the 2x2 block handled by this work-item
    uint x = 2 * lx;
    uint y = 2 * ly;

    // calculate the starting index in the global array of this work-group's sub matrix
    uint iSubA = BLOCKSIZE * gy * width;  // row offset
    uint iSubB = BLOCKSIZE * gx;          // column offset
    uint idx = iSubA + iSubB + ly * width + lx;

    // declare local memory for the sub matrix (padded by one element)
    __local float tA[BLOCKSIZE + 1][BLOCKSIZE + 1];

    // copy one element of the input matrix into the sub matrix
    tA[ly][lx] = Matrix[idx];
    // wait for all work-items in the group to finish copying
    barrier(CLK_LOCAL_MEM_FENCE);

    // compute the max22 value; only a quarter of the work-items pass this test
    if (x < get_local_size(0) && y < get_local_size(1))
    {
        // private variables, one copy per work-item
        float a = tA[y][x];
        float b = tA[y][x + 1];
        float c = tA[y + 1][x];
        float d = tA[y + 1][x + 1];
        float m = max(max(a, b), max(c, d));
        tA[y][x] = m;
        tA[y][x + 1] = m;
        tA[y + 1][x] = m;
        tA[y + 1][x + 1] = m;
    }
    // wait until every 2x2 block in the sub matrix has been updated
    barrier(CLK_LOCAL_MEM_FENCE);

    Matrix_dst[idx] = tA[ly][lx];
}