
I am new to OpenCL, and I am trying to write a max22 kernel that reads each 2x2 submatrix from a large matrix, computes the maximum of its four elements, and then fills the submatrix with that maximum.
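
To make the operation concrete, a minimal plain-C sketch of max22 could look like the following. It assumes row-major storage and an even width and height; the name `max22_reference` is hypothetical, and this is not necessarily the C code whose timing is quoted in this post.

```c
#include <stddef.h>

/* Hypothetical CPU reference (assumption: row-major layout, even width
   and height). For every 2x2 block of src, the block's maximum is
   written to all four matching positions of dst. */
static void max22_reference(const float *src, float *dst,
                            size_t width, size_t height)
{
    for (size_t y = 0; y < height; y += 2) {
        for (size_t x = 0; x < width; x += 2) {
            /* The four elements of the current 2x2 block. */
            const float a = src[y * width + x];
            const float b = src[y * width + x + 1];
            const float c = src[(y + 1) * width + x];
            const float d = src[(y + 1) * width + x + 1];

            /* Maximum of the four elements. */
            const float ab = a > b ? a : b;
            const float cd = c > d ? c : d;
            const float m = ab > cd ? ab : cd;

            /* Fill the whole block with that maximum. */
            dst[y * width + x] = m;
            dst[y * width + x + 1] = m;
            dst[(y + 1) * width + x] = m;
            dst[(y + 1) * width + x + 1] = m;
        }
    }
}
```

Each 2x2 block is read once and written once, so a 2000x2000 matrix needs only one million block visits. That gives a baseline for comparing how much work a kernel does per block.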

I have completed the kernel with the help of NVIDIA's SDK, and it outputs the correct results.

But when I measure the elapsed time, it is longer than the time the C code takes.

For example, for a 2000x2000 matrix the C code takes 31 ms, but my kernel takes more than 9000 ms.

My kernel is as follows:

```
#define BLOCKSIZE 2

// Matrix multiplication kernel called by MatrixMul()
__kernel void MatrixMax22(__global float* Matrix,
                          uint width, uint height,
                          __global float* Matrix_dst)
{
    uint lx = get_local_id(0);
    uint ly = get_local_id(1);
    int gx = get_group_id(0);
    int gy = get_group_id(1);

    // calculate the starting index of the global array for each sub matrix
    uint iSubA = BLOCKSIZE * gy * width;
    // get the number of groups in dimension 0
    int n = get_num_groups(0);
    // for each block
    for (int i = 0; i < n; i++)
    {
        // declare local memory for each sub matrix
        __local float tA[BLOCKSIZE][BLOCKSIZE];
        // copy a portion of the input matrix into the sub matrix
        tA[ly][lx] = Matrix[ly*width + lx + (iSubA + i*BLOCKSIZE)];
        // wait for all work-items in the group to finish copying
        barrier(CLK_LOCAL_MEM_FENCE);
        // find the max22 value
        __local float a, b, c, d, a1, b1, result;
        a = tA[0][0];
        b = tA[0][1];
        c = tA[1][0];
        d = tA[1][1];
        a1 = max(a, b);
        b1 = max(c, d);
        result = max(a1, b1);
        tA[0][0] = result;
        tA[0][1] = result;
        tA[1][0] = result;
        tA[1][1] = result;
        Matrix_dst[ly*width + lx + (iSubA + i*BLOCKSIZE)] = tA[ly][lx];
    }
}
```

I know it has mistakes, and I need help correcting it.

Thank you!