Help with an OpenCL kernel (2x2 submatrix max)

:rolleyes:
I am a starter with OpenCL, and I am trying to write a max22 kernel, which reads each 2x2 submatrix from a big matrix, computes the max of the four elements, and then fills the submatrix with that max value.
I completed the kernel with the help of NVIDIA's SDK, and it outputs the right results.
But the time I measured is much more than the C code takes.
For example, for a 2000x2000 matrix the C code takes 31ms, but my kernel takes more than 9000ms.
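For reference, the plain C version I am timing against looks roughly like this (a sketch; the name max22_c is just illustrative, and it assumes a row-major matrix):

```c
#include <stddef.h>

/* Fill each 2x2 block of a width x height row-major matrix
   with that block's maximum value. */
static void max22_c(float *m, size_t width, size_t height)
{
    for (size_t y = 0; y + 1 < height; y += 2) {
        for (size_t x = 0; x + 1 < width; x += 2) {
            float a = m[y * width + x];
            float b = m[y * width + x + 1];
            float c = m[(y + 1) * width + x];
            float d = m[(y + 1) * width + x + 1];
            float ab = a > b ? a : b;
            float cd = c > d ? c : d;
            float v  = ab > cd ? ab : cd;
            m[y * width + x]           = v;
            m[y * width + x + 1]       = v;
            m[(y + 1) * width + x]     = v;
            m[(y + 1) * width + x + 1] = v;
        }
    }
}
```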

My kernel is as follows:
#define BLOCKSIZE 2

// Fill each 2x2 block of the input matrix with the block's maximum
__kernel void MatrixMax22( __global float* Matrix,
                           uint width, uint height,
                           __global float* Matrix_dst)
{
    uint lx = get_local_id(0);
    uint ly = get_local_id(1);
    int gx = get_group_id(0);
    int gy = get_group_id(1);

    // starting index in the global array for this row of sub-matrices
    uint iSubA = BLOCKSIZE * gy * width;

    // number of groups in dimension 0
    int n = get_num_groups(0);

    // for each block
    for(int i = 0; i < n; i++)
    {
        // declare local memory for one sub-matrix
        __local float tA[BLOCKSIZE][BLOCKSIZE];

        // copy a portion of the input matrix into the sub-matrix
        tA[ly][lx] = Matrix[ly*width + lx + (iSubA + i*BLOCKSIZE)];

        // wait for all work-items in the group to finish copying
        barrier(CLK_LOCAL_MEM_FENCE);

        // find the max22 value
        __local float a, b, c, d, a1, b1, result;
        a = tA[0][0];
        b = tA[0][1];
        c = tA[1][0];
        d = tA[1][1];
        a1 = max(a, b);
        b1 = max(c, d);
        result = max(a1, b1);
        tA[0][0] = result;
        tA[0][1] = result;
        tA[1][0] = result;
        tA[1][1] = result;
        Matrix_dst[ly*width + lx + (iSubA + i*BLOCKSIZE)] = tA[ly][lx];
    }
}

I know it has mistakes; I need help correcting them.
Thank you!

Not sure I understand your code fully, but I suspect you are mixing up blocks and threads. Basically you have a parallel problem: you assign N threads to it and divide those N threads into M work-groups. So why is there a loop over blocks (for(int i=0; i< n;i++)) when the threads of all the blocks together should already cover the whole problem in parallel, each element exactly once? Have a look at the NVIDIA OpenCL Programming Guide; there is a matrix multiplication example described there.

Btw, I wouldn't use a block size of 2, because it is too small. I think there are 32 cores per multiprocessor (on NVIDIA cards), so block sizes lower than 32 are usually a waste of time and computational resources.

I am surprised you could compile code like this; I think you are supposed to define local arrays in the scope of the kernel, not inside a loop or an if:
// declare local memory for each sub matrix
__local float tA[BLOCKSIZE][BLOCKSIZE];
Local memory is a scarce resource; declare it only once, outside the for loop, and reuse it.
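To make the blocks-vs-threads point concrete, here is a plain-C simulation of a 2D NDRange (my sketch, not OpenCL itself): with one BLOCKSIZE x BLOCKSIZE work-group per block, every matrix element is visited exactly once without any loop over groups.

```c
#define BLOCKSIZE 2

/* Simulate a 2D NDRange of (width x height) work-items split into
   BLOCKSIZE x BLOCKSIZE work-groups: count how often each element is touched. */
static void count_visits(int width, int height, int *visits)
{
    for (int gy = 0; gy < height / BLOCKSIZE; gy++)          /* get_group_id(1) */
        for (int gx = 0; gx < width / BLOCKSIZE; gx++)       /* get_group_id(0) */
            for (int ly = 0; ly < BLOCKSIZE; ly++)           /* get_local_id(1) */
                for (int lx = 0; lx < BLOCKSIZE; lx++) {     /* get_local_id(0) */
                    int x = gx * BLOCKSIZE + lx;             /* get_global_id(0) */
                    int y = gy * BLOCKSIZE + ly;             /* get_global_id(1) */
                    visits[y * width + x]++;
                }
}
```

With the extra for(int i=0; i< n;i++) loop from the original kernel, each count would be n instead of 1, which is exactly the redundant work that makes it slow.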

Thanks for your help!

And I have some doubts about work-groups. E.g. a 1024x1024 matrix is divided into 4096 work-groups with a work-group size of 256. I know that all the work-items in one work-group execute at the same time, but will the 4096 work-groups execute sequentially or in parallel?

I have understood something about parallel problems now, and I think my algorithm is not suited to OpenCL parallel computation.

My code computes the max value of every 2x2 submatrix and then fills the submatrix with that max value. So I think the process of computing the max value can never be parallelized: it must be a sequential comparison among the 4 elements of the 2x2 submatrix, with 3 calls to the max() function.

What's your opinion?

Thank you!

You are welcome :-)

Work-groups are conceptually parallel. However, if there isn't a sufficient number of multiprocessors, some of them must be executed sequentially. You can't say which will be done sooner and which later, so imagine them as all running in parallel. (For more details see CUDA_Occupancy_calculator.xls.)

I think your problem is perfectly parallel. For example: for a 1024x1024 input matrix you need every work-item to compute the maximum of one 2x2 submatrix, right? So you just specify the global size as 512x512, and the local size doesn't matter to you (no need to synchronize between threads). A naive parallel kernel can look similar to this (suppose your source matrix M is stored row by row; note that the input row stride is then 2*width, twice the 512-wide global size):

__kernel void subMax(__global float* Matrix, __global float* Matrix_dst)
{
    int GID0 = get_global_id(0);
    int GID1 = get_global_id(1);
    int width = get_global_size(0);   // output width; input row stride is 2*width
    int srcWidth = 2 * width;

    float a, b, c, d;
    a = Matrix[2*GID1*srcWidth + 2*GID0];
    b = Matrix[2*GID1*srcWidth + 2*GID0 + 1];
    c = Matrix[2*GID1*srcWidth + srcWidth + 2*GID0];
    d = Matrix[2*GID1*srcWidth + srcWidth + 2*GID0 + 1];

    a = max(a, b);
    a = max(a, c);
    a = max(a, d);

    Matrix_dst[GID1*width + GID0] = a;
}

As I said, it is a naive solution. You should pay more attention to how the data is read from global memory, so rearrange the input matrix to get a better access pattern. But you can start with this working naive solution and make it run faster afterwards ;-)
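A quick host-side check of the index arithmetic (plain C mirroring the kernel body; it assumes, as above, that the input row stride is twice the output width, since the global size is half the matrix size in each dimension):

```c
/* Maximum of the 2x2 input block read by work-item (GID0, GID1);
   'width' is the output (half-size) width, so the input stride is 2*width. */
static float block_max(const float *Matrix, int width, int GID0, int GID1)
{
    int srcWidth = 2 * width;
    float a = Matrix[2 * GID1 * srcWidth + 2 * GID0];
    float b = Matrix[2 * GID1 * srcWidth + 2 * GID0 + 1];
    float c = Matrix[2 * GID1 * srcWidth + srcWidth + 2 * GID0];
    float d = Matrix[2 * GID1 * srcWidth + srcWidth + 2 * GID0 + 1];
    float ab = a > b ? a : b;
    float cd = c > d ? c : d;
    return ab > cd ? ab : cd;
}
```

For a row-major 4x4 matrix holding 1..16, the four block maxima come out as 6, 8, 14, and 16, which is what the kernel should write into the 2x2 output.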

I think it is a good idea; I was confused about the concepts of parallel and sequential. I have finished my kernel but it still has some problems, so I will try running your code, which could be made even better with local memory.

Thank you!
