How can I optimize the following CUDA kernel?

__global__ void my_kernel
(
    float *A,
    float *B,
    float *C,
    float *D,
    int sz1,
    int sz2,
    int sz3,
    int sz4
)
{
    int i_l, j_l, k_l;
    float *A_lp, *B_lp;
    float *temp_lp, *temp2_lp;
    float sum0_l = 0;
    int index = 0;
    for (i_l = 0; i_l < sz2; i_l++)
    {
        A_lp = &A[i_l * sz1];
        B_lp = &B[0];
        for (j_l = 0; j_l < sz1; j_l++)
        {
            temp_lp = A_lp;
            temp2_lp = B_lp;
            sum0_l = 0;
            for (k_l = 0; k_l < sz3; k_l++)
            {
                sum0_l += temp_lp[0] * temp2_lp[0];
                temp_lp = temp_lp + (sz4 * sz1);
                temp2_lp = temp2_lp + sz1;
            }
            if (sum0_l < 0)
            {
                sum0_l *= C[0];
            }
            D[index++] = sum0_l;
            A_lp = A_lp + 1;
            B_lp = B_lp + 1;
        }
    }
}

This is the basic implementation of my kernel; A, B, C, and D are device pointers.
I want to optimize it as much as possible. Any suggestions or help would be greatly appreciated.
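If I unroll the pointer arithmetic into plain array indexing, each output element is essentially the following (this is my reading of the code above, so please correct me if I have mis-read it):

    // For every i in [0, sz2) and j in [0, sz1):
    //     sum = 0;
    //     for (k = 0; k < sz3; k++)
    //         sum += A[i*sz1 + j + k*sz4*sz1] * B[j + k*sz1];
    //     D[i*sz1 + j] = (sum < 0) ? sum * C[0] : sum;   // negative sums scaled by the scalar C[0]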

Hi, I'm very new to CUDA programming.
Can someone please help me with this?

If anyone knows how, please give me some pointers on implementing the above code using cuBLAS APIs.

  • In general, when posting code on these forums, I suggest using the code formatting tools: select the code in the edit window, then press the </> button at the top of the edit window. You can edit your post now to fix this.

  • Right now, your posted code doesn't have certain aspects I would expect of any CUDA kernel, such as use of the built-in thread and block index variables (threadIdx, blockIdx). There are tutorials available to learn CUDA (such as here), and my suggestion is that you build some basic understanding of how to write a CUDA kernel first.

  • cuBLAS is a library that implements a (largely) standard BLAS-like API. If you want to learn how to use cuBLAS, learning how to express your computation as a sequence of BLAS calls would be an excellent starting point. Likewise, if you believe cuBLAS is a possible option, describing the underlying linear algebra your code is actually doing would make it easier for others to help; a minimal call sketch is shown after this list.
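For illustration only, here is a minimal sketch of how a single-precision matrix multiply is issued through cuBLAS. It assumes (and this is an assumption, not something the posted loops necessarily reduce to) that the core of the computation can be phrased as a plain GEMM; the scaling of negative results by C[0] would still need a small separate kernel. Note that cuBLAS expects column-major storage.

    #include <cublas_v2.h>
    #include <cuda_runtime.h>

    // Hypothetical sizes: D (m x n) = A (m x k) * B (k x n), all column-major.
    // d_A, d_B, d_D are device pointers, as in the original post.
    void gemm_example(const float *d_A, const float *d_B, float *d_D,
                      int m, int n, int k)
    {
        cublasHandle_t handle;
        cublasCreate(&handle);

        const float alpha = 1.0f;   // D = alpha * A * B + beta * D
        const float beta  = 0.0f;

        cublasSgemm(handle,
                    CUBLAS_OP_N, CUBLAS_OP_N,   // no transposes
                    m, n, k,
                    &alpha,
                    d_A, m,                     // leading dimension of A
                    d_B, k,                     // leading dimension of B
                    &beta,
                    d_D, m);                    // leading dimension of D

        cublasDestroy(handle);
    }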

__global__ void my_kernel
(
    float *A,
    float *B,
    float *C,
    float *D,
    int sz1,
    int sz2,
    int sz3,
    int sz4
)

{
    int i_l, j_l, k_l;
    float *A_lp, *B_lp;
    float *temp_lp, *temp2_lp;
    float sum0_l = 0;
    int index = 0;
    for (i_l = 0; i_l < sz2; i_l++)
     {
       A_lp = &A[i_l * sz1];
       B_lp = &B[0];
       for (j_l = 0; j_l < sz1; j_l++)
      {
           temp_lp = A_lp;
           temp2_lp = B_lp;
           sum0_l = 0;
           for (k_l = 0; k_l < sz3; k_l++) 
           { 
                  sum0_l += temp_lp[0] * temp2_lp[0];
                 temp_lp = temp_lp + (sz4 * sz1);
                 temp2_lp = temp2_lp + sz1;
            }
           if (sum0_l < 0)
           {
                 sum0_l *= C[0];
           }
          D[index++] = sum0_l;
          A_lp = A_lp + 1;
          B_lp = B_lp + 1;
          }
    }
}

Thanks for your suggestion. This is my code.

Inside another function I'm calling this kernel with grid dim = 1 and block dim = 1. I'm not sure how to parallelize this using threads and blocks.
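One possible direction, sketched under the assumption that the indexing shown earlier in the thread is correct: give each (i, j) output element to its own thread via a 2D grid, keeping the k reduction serial inside the thread. The name my_kernel_parallel and the block dimensions are illustrative, not a tested drop-in replacement.

    // Sketch: one thread per element of D; the two outer loops become the grid.
    __global__ void my_kernel_parallel(const float *A, const float *B,
                                       const float *C, float *D,
                                       int sz1, int sz2, int sz3, int sz4)
    {
        int j = blockIdx.x * blockDim.x + threadIdx.x;  // inner-loop index (j_l)
        int i = blockIdx.y * blockDim.y + threadIdx.y;  // outer-loop index (i_l)
        if (i >= sz2 || j >= sz1) return;

        float sum = 0.0f;
        for (int k = 0; k < sz3; k++)
            sum += A[i * sz1 + j + k * sz4 * sz1] * B[j + k * sz1];

        if (sum < 0.0f)
            sum *= C[0];                                // scale negative results

        D[i * sz1 + j] = sum;
    }

    // Possible launch configuration (replacing <<<1,1>>>):
    // dim3 block(16, 16);
    // dim3 grid((sz1 + block.x - 1) / block.x, (sz2 + block.y - 1) / block.y);
    // my_kernel_parallel<<<grid, block>>>(A, B, C, D, sz1, sz2, sz3, sz4);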

Then you should start by learning some basic CUDA fundamentals. I've already given you a link that can help you get started; you can cover that whole course in 20-30 hours. At a minimum, you should understand how to write a proper parallel vector-add code from scratch before trying to tackle more complicated things.
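For reference, a minimal parallel vector-add kernel of the kind referred to above, with one thread per element, might look like this:

    // Each thread adds one pair of elements: c[i] = a[i] + b[i].
    __global__ void vector_add(const float *a, const float *b, float *c, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            c[i] = a[i] + b[i];
    }

    // Launch with enough blocks to cover all n elements:
    // int threads = 256;
    // int blocks = (n + threads - 1) / threads;
    // vector_add<<<blocks, threads>>>(d_a, d_b, d_c, n);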
