```
__global__ void my_kernel
(
    float *A,
    float *B,
    float *C,
    float *D,
    int sz1,
    int sz2,
    int sz3,
    int sz4
)
{
    int i_l, j_l, k_l;
    float *A_lp, *B_lp;
    float *temp_lp, *temp2_lp;
    float sum0_l = 0;
    int index = 0;
    for (i_l = 0; i_l < sz2; i_l++)
    {
        A_lp = &A[i_l * sz1];
        B_lp = &B[0];
        for (j_l = 0; j_l < sz1; j_l++)
        {
            temp_lp = A_lp;
            temp2_lp = B_lp;
            sum0_l = 0;
            for (k_l = 0; k_l < sz3; k_l++)
            {
                sum0_l += temp_lp[0] * temp2_lp[0];
                temp_lp = temp_lp + (sz4 * sz1);
                temp2_lp = temp2_lp + sz1;
            }
            if (sum0_l < 0)
            {
                sum0_l *= C[0];
            }
            D[index++] = sum0_l;
            A_lp = A_lp + 1;
            B_lp = B_lp + 1;
        }
    }
}
```

This is the basic implementation of my kernel function. A, B, C, and D are device pointers.

I want to optimize this as much as possible. Any suggestions or help would be appreciated.

Hi, I’m very new to CUDA programming.

Can someone please help me with this?

If anyone knows how, please give me some pointers on implementing the above code using the cuBLAS APIs.

```
__global__ void my_kernel
(
    float *A,
    float *B,
    float *C,
    float *D,
    int sz1,
    int sz2,
    int sz3,
    int sz4
)
{
    int i_l, j_l, k_l;
    float *A_lp, *B_lp;
    float *temp_lp, *temp2_lp;
    float sum0_l = 0;
    int index = 0;
    for (i_l = 0; i_l < sz2; i_l++)
    {
        A_lp = &A[i_l * sz1];
        B_lp = &B[0];
        for (j_l = 0; j_l < sz1; j_l++)
        {
            temp_lp = A_lp;
            temp2_lp = B_lp;
            sum0_l = 0;
            for (k_l = 0; k_l < sz3; k_l++)
            {
                sum0_l += temp_lp[0] * temp2_lp[0];
                temp_lp = temp_lp + (sz4 * sz1);
                temp2_lp = temp2_lp + sz1;
            }
            if (sum0_l < 0)
            {
                sum0_l *= C[0];
            }
            D[index++] = sum0_l;
            A_lp = A_lp + 1;
            B_lp = B_lp + 1;
        }
    }
}
```

Thanks for your suggestion. This is my code.

I’m calling this kernel from another function with gridDim = 1 and blockDim = 1, so it runs on a single thread. I’m not sure how to parallelize it using threads and blocks.
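One possible way to spread this work over threads and blocks, offered as a sketch rather than a tuned solution: every (i_l, j_l) iteration of the two outer loops writes its own element of D and does not depend on any other iteration, so each one can be handed to a separate thread. The names `my_kernel_parallel`, `run_parallel`, `reference`, and `check`, and the 16x16 block size, are illustrative assumptions and not part of the original code. The sketch also assumes A holds at least `(sz3 - 1) * sz4 * sz1 + sz2 * sz1` floats and B holds `sz3 * sz1` floats.

```cuda
#include <cassert>
#include <cmath>
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

// One thread per output element D[i * sz1 + j]; same arithmetic as my_kernel.
__global__ void my_kernel_parallel(const float *A, const float *B, const float *C,
                                   float *D, int sz1, int sz2, int sz3, int sz4)
{
    int j = blockIdx.x * blockDim.x + threadIdx.x;  // replaces the j_l loop
    int i = blockIdx.y * blockDim.y + threadIdx.y;  // replaces the i_l loop
    if (i >= sz2 || j >= sz1) return;               // threads past the edge do nothing

    float sum = 0.0f;
    for (int k = 0; k < sz3; k++)
        sum += A[i * sz1 + j + k * sz4 * sz1] * B[j + k * sz1];
    if (sum < 0.0f) sum *= C[0];
    D[i * sz1 + j] = sum;                           // index++ in the serial code
}

// CPU reference with the same indexing, used only to check the GPU result.
std::vector<float> reference(const std::vector<float> &A, const std::vector<float> &B,
                             float c, int sz1, int sz2, int sz3, int sz4)
{
    std::vector<float> D(sz1 * sz2);
    for (int i = 0; i < sz2; i++)
        for (int j = 0; j < sz1; j++) {
            float sum = 0.0f;
            for (int k = 0; k < sz3; k++)
                sum += A[i * sz1 + j + k * sz4 * sz1] * B[j + k * sz1];
            D[i * sz1 + j] = sum < 0.0f ? sum * c : sum;
        }
    return D;
}

// Abort with a message if a CUDA runtime call failed.
static void check(cudaError_t err)
{
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        exit(1);
    }
}

// Hypothetical host wrapper: copies inputs over, launches a 2D grid, copies D back.
std::vector<float> run_parallel(const std::vector<float> &A, const std::vector<float> &B,
                                float c, int sz1, int sz2, int sz3, int sz4)
{
    std::vector<float> D(sz1 * sz2);
    float *dA, *dB, *dC, *dD;
    check(cudaMalloc(&dA, A.size() * sizeof(float)));
    check(cudaMalloc(&dB, B.size() * sizeof(float)));
    check(cudaMalloc(&dC, sizeof(float)));
    check(cudaMalloc(&dD, D.size() * sizeof(float)));
    check(cudaMemcpy(dA, A.data(), A.size() * sizeof(float), cudaMemcpyHostToDevice));
    check(cudaMemcpy(dB, B.data(), B.size() * sizeof(float), cudaMemcpyHostToDevice));
    check(cudaMemcpy(dC, &c, sizeof(float), cudaMemcpyHostToDevice));

    dim3 block(16, 16);                              // assumed block shape
    dim3 grid((sz1 + block.x - 1) / block.x,         // round up so every (i, j)
              (sz2 + block.y - 1) / block.y);        // is covered by some thread
    my_kernel_parallel<<<grid, block>>>(dA, dB, dC, dD, sz1, sz2, sz3, sz4);
    check(cudaGetLastError());
    check(cudaMemcpy(D.data(), dD, D.size() * sizeof(float), cudaMemcpyDeviceToHost));

    cudaFree(dA); cudaFree(dB); cudaFree(dC); cudaFree(dD);
    return D;
}

int main()
{
    const int sz1 = 37, sz2 = 5, sz3 = 4, sz4 = 5;   // small made-up test sizes
    std::vector<float> A(sz1 * sz4 * sz3), B(sz1 * sz3);
    for (size_t n = 0; n < A.size(); n++) A[n] = float(n % 7) - 3.0f;
    for (size_t n = 0; n < B.size(); n++) B[n] = float(n % 5) - 2.0f;

    std::vector<float> gpu = run_parallel(A, B, 0.1f, sz1, sz2, sz3, sz4);
    std::vector<float> cpu = reference(A, B, 0.1f, sz1, sz2, sz3, sz4);
    for (size_t n = 0; n < gpu.size(); n++)
        assert(std::fabs(gpu[n] - cpu[n]) < 1e-4f);
    printf("GPU and CPU results agree\n");
    return 0;
}
```

Mapping `j` to the x dimension means neighbouring threads read neighbouring elements of A and B, which generally suits the GPU memory system. The inner `k` loop is left serial inside each thread.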

Then you should start by learning some basic CUDA fundamentals. I’ve already given you a link that can help you get started. You can cover that whole course in 20-30 hours. At a minimum, you should understand how to write a proper parallel vector-add code from scratch before trying to tackle more complicated things.
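For reference, a parallel vector add of the kind mentioned above might look roughly like the following. This is a minimal sketch and is not taken from the linked course; the names `vector_add`, `add_on_gpu`, and `check`, and the block size of 256, are assumptions made for illustration.

```cuda
#include <cassert>
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

// Each thread adds one pair of elements.
__global__ void vector_add(const float *a, const float *b, float *c, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (idx < n)                                      // the last block may overshoot n
        c[idx] = a[idx] + b[idx];
}

// Abort with a message if a CUDA runtime call failed.
static void check(cudaError_t err)
{
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        exit(1);
    }
}

// Hypothetical host wrapper: allocate, copy in, launch, copy out, free.
std::vector<float> add_on_gpu(const std::vector<float> &a, const std::vector<float> &b)
{
    int n = static_cast<int>(a.size());
    size_t bytes = n * sizeof(float);
    std::vector<float> c(n);

    float *da, *db, *dc;
    check(cudaMalloc(&da, bytes));
    check(cudaMalloc(&db, bytes));
    check(cudaMalloc(&dc, bytes));
    check(cudaMemcpy(da, a.data(), bytes, cudaMemcpyHostToDevice));
    check(cudaMemcpy(db, b.data(), bytes, cudaMemcpyHostToDevice));

    int threads = 256;                                // assumed block size
    int blocks = (n + threads - 1) / threads;         // round up to cover all n
    vector_add<<<blocks, threads>>>(da, db, dc, n);
    check(cudaGetLastError());
    check(cudaMemcpy(c.data(), dc, bytes, cudaMemcpyDeviceToHost));

    cudaFree(da); cudaFree(db); cudaFree(dc);
    return c;
}

int main()
{
    const int n = 1000;                               // deliberately not a multiple of 256
    std::vector<float> a(n), b(n);
    for (int i = 0; i < n; i++) { a[i] = float(i); b[i] = 2.0f * float(i); }

    std::vector<float> c = add_on_gpu(a, b);
    for (int i = 0; i < n; i++)
        assert(c[i] == 3.0f * float(i));              // small integers, exact in float
    printf("vector add OK\n");
    return 0;
}
```

The index calculation and the bounds check in `vector_add` are the two pieces that carry over to more complicated kernels: compute which element this thread owns, then skip the work if that element does not exist.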
