__global__ void my_kernel
(
    float *A,
    float *B,
    float *C,
    float *D,
    int sz1,
    int sz2,
    int sz3,
    int sz4
)
{
    int i_l, j_l, k_l;
    float *A_lp, *B_lp;
    float *temp_lp, *temp2_lp;
    float sum0_l = 0;
    int index = 0;
    for (i_l = 0; i_l < sz2; i_l++)
    {
        A_lp = &A[i_l * sz1];
        B_lp = &B[0];
        for (j_l = 0; j_l < sz1; j_l++)
        {
            temp_lp = A_lp;
            temp2_lp = B_lp;
            sum0_l = 0;
            // strided dot product over k
            for (k_l = 0; k_l < sz3; k_l++)
            {
                sum0_l += temp_lp[0] * temp2_lp[0];
                temp_lp = temp_lp + (sz4 * sz1);
                temp2_lp = temp2_lp + sz1;
            }
            // "leaky" scaling of negative sums by C[0]
            if (sum0_l < 0)
            {
                sum0_l *= C[0];
            }
            D[index++] = sum0_l;
            A_lp = A_lp + 1;
            B_lp = B_lp + 1;
        }
    }
}
This is the basic implementation of my kernel function. A, B, C, and D are device pointers.
I want to optimize this as much as possible. Any suggestions or help would be greatly appreciated.
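One common approach to parallelizing loops like these is to assign one thread per output element D[i * sz1 + j], since each output is computed independently. Below is a minimal sketch of that idea; the kernel name, block size, and launch configuration are illustrative choices, not part of the original code:

```cuda
// Sketch: one thread per output element, assuming D has sz2 * sz1 entries.
__global__ void my_kernel_parallel(const float *A, const float *B,
                                   const float *C, float *D,
                                   int sz1, int sz2, int sz3, int sz4)
{
    int j = blockIdx.x * blockDim.x + threadIdx.x;  // column index (was j_l)
    int i = blockIdx.y * blockDim.y + threadIdx.y;  // row index (was i_l)
    if (i >= sz2 || j >= sz1) return;               // guard partial blocks

    float sum = 0.0f;
    // Same strided dot product as the serial inner loop
    for (int k = 0; k < sz3; k++)
        sum += A[i * sz1 + j + k * sz4 * sz1] * B[j + k * sz1];

    if (sum < 0)
        sum *= C[0];          // "leaky" scaling, as in the original
    D[i * sz1 + j] = sum;     // replaces the serial index++ counter
}

// Illustrative launch: a 16x16 block, with enough blocks to cover the output
// dim3 block(16, 16);
// dim3 grid((sz1 + block.x - 1) / block.x, (sz2 + block.y - 1) / block.y);
// my_kernel_parallel<<<grid, block>>>(A, B, C, D, sz1, sz2, sz3, sz4);
```

The bounds check is needed because the grid rounds up to whole blocks, so some threads may fall outside the sz2 x sz1 output.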
Hi, I’m very new to CUDA programming.
Can someone please help me with this?
If anyone knows how to implement the above code using cuBLAS APIs, please give me some pointers.
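On the cuBLAS question: each output element here is a strided dot product, so it can in principle be expressed with cublasSdot, though issuing sz1 * sz2 tiny dot products is usually far slower than one fused kernel, and the conditional scaling by C[0] would still need host-side (or separate kernel) work. A hedged sketch only, where the function name and the host-side C_host/D_host buffers are my assumptions (cublasSdot in the default pointer mode returns its result to the host):

```cuda
#include <cublas_v2.h>

// Sketch: one cublasSdot call per output element. Not recommended for
// performance; shown only to map the loop structure onto a cuBLAS call.
void compute_with_cublas(cublasHandle_t handle,
                         const float *A, const float *B,   // device pointers
                         const float *C_host, float *D_host,
                         int sz1, int sz2, int sz3, int sz4)
{
    for (int i = 0; i < sz2; i++) {
        for (int j = 0; j < sz1; j++) {
            float sum;
            // dot over k of A[i*sz1 + j + k*sz4*sz1] and B[j + k*sz1]
            cublasSdot(handle, sz3,
                       A + i * sz1 + j, sz4 * sz1,   // x and its stride incx
                       B + j,           sz1,          // y and its stride incy
                       &sum);
            if (sum < 0)
                sum *= C_host[0];                     // leaky scaling on host
            D_host[i * sz1 + j] = sum;
        }
    }
}
```

Note that this is not a single GEMM: each output pairs one strided column of A with one strided column of B elementwise, rather than every row with every column, so the standard cublasSgemm shape does not directly apply.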
Thanks for your suggestion. The code above is mine.
Inside another function I’m calling this kernel with gridDim = 1 and blockDim = 1. I’m not sure how to parallelize this using threads and blocks.
Then you should start by learning some basic CUDA fundamentals. I’ve already given you a link that can help you get started; you can cover that whole course in 20-30 hours. At a minimum, you should start by understanding how to write a proper parallel vector-add program from scratch before trying to tackle more complicated things.
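For reference, the kind of starter vector-add program mentioned above typically looks like this; the names and sizes here are illustrative:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// One thread per element: thread i computes c[i] = a[i] + b[i].
__global__ void vec_add(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n)                                      // guard the last block
        c[i] = a[i] + b[i];
}

int main()
{
    const int n = 1 << 20;
    float *a, *b, *c;
    // Managed memory is accessible from both host and device
    cudaMallocManaged(&a, n * sizeof(float));
    cudaMallocManaged(&b, n * sizeof(float));
    cudaMallocManaged(&c, n * sizeof(float));
    for (int i = 0; i < n; i++) { a[i] = 1.0f; b[i] = 2.0f; }

    int block = 256;
    int grid = (n + block - 1) / block;  // enough blocks to cover all n
    vec_add<<<grid, block>>>(a, b, c, n);
    cudaDeviceSynchronize();             // wait for the kernel to finish

    printf("c[0] = %f\n", c[0]);
    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}
```

The same pattern (compute a global index, bounds-check it, do one element's work) is exactly what generalizes to the 2D kernel asked about in this thread.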