Effective Parallelisation of CUDA C code

Hello everyone, I hope you are all safe and healthy in these tough COVID conditions.

I have been given a complex C code that performs various calculations, and I have written the corresponding CUDA C code for those calculations. Now my job is to parallelise the code so that the computational time decreases. I have also worked through some basic parallelisation examples. I need some suggestions on how to start parallelising my CUDA code; the relevant snippet is attached below.
A big thanks in advance.

//////////////Pre-calculation of tot_cnlm_sum_p for every atom/////////////////

for(h=0; h<natoms; h++)
{
  for(jt=0; jt<natoms; jt++)
  {
    if(h != jt)
    {
      dx = (x11[h]-x11[jt]);
      dy = (y11[h]-y11[jt]);
      dz = (z11[h]-z11[jt]);
      r2 = dx*dx + dy*dy + dz*dz;
      r  = sqrtf(r2);
      pcolhea1 = acosf(dz/r);

      for(l=0; l<lmax; l++)
      {
        if(l==0)
        {
          Lmn[0] = 1.0;
          Lmn_all[jt][0][0] = 1.0;
        }
        else
        {
          legendre(lmax,l,pcolhea1,Lmn);
        }
        for(b_m=0; b_m<19; b_m++)
        {
          if(b_m < 2*l+1)
          {
            m = -l + b_m;
            if(m >= 0)
            {
              Lmn_all[jt][l][m] = Lmn[m];
            }
          }
        }
      }
    }
  }
  k2 = 9;
  for(n=0; n<etaang; n++)
  {
    for(ll=0; ll<lmax; ll++)
    {
      p_ch_p = make_cuDoubleComplex(0,0);
      for(b_mm=0; b_mm<19; b_mm++)
      {
        if(b_mm < 2*ll+1)
        {
          xt2 = make_cuDoubleComplex(0,0);
          for(ju=0; ju<natoms; ju++)
          {
            if(h != ju)
            {
              dx = (x11[h]-x11[ju]);
              dy = (y11[h]-y11[ju]);
              dz = (z11[h]-z11[ju]);
              r2 = dx*dx + dy*dy + dz*dz;
              r  = sqrtf(r2);
              fophi1 = atan2f(dy,dx);
              if(r <= rcut)
              {
                rcutfun = make_cuDoubleComplex(0.5*(cosf(pi*r*rcut_inv)+1),0);
              }
              else
              {
                rcutfun = make_cuDoubleComplex(0,0);
              }
              //func1 = expf(-netaa[n]*r2);
              temp50 = -netaa[n]*r2;
              temp51 = make_cuDoubleComplex(temp50,0);
              func1  = my_complex_exp(temp51);
              term1  = cuCmul(func1,rcutfun);
              G_p    = term1;
              mm    = -ll + b_mm;
              m_1   = abs(mm);
              m_1_1 = m_1*hv;
              if(mm < 0)
              {
                temp1 = cuCmul(q, make_cuDoubleComplex(m_1_1,0));
                ex1 = cuCmul(temp1, make_cuDoubleComplex(fophi1,0));
                comp_1_w = my_complex_exp(ex1);
              }
              else
              {
                temp2 = cuCmul(q, make_cuDoubleComplex(m_1,0));
                ex2 = cuCmul(temp2, make_cuDoubleComplex(fophi1,0));
                ex3 = my_complex_exp(ex2);
                float result = 1.0;
                int zz;
                for(zz=9; zz>=1; zz--)
                {
                  if(zz <= mm)
                  {
                    result *= (hv);
                  }
                }
                comp_1_w = cuCmul(ex3, make_cuDoubleComplex(result, 0));
              }
              cons1_temp = cons1[ll][b_mm];
              mult1 = cuCmul(comp_1_w, make_cuDoubleComplex(cons1_temp, 0));
              Lmmnn = Lmn_all[ju][ll][m_1];
              useY1 = cuCmul(mult1, make_cuDoubleComplex(Lmmnn, 0));
              mult4 = cuCmul(useY1, G_p);
              xt2 = cuCadd(xt2, mult4);
            }
          }
          xt6 = cuConj(xt2);
          temp3 = cuCmul(xt2, xt6);
          p_ch_p = cuCadd(p_ch_p, temp3);    //p_ch_p = p_ch_p + (xt2 * xt6);
          tot_cnlm_sum_p[h][n][ll][b_mm] = xt2;
        }
      }
      sub1 = 1/(gtotmax_f[k2]-gtotmin_f[k2]);
      cons2 = ((4*pi)/(float)(2*ll+1));
      temp4 = cuCmul(p_ch_p, make_cuDoubleComplex(cons2,0));
      temp5[k2] = meangtot_f[k2] + gtotmin_f[k2];
      temp100 = temp5[k2];
      temp6 = cuCsub(temp4, make_cuDoubleComplex(temp100,0));
      GiS_p[h][k2] = cuCmul(temp6, make_cuDoubleComplex(sub1,0));
      k2 = k2+1;
    }
  }
}

When posting code on these forums, please use proper formatting. At a minimum, that means selecting your code after pasting it in, and pressing the </> button at the top of the edit window. You can edit your posted code to fix this.

@Robert_Crovella Yeah, I have edited the code and it is now properly formatted. Please help me with my concern.
Thank you.

Hi harshit96,
the last time you asked this question (Please help for using complex numbers in cuda code - #13 by harshit96) you were given some advice. How did that turn out?

No, you haven’t followed my instructions, and have not done it correctly.

@striker159 Hi Sir, yes, last time I asked how to convert that C code into the corresponding CUDA code, and after debugging, that CUDA code is running perfectly fine. Now I want to parallelise the code so as to decrease the computational time. I have attached a snippet of the compute-intensive part, which has multiple for loops. Can you please give your suggestions? They would be a great help to me.
Thanks

@Robert_Crovella I tried but was not able to follow your point. Please help me with my concern.

  1. Click the button below your post to edit your post.
  2. Using click-drag to select a range of text, select the code.
  3. Click the </> button at the top of the edit window.
  4. Save your edits.

Yeah, I edited it in this way. I will take care of these steps next time. Thank you.
Please help me with my concern; it would be really helpful for me.
Thanks for your time.

The usual way to begin to parallelize such a code would be across one or both of the two outer loops.

Could you please show me, by writing a few lines, how to start with one outer loop using threads/blocks? With that idea, I would try to parallelise my complete code.
Thank you in advance.

That structure at the beginning of the code often suggests parallelization like this:

__global__ void k(...){
 
  int h = blockIdx.x*blockDim.x+threadIdx.x;
  ... // obviously there must be more code here
  if (h < natoms)
    for(jt=0; jt<natoms;jt++)
    {
    // and all the rest of the code is approximately the same

    }
}

You would launch such a kernel like so:

  int threads = 256;
  int blocks = (natoms+threads-1)/threads;
  k<<<blocks, threads>>>(...);

Thank you so much for sparing your time and helping me. Could you please tell me how you decided the number of blocks (here, (natoms + threads - 1)/threads)?

This is basic CUDA programming. Please spend some time here. Review the first 4 sessions, at least. This particular idea is covered in many questions on internet forums as well, such as this one, and you will find a similar construct in the CUDA vectorAdd sample code.
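For reference, the expression computes the ceiling of natoms/threads in integer arithmetic, so enough blocks are launched for every element to get a thread; the if (h < natoms) guard in the kernel then masks off the surplus threads in the last block. A minimal sketch of the arithmetic, with illustrative numbers:

  // Ceiling division: launch enough blocks of `threads` threads each to cover
  // natoms elements. Example with natoms = 1000, threads = 256:
  //   blocks = (1000 + 256 - 1) / 256 = 1255 / 256 = 4   -> 4 * 256 = 1024 >= 1000
  // Note the parentheses: natoms + threads - 1/threads would evaluate 1/threads
  // first (integer 0), since / binds tighter than + and -.
  int threads = 256;
  int blocks  = (natoms + threads - 1) / threads;  // integer ceil(natoms / threads)
  k<<<blocks, threads>>>(...);                     // surplus threads fail h < natoms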

Okay, I will first go through the links you shared.
Thank you so much.
I really appreciate it.

@Robert_Crovella Hi, I tried to implement the above code with the help of your suggestions. Below is the part of the code to be run on the GPU.

__global__ void fc(float *Lmn, float *Lmn_all){
  int h = blockIdx.x*blockDim.x+threadIdx.x;

  //////////////Pre-calculation of tot_cnlm_sum_p for every atom/////////////////
  if(h < natoms)
  {
    for(jt=0; jt<natoms; jt++)
    {
      if(h != jt)
      {
        dx = (x11[h]-x11[jt]);
        dy = (y11[h]-y11[jt]);
        dz = (z11[h]-z11[jt]);
        r2 = dx*dx + dy*dy + dz*dz;
        r  = sqrtf(r2);
        pcolhea1 = acosf(dz/r);
        for(l=0; l<lmax; l++)
        {
          if(l==0)
          {
            Lmn[0] = 1.0;
            Lmn_all[jt][0][0] = 1.0;
          }
          else
          {
            legendre(lmax,l,pcolhea1,Lmn);
          }
          for(b_m=0; b_m<19; b_m++)
          {
            if(b_m < 2*l+1)
            {
              m = -l + b_m;
              if(m >= 0)
              {
                Lmn_all[jt][l][m] = Lmn[m];
              }
            }
          }
        }
      }
    }
    cuDoubleComplex temp4, temp6;
    //float temp5[9];
    k2 = 0;
    for(n=0; n<etaang; n++) // loop 15.2
    {
      for(ll=0; ll<lmax; ll++)
      {
        p_ch_p = make_cuDoubleComplex(0,0);
        for(b_mm=0; b_mm<19; b_mm++)
        {
          if(b_mm < 2*ll+1)
          {
            xt2 = make_cuDoubleComplex(0,0);
            for(ju=0; ju<natoms; ju++)
            {
              if(h != ju)
              {
                dx = (x11[h]-x11[ju]);
                dy = (y11[h]-y11[ju]);
                dz = (z11[h]-z11[ju]);
                r2 = dx*dx + dy*dy + dz*dz;
                r  = sqrtf(r2);
                fophi1 = atan2f(dy,dx);
                if(r <= rcut)
                {
                  rcutfun = 0.5*(cosf(pi*r*rcut_inv)+1);
                }
                else
                {
                  rcutfun = 0.0;
                }
                func1 = expf(-netaa[n]*r2);
                term1 = func1*rcutfun;
                G_p   = term1;
                mm  = -ll + b_mm;
                m_1 = abs(mm);
                if(mm < 0)
                {
                  cuDoubleComplex temp1 = cuCmul(q, make_cuDoubleComplex(-m_1,0));
                  cuDoubleComplex ex1 = cuCmul(temp1, make_cuDoubleComplex(fophi1,0));
                  cuDoubleComplex comp_1_w = my_complex_exp(ex1);
                  //comp_1_w = cexpf(ex1);
                }
                else
                {
                  cuDoubleComplex temp2 = cuCmul(q, make_cuDoubleComplex(m_1,0));
                  cuDoubleComplex ex2 = cuCmul(temp2, make_cuDoubleComplex(fophi1,0));
                  cuDoubleComplex ex3 = my_complex_exp(ex2);
                  //ex3 = cexpf(ex2);
                  float result = 1.0;
                  int zz;
                  for(zz=9; zz>=1; zz--)
                  {
                    if(zz <= mm)
                    {
                      result *= (-1.0);
                    }
                  }
                  cuDoubleComplex comp_1_w = cuCmul(ex3, make_cuDoubleComplex(result, 0));
                }
                cuDoubleComplex mult1 = cuCmul(comp_1_w, make_cuDoubleComplex(cons1[11][b_mm], 0));
                Lmmnn = Lmn_all[ju][ll][m_1];
                cuDoubleComplex useY1 = cuCmul(mult1, make_cuDoubleComplex(Lmmnn, 0));
                cuDoubleComplex mult4 = cuCmul(useY1, make_cuDoubleComplex(G_p, 0));
                cuDoubleComplex xt2 = cuCadd(xt2, mult4);
                //xt2 = xt2 + mult4;
              }
            }
            xt6 = cuConj(xt2);
            cuDoubleComplex temp3 = cuCmul(xt2, xt6);
            p_ch_p = cuCadd(p_ch_p, temp3);
            //p_ch_p = p_ch_p + (xt2 * xt6);
            tot_cnlm_sum_p[h][n][ll][b_mm] = xt2;
          }
        }
        sub1 = 1/(gtotmax_f[k2]-gtotmin_f[k2]);
        cons2 = ((4*pi)/(float)(2*ll+1));
        temp4 = cuCmul(p_ch_p, make_cuDoubleComplex(cons2,0));
        temp5[k2] = meangtot_f[k2] + gtotmin_f[k2];
        temp6 = cuCsub(temp4, make_cuDoubleComplex(temp5[k2],0));
        GiS_p[h][k2] = cuCmul(temp6, make_cuDoubleComplex(sub1,0));
        //GiS_p[h][k2] = ((cons2 * p_ch_p) - meangtot_f[k2] - gtotmin_f[k2])*(sub1);
        k2 = k2+1;
      }
    }
  }
}

But in the above snippet, I am getting errors on the following two lines:

Lmn_all[jt][0][0] = 1.0;
Lmn_all[jt][l][m]= Lmn[m];

The error message says "expression must have pointer-to-object type".
I searched on Stack Overflow as well but did not find any solution.
Please help me remove these two errors.
Thank you so much.

This has nothing to do with CUDA and is purely a function of understanding C++ programming. I suggest you brush up on your C++ programming skills.

In C++, if you declare a pointer like this:

float *Lmn_all

You can, at most, dereference that pointer once:

 Lmn_all[z] = x;

You cannot dereference that pointer three times:

Lmn_all[jt][0][0]
        ^   ^  ^ 
        |   |  Third dereference
        |   Second dereference
        First Dereference
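
A common fix, for reference, is to keep the parameter as a flat pointer and linearize the 3D index by hand. A minimal sketch, assuming the natoms x lmax x 19 dimensions used in the posted code (the helper name lmn_idx is illustrative, not part of your code):

// Hypothetical helper: linearize (jt, l, m) into a flat natoms*lmax*19 array.
__device__ __forceinline__ int lmn_idx(int jt, int l, int m, int lmax)
{
  return (jt * lmax + l) * 19 + m;   // row-major layout, innermost dimension = 19
}

// Then, instead of  Lmn_all[jt][l][m] = Lmn[m];  write:
// Lmn_all[lmn_idx(jt, l, m, lmax)] = Lmn[m];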

My time that I can spend on these forums is limited. I need to prioritize it for CUDA questions. I won’t be able to respond to further questions of this type.

Thanks for replying. I will stick to only CUDA-related questions on this platform. Those errors related to multidimensional arrays have been resolved by dereferencing correctly.

In my CUDA code, there are many variables that are used inside the __global__ kernel as part of the code to be run on the GPU, but they were declared globally in the remaining (host) part of the C code.
Do I have to create a copy of each variable for the __global__ kernel? I am getting the following errors:

So do I have to create separate variables such as dev_natoms, dev_jt, dev_x11, etc.?
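
For reference, the usual pattern is to allocate a device copy of each host array with cudaMalloc, copy the data over with cudaMemcpy, and pass scalars such as natoms to the kernel by value; loop counters such as jt are simply declared as locals inside the kernel. A minimal sketch, where the dev_ name and the extended kernel signature are purely illustrative:

// Illustrative sketch: dev_x11 is a hypothetical device copy of the host array x11.
float *dev_x11;
cudaMalloc(&dev_x11, natoms * sizeof(float));
cudaMemcpy(dev_x11, x11, natoms * sizeof(float), cudaMemcpyHostToDevice);

// Scalars (natoms) are passed by value, so no dev_natoms is needed.
fc<<<blocks, threads>>>(dev_x11, natoms);

cudaMemcpy(x11, dev_x11, natoms * sizeof(float), cudaMemcpyDeviceToHost); // if results come back
cudaFree(dev_x11);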

@Robert_Crovella Hello Sir, I have gone through the documentation and successfully removed all the errors. Now the output values are coming out as zero, so I think there might be some problem with the intermediate steps/variables, and I will have to debug by printing and checking values from top to bottom. Some variables are declared and used inside the __global__ kernel, and I want to check whether these variables are getting the correct values. What is the correct way of doing that, as I can't seem to use a print statement directly?
Please give your suggestions; I am learning a lot from them.
Thank you

I’m not sure why that would be. I generally don’t have trouble using printf in-kernel.

Before entering what I would call “typical debug”, I would ensure that:

  1. I am doing rigorous error checking, and that no errors are being reported.
  2. My code reports no errors when run with compute-sanitizer or cuda-memcheck (one or the other, depending on your GPU type). If errors are reported at this step, I would probably use the method described here to localize those errors, in an effort to sort them out.

After completing those steps successfully, I don’t think you’ll have any trouble using in-kernel printf, to proceed with “typical debug”.
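
For reference, a minimal sketch of both points, an error-checking macro plus in-kernel printf; the macro name CUDA_CHECK and the probe kernel are illustrative, not part of the code above:

#include <cstdio>
#include <cstdlib>

// Illustrative error-checking macro to wrap every CUDA runtime call.
#define CUDA_CHECK(call)                                          \
  do {                                                            \
    cudaError_t err = (call);                                     \
    if (err != cudaSuccess) {                                     \
      fprintf(stderr, "CUDA error %s at %s:%d\n",                 \
              cudaGetErrorString(err), __FILE__, __LINE__);       \
      exit(1);                                                    \
    }                                                             \
  } while (0)

// In-kernel printf: output is flushed when the device synchronizes.
__global__ void probe(const float *x, int n)
{
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) printf("x[%d] = %f\n", i, x[i]);
}

// Usage:
//   probe<<<blocks, threads>>>(dev_x, n);
//   CUDA_CHECK(cudaGetLastError());       // catches launch-time errors
//   CUDA_CHECK(cudaDeviceSynchronize());  // catches async errors, flushes printf
// and run the binary under:  compute-sanitizer ./app   (or cuda-memcheck ./app)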

A tutorial on basic CUDA debugging is given in session 12 of the online training series I previously mentioned in my post of November 13th in this thread.