Hwo to optimize iterative code with a lot of data transferring

I have a code which I attached at the end and I think it is not efficient. Compared with the corresponding Matlab code, the speedup of this code is 7-15. As the function global static void FunctionEvaluation becomes more complicated, I got more speedup. ( this is what we expect)

But the speedup over MATLAB drops when there are more segments, or the “N”, which is size of the matrix and vector, increase. ( This is a little surprising as I thought GPU could be more efficient than MATLAB for large matrix-vector computation)

And I think followings are some possible inefficiencies there:

• 1. Some are run on CPU and some on GPU, and this situation switches many times.

• 2.All the matrix-vector computation are computed using CUBLAS, but variable definition and saving, and checking the condition are done on CPU.

Some explanations of the code:

• There are two loops here: outer loop is run through segment, inner loop is through iteration. (for the 7-15 speedup, outer loop run once and inner loop run 12 times)

• A vector called h_V0_x will be reset after each inner loop by a scalar called vx0_segment.

• The inner loop called a global static void FunctionEvaluation, which computes some forces on N+1 threads, and this is the place to hope to get GPU benefits the most.

• After getting d_Vxnew, its difference from the old value d_Vxold is computed, and this difference is used as the condition of two branches. There is another embedded branch.

• After existing the inner loop, a new value of vx0_segment is set. A vector called h_a_x is computed and saved in a larger matrix called a_x_his.

Thank you very much for any comment or suggestion!

for (segMent=1;segMent<Nsegment+1;segMent++){

    for (i=0;i<=N;i++){

        h_V0_x[i]=vx0_segment;%%%%define a constant vector, reset vx0_segment at each iteartion

        }

    status = cublasSetVector(dimension, sizeof(h_X0_x[0]), h_X0_x, 1, d_X0_x, 1);

   for(iterations=1;iterations<maxIteration;iterations++){

            NumFuncCall= NumFuncCall+1;

            FunctionEvaluation<<<Block_Num,Thread_Num,0>>>(d_Xold,d_Yold,d_Zold,d_fvx,d_fvy,d_fvz);%%%   FunctionEvaluation is __global__ static void

            cublasScopy(N+1,d_V0_x,1,d_VXnew,1);%%%d_VXnew=d_V0_x;

            cublasSgemv('n', dimension, dimension, 1, d_Cbf, dimension, d_fvx, 1, 1, d_VXnew, 1);%%d_VXnew=d_Cbf*d_fvx+d_VXnew;

            cublasScopy(N+1,d_X0_x,1,d_Xold,1);%%d_Xold=d_X0_x;

            cublasSaxpy(dimension,-1,d_VXnew,1,d_VXold,1);%%d_VXold=d_VXold-d_VXnew

            maxIndex_vx=cublasIsamax (dimension, d_VXold, 1)-1;%%%find the max of the magnitude

            status = cublasGetVector(dimension, sizeof(h_VXold[0]), d_VXold, 1, h_VXold, 1);

            maxE=abs(h_VXold[maxIndex_vx]);

            if(iterations==1){

                e_old=maxE;

                cublasScopy(N+1,d_VXnew,1,d_VXold,1);%%d_VXold=d_VXnew

                cublasSgemv('n', dimension, dimension, 1, d_CbfT2tao, dimension, d_VXnew, 1, 1, d_Xold, 1); }%%d_Xold=d_CbfT2tao*d_VXnew+d_Xold

            else{

                    e_new=maxE;

                    cublasScopy(N+1,d_VXnew,1,d_VXold,1);

                    cublasSgemv('n', dimension, dimension, 1, d_CbfT2tao, dimension, d_VXnew, 1, 1, d_Xold, 1);

                    if(e_new<deltaLimit && e_old<deltaLimit){

                        break;

                    else

                        e_old=e_new;

                    }                        

    }

    status = cublasGetVector(N+1, sizeof(h_Xold[0]), d_Xold, 1, h_Xold, 1);%%%after existing the iteration, reset vx0_segment and compute/save h_a_x

    x0_segment=h_Xold[0];

    cublasSgemv('n', dimension, dimension, 1, d_TV, dimension, d_Xold, 1, 0, d_a_x, 1);%%%d_a_x=d_TV*d_Xold;

    status = cublasGetVector(N+1, sizeof(h_a_x[0]), d_a_x, 1, h_a_x, 1);

    for(i=0;i<=N;i++){

        a_x_his[(N+1)*(segMent-1)+i]=h_a_x[i];%%%save h_a_x of each segment

    }

   if(iterations==maxIteration){

            printf("max iteration did not converge");

            break; 

    }

  }