Hwo to optimize iterative code with a lot of data transferring

I have a code which I attached at the end and I think it is not efficient. Compared with the corresponding Matlab code, the speedup of this code is 7-15. As the function global static void FunctionEvaluation becomes more complicated, I got more speedup. ( this is what we expect)

But the speedup over MATLAB drops when there are more segments, or the “N”, which is size of the matrix and vector, increase. ( This is a little surprising as I thought GPU could be more efficient than MATLAB for large matrix-vector computation)

And I think followings are some possible inefficiencies there:

• 1. Some are run on CPU and some on GPU, and this situation switches many times.

• 2.All the matrix-vector computation are computed using CUBLAS, but variable definition and saving, and checking the condition are done on CPU.

Some explanations of the code:

• There are two loops here: outer loop is run through segment, inner loop is through iteration. (for the 7-15 speedup, outer loop run once and inner loop run 12 times)

• A vector called h_V0_x will be reset after each inner loop by a scalar called vx0_segment.

• The inner loop called a global static void FunctionEvaluation, which computes some forces on N+1 threads, and this is the place to hope to get GPU benefits the most.

• After getting d_Vxnew, its difference from the old value d_Vxold is computed, and this difference is used as the condition of two branches. There is another embedded branch.

• After existing the inner loop, a new value of vx0_segment is set. A vector called h_a_x is computed and saved in a larger matrix called a_x_his.

Thank you very much for any comment or suggestion!

for (segMent=1;segMent<Nsegment+1;segMent++){

    for (i=0;i<=N;i++){

        h_V0_x[i]=vx0_segment;%%%%define a constant vector, reset vx0_segment at each iteartion


    status = cublasSetVector(dimension, sizeof(h_X0_x[0]), h_X0_x, 1, d_X0_x, 1);


            NumFuncCall= NumFuncCall+1;

            FunctionEvaluation<<<Block_Num,Thread_Num,0>>>(d_Xold,d_Yold,d_Zold,d_fvx,d_fvy,d_fvz);%%%   FunctionEvaluation is __global__ static void


            cublasSgemv('n', dimension, dimension, 1, d_Cbf, dimension, d_fvx, 1, 1, d_VXnew, 1);%%d_VXnew=d_Cbf*d_fvx+d_VXnew;



            maxIndex_vx=cublasIsamax (dimension, d_VXold, 1)-1;%%%find the max of the magnitude

            status = cublasGetVector(dimension, sizeof(h_VXold[0]), d_VXold, 1, h_VXold, 1);





                cublasSgemv('n', dimension, dimension, 1, d_CbfT2tao, dimension, d_VXnew, 1, 1, d_Xold, 1); }%%d_Xold=d_CbfT2tao*d_VXnew+d_Xold




                    cublasSgemv('n', dimension, dimension, 1, d_CbfT2tao, dimension, d_VXnew, 1, 1, d_Xold, 1);

                    if(e_new<deltaLimit && e_old<deltaLimit){






    status = cublasGetVector(N+1, sizeof(h_Xold[0]), d_Xold, 1, h_Xold, 1);%%%after existing the iteration, reset vx0_segment and compute/save h_a_x


    cublasSgemv('n', dimension, dimension, 1, d_TV, dimension, d_Xold, 1, 0, d_a_x, 1);%%%d_a_x=d_TV*d_Xold;

    status = cublasGetVector(N+1, sizeof(h_a_x[0]), d_a_x, 1, h_a_x, 1);


        a_x_his[(N+1)*(segMent-1)+i]=h_a_x[i];%%%save h_a_x of each segment



            printf("max iteration did not converge");