# Hwo to optimize iterative code with a lot of data transferring

I have a code which I attached at the end and I think it is not efficient. Compared with the corresponding Matlab code, the speedup of this code is 7-15. As the function global static void FunctionEvaluation becomes more complicated, I got more speedup. ( this is what we expect)

But the speedup over MATLAB drops when there are more segments, or the â€œNâ€, which is size of the matrix and vector, increase. ( This is a little surprising as I thought GPU could be more efficient than MATLAB for large matrix-vector computation)

And I think followings are some possible inefficiencies there:

â€¢ 1. Some are run on CPU and some on GPU, and this situation switches many times.

â€¢ 2.All the matrix-vector computation are computed using CUBLAS, but variable definition and saving, and checking the condition are done on CPU.

Some explanations of the code:

â€¢ There are two loops here: outer loop is run through segment, inner loop is through iteration. (for the 7-15 speedup, outer loop run once and inner loop run 12 times)

â€¢ A vector called h_V0_x will be reset after each inner loop by a scalar called vx0_segment.

â€¢ The inner loop called a global static void FunctionEvaluation, which computes some forces on N+1 threads, and this is the place to hope to get GPU benefits the most.

â€¢ After getting d_Vxnew, its difference from the old value d_Vxold is computed, and this difference is used as the condition of two branches. There is another embedded branch.

â€¢ After existing the inner loop, a new value of vx0_segment is set. A vector called h_a_x is computed and saved in a larger matrix called a_x_his.

Thank you very much for any comment or suggestion!

``````for (segMent=1;segMent<Nsegment+1;segMent++){

for (i=0;i<=N;i++){

h_V0_x[i]=vx0_segment;%%%%define a constant vector, reset vx0_segment at each iteartion

}

status = cublasSetVector(dimension, sizeof(h_X0_x[0]), h_X0_x, 1, d_X0_x, 1);

for(iterations=1;iterations<maxIteration;iterations++){

NumFuncCall= NumFuncCall+1;

FunctionEvaluation<<<Block_Num,Thread_Num,0>>>(d_Xold,d_Yold,d_Zold,d_fvx,d_fvy,d_fvz);%%%   FunctionEvaluation is __global__ static void

cublasScopy(N+1,d_V0_x,1,d_VXnew,1);%%%d_VXnew=d_V0_x;

cublasSgemv('n', dimension, dimension, 1, d_Cbf, dimension, d_fvx, 1, 1, d_VXnew, 1);%%d_VXnew=d_Cbf*d_fvx+d_VXnew;

cublasScopy(N+1,d_X0_x,1,d_Xold,1);%%d_Xold=d_X0_x;

cublasSaxpy(dimension,-1,d_VXnew,1,d_VXold,1);%%d_VXold=d_VXold-d_VXnew

maxIndex_vx=cublasIsamax (dimension, d_VXold, 1)-1;%%%find the max of the magnitude

status = cublasGetVector(dimension, sizeof(h_VXold[0]), d_VXold, 1, h_VXold, 1);

maxE=abs(h_VXold[maxIndex_vx]);

if(iterations==1){

e_old=maxE;

cublasScopy(N+1,d_VXnew,1,d_VXold,1);%%d_VXold=d_VXnew

cublasSgemv('n', dimension, dimension, 1, d_CbfT2tao, dimension, d_VXnew, 1, 1, d_Xold, 1); }%%d_Xold=d_CbfT2tao*d_VXnew+d_Xold

else{

e_new=maxE;

cublasScopy(N+1,d_VXnew,1,d_VXold,1);

cublasSgemv('n', dimension, dimension, 1, d_CbfT2tao, dimension, d_VXnew, 1, 1, d_Xold, 1);

if(e_new<deltaLimit && e_old<deltaLimit){

break;

else

e_old=e_new;

}

}

status = cublasGetVector(N+1, sizeof(h_Xold[0]), d_Xold, 1, h_Xold, 1);%%%after existing the iteration, reset vx0_segment and compute/save h_a_x

x0_segment=h_Xold[0];

cublasSgemv('n', dimension, dimension, 1, d_TV, dimension, d_Xold, 1, 0, d_a_x, 1);%%%d_a_x=d_TV*d_Xold;

status = cublasGetVector(N+1, sizeof(h_a_x[0]), d_a_x, 1, h_a_x, 1);

for(i=0;i<=N;i++){

a_x_his[(N+1)*(segMent-1)+i]=h_a_x[i];%%%save h_a_x of each segment

}

if(iterations==maxIteration){

printf("max iteration did not converge");

break;

}

}
``````