I have a code which I attached at the end and I think it is not efficient. Compared with the corresponding Matlab code, the speedup of this code is 7-15. As the function **global** static void FunctionEvaluation becomes more complicated, I got more speedup. ( this is what we expect)

But the speedup over MATLAB drops when there are more segments, or the â€œNâ€, which is size of the matrix and vector, increase. ( This is a little surprising as I thought GPU could be more efficient than MATLAB for large matrix-vector computation)

And I think followings are some possible inefficiencies there:

â€¢ 1. Some are run on CPU and some on GPU, and this situation switches many times.

â€¢ 2.All the matrix-vector computation are computed using CUBLAS, but variable definition and saving, and checking the condition are done on CPU.

Some explanations of the code:

â€¢ There are two loops here: outer loop is run through segment, inner loop is through iteration. (for the 7-15 speedup, outer loop run once and inner loop run 12 times)

â€¢ A vector called h_V0_x will be reset after each inner loop by a scalar called vx0_segment.

â€¢ The inner loop called a **global** static void FunctionEvaluation, which computes some forces on N+1 threads, and this is the place to hope to get GPU benefits the most.

â€¢ After getting d_Vxnew, its difference from the old value d_Vxold is computed, and this difference is used as the condition of two branches. There is another embedded branch.

â€¢ After existing the inner loop, a new value of vx0_segment is set. A vector called h_a_x is computed and saved in a larger matrix called a_x_his.

Thank you very much for any comment or suggestion!

```
for (segMent=1;segMent<Nsegment+1;segMent++){
for (i=0;i<=N;i++){
h_V0_x[i]=vx0_segment;%%%%define a constant vector, reset vx0_segment at each iteartion
}
status = cublasSetVector(dimension, sizeof(h_X0_x[0]), h_X0_x, 1, d_X0_x, 1);
for(iterations=1;iterations<maxIteration;iterations++){
NumFuncCall= NumFuncCall+1;
FunctionEvaluation<<<Block_Num,Thread_Num,0>>>(d_Xold,d_Yold,d_Zold,d_fvx,d_fvy,d_fvz);%%% FunctionEvaluation is __global__ static void
cublasScopy(N+1,d_V0_x,1,d_VXnew,1);%%%d_VXnew=d_V0_x;
cublasSgemv('n', dimension, dimension, 1, d_Cbf, dimension, d_fvx, 1, 1, d_VXnew, 1);%%d_VXnew=d_Cbf*d_fvx+d_VXnew;
cublasScopy(N+1,d_X0_x,1,d_Xold,1);%%d_Xold=d_X0_x;
cublasSaxpy(dimension,-1,d_VXnew,1,d_VXold,1);%%d_VXold=d_VXold-d_VXnew
maxIndex_vx=cublasIsamax (dimension, d_VXold, 1)-1;%%%find the max of the magnitude
status = cublasGetVector(dimension, sizeof(h_VXold[0]), d_VXold, 1, h_VXold, 1);
maxE=abs(h_VXold[maxIndex_vx]);
if(iterations==1){
e_old=maxE;
cublasScopy(N+1,d_VXnew,1,d_VXold,1);%%d_VXold=d_VXnew
cublasSgemv('n', dimension, dimension, 1, d_CbfT2tao, dimension, d_VXnew, 1, 1, d_Xold, 1); }%%d_Xold=d_CbfT2tao*d_VXnew+d_Xold
else{
e_new=maxE;
cublasScopy(N+1,d_VXnew,1,d_VXold,1);
cublasSgemv('n', dimension, dimension, 1, d_CbfT2tao, dimension, d_VXnew, 1, 1, d_Xold, 1);
if(e_new<deltaLimit && e_old<deltaLimit){
break;
else
e_old=e_new;
}
}
status = cublasGetVector(N+1, sizeof(h_Xold[0]), d_Xold, 1, h_Xold, 1);%%%after existing the iteration, reset vx0_segment and compute/save h_a_x
x0_segment=h_Xold[0];
cublasSgemv('n', dimension, dimension, 1, d_TV, dimension, d_Xold, 1, 0, d_a_x, 1);%%%d_a_x=d_TV*d_Xold;
status = cublasGetVector(N+1, sizeof(h_a_x[0]), d_a_x, 1, h_a_x, 1);
for(i=0;i<=N;i++){
a_x_his[(N+1)*(segMent-1)+i]=h_a_x[i];%%%save h_a_x of each segment
}
if(iterations==maxIteration){
printf("max iteration did not converge");
break;
}
}
```