Dear All,

I need some help improving the performance of the following code. I am getting only 25% theoretical and achieved occupancy, which I believe can be increased. nvprof recommends reducing the number of registers per thread, which is currently 102. If I limit it with the flag 'maxregcount:32', both the theoretical and achieved occupancy reach 100%. However, the performance gets much worse, so I feel that this is not the way to go, but I don't know of any other approach yet.
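For reference, both occupancy figures are consistent with a simple register-limited estimate. The sketch below is only a rough check under assumed hardware limits (65536 registers and 2048 resident threads per multiprocessor, registers allocated per warp in units of 256, and a block of 128 threads); these values and the name occupancy_percent are my assumptions, not something reported by nvprof, and the per-multiprocessor block limit is ignored.

```fortran
program occupancy_estimate
  implicit none
  ! Assumed hardware limits (not taken from nvprof; check your GPU's specification)
  integer, parameter :: regs_per_sm        = 65536
  integer, parameter :: max_threads_per_sm = 2048
  integer, parameter :: warp_size          = 32
  integer, parameter :: reg_alloc_unit     = 256  ! assumed per-warp allocation granularity
  integer, parameter :: block_size         = 128  ! assumed threads per block (vector length)

  print '(a,i4,a,f6.1,a)', 'registers/thread =', 102, ' -> occupancy ', occupancy_percent(102), ' %'
  print '(a,i4,a,f6.1,a)', 'registers/thread =', 32,  ' -> occupancy ', occupancy_percent(32),  ' %'

contains

  ! Register-limited theoretical occupancy in percent (hypothetical helper)
  function occupancy_percent(regs_per_thread) result(occ)
    integer, intent(in) :: regs_per_thread
    real :: occ
    integer :: regs_per_warp, warps_per_block, regs_per_block, blocks, threads

    ! Round the per-warp register demand up to the allocation unit
    regs_per_warp   = ((regs_per_thread*warp_size + reg_alloc_unit - 1)/reg_alloc_unit)*reg_alloc_unit
    warps_per_block = (block_size + warp_size - 1)/warp_size
    regs_per_block  = regs_per_warp*warps_per_block

    ! Resident blocks are limited by registers and by the thread limit
    blocks  = min(regs_per_sm/regs_per_block, max_threads_per_sm/block_size)
    threads = blocks*block_size
    occ     = 100.0*real(threads)/real(max_threads_per_sm)
  end function occupancy_percent

end program occupancy_estimate
```

Under these assumptions, 102 registers per thread allow only 4 resident blocks (512 threads, i.e. 25%), while 32 registers per thread allow 16 blocks (2048 threads, i.e. 100%), which matches what I observe.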

Also, I have heard that performance can be improved by copying the data each block needs from global memory into that block's shared memory. However, I don't know how to do it. In my case the arrays 'zij1(:,:), plma(:,:,:,:), plmb(:,:,:,:), jint(:,:,:), basis_a(:,:,:,:), basis_b(:,:,:,:), basis_bp(:,:,:,:)' are stored in global GPU memory. How can I copy the required chunk of each array into the block's shared memory, and will it improve performance?

My task is to calculate the matrix VRtmp(:,:).

Each element is computed by the subroutine get_LR_mat_sum_pt_prj(il,jl,bigr,zz,vmatR), which is called from the following kernels region:

```
!$acc kernels
!$acc loop independent gang collapse(2)
do k=1,num_t
   do j=1,num_p
      call get_LR_mat_sum_pt_prj(j,k,bigr,z,vRtmp(j,k))
   enddo
enddo
!$acc end loop
!$acc end kernels
```

where

```
subroutine get_LR_mat_sum_pt_prj(il,jl,bigr,zz,vmatR)
   !$acc routine vector
   use data_base
   use flogs
   implicit none
   integer::mf,mi,lf,li,kf,ki,i,j,mmf,mmi
   integer,intent(in)::il,jl
   real(kind=id)::bigr,zz,difc
   real(kind=id)::tmp,fact,func,com,comres,arg,fact_obk,com_obk,com_dir,var_lam,var_mu,carg,sarg,wig_d_f,wig_d_i
   complex(kind=p4),intent(out) ::vmatR
   complex(kind=id) :: sumz,sumresz,sumz_obk,sumzj,sumreszj,sumz_obkj,phase,tmpLR
   kf=n_p(il)
   lf=l_p(il)
   mf=m_p(il)
   ki=n_t(jl)
   li=l_t(jl)
   mi=m_t(jl)
   sumz = zero
   sumz_obk = zero
   !$acc loop independent collapse(2) reduction(+:sumz,sumz_obk)
   do i=1,nglag
      !!$acc loop independent reduction(+:sumz,sumz_obk)
      do j=1,ngleg
         tmpLR=zij1(i,j)*plma(lf,mf,i,j)*plmb(li,mi,i,j)*jint(mi-mf,i,j)*basis_a(kf,lf,i,j)
         sumz=sumz+tmpLR*basis_b(ki,li,i,j)
         sumz_obk=sumz_obk+tmpLR*basis_bp(ki,li,i,j)
      enddo
      !$acc end loop
   enddo
   !$acc end loop
   fact_obk=bigr*lm_fac(lf,mf)*lm_fac(li,mi)/8._id
   fact=fact_obk*bigr/2._id
   sumz=sumz*fact*cu**(mi-mf)
   sumz_obk=sumz_obk*fact*cu**(mi-mf)
   phase=cmplx(cos(zz*(k_t(inl_t(jl))-k_p(inl_p(il)))),sin(zz*(k_t(inl_t(jl))-k_p(inl_p(il)))),id)
   vmatR=(sumz/bigr*target_charge*projectile_charge-sumz_obk*target_charge)/v*phase !was added to add 1/R
end subroutine get_LR_mat_sum_pt_prj
```