The most time-consuming part of my code is shown below.

When I use global memory, the execution time is lower than when I use shared memory.

```
attributes(global) subroutine inc(p1,m,p, ds,r,c)
implicit none
real(fp_kind),dimension(N_z,N_rho) :: m,p,p1
real(fp_kind),dimension(1,N_rho) ::r
real(fp_kind),dimension(N_z,1) ::c
real(fp_kind), value :: ds
integer :: i, j , is , js , iter
real(fp_kind) , shared, dimension(0:(Bdimy+1),0:(Bdimx+1)) :: p_s
is = threadIdx%x
js = threadIdx%y
i = (blockIdx%x-1)*blockDim%x + is
j = (blockIdx%y-1)*blockDim%y + js
!!!!!!!!!! transfer p to shared memory
if(i>=1 .AND. i<=N_z .AND. j>=1 .AND. j<=N_rho) then
p_s(is,js)=p(i,j)
endif
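! Halo cells in the first index: use two separate tests so that a thread
! that is both first and last in the block still loads both neighbours
if (is==1 .and. i>1) then
p_s(is-1,js)=p(i-1,j)
end if
if (is==Bdimy .and. i<N_z) then
p_s(is+1,js)=p(i+1,j)
end if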
if (js==1 .and. j>1)then
p_s(is,js-1)=p(i,j-1)
end if
if (js==Bdimx .and. j<N_rho) then
p_s(is,js+1)=p(i,j+1)
end if
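! Barrier: every thread must finish writing p_s before any thread reads
! a neighbour's entry; without it the reads below race with the writes
call syncthreads()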
do iter = 1,100
if (1<i .and. i<N_z .and. 1<j .and. j<N_rho) then
p1(i,j)=-m(i,j)+c(i,1)* ((-4._fp_kind+2._fp_kind/c(i,1))*p_s(is,js) &
+p_s(is-1,js) & !...%down to center
+p_s(is+1,js) & !...%up to center
+(1._fp_kind-ds/(2._fp_kind*r(1,j)))*p_s(is,js-1) & !...%left to center
+(1._fp_kind+ds/(2._fp_kind*r(1,j)))*p_s(is,js+1))
end if
end do
return
end subroutine inc
```

The kernel is launched with this call:

```
call inc<<<grid,tBlock>>>(p1_d,m_d,p_d, ds,r_d,c_d)
```

The number of threads per block in my code is (Bdimy=128, Bdimx=1).
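The host-side definitions of `grid` and `tBlock` are not shown in the post. A minimal sketch of how they could be set up for this block shape follows, assuming the usual round-up of the grid size. The values of `N_z` and `N_rho` are placeholders, not the sizes actually used.

```fortran
program launch_config
  use cudafor
  implicit none
  ! Hypothetical problem size; the post does not state N_z or N_rho.
  integer, parameter :: N_z = 1024, N_rho = 512
  ! Block shape as stated in the post.
  integer, parameter :: Bdimy = 128, Bdimx = 1
  type(dim3) :: grid, tBlock

  ! threadIdx%x indexes the first (N_z) dimension in the kernel,
  ! so Bdimy goes into the x slot of the block.
  tBlock = dim3(Bdimy, Bdimx, 1)

  ! Round up so that partial blocks at the domain edge are still launched.
  grid = dim3((N_z + Bdimy - 1)/Bdimy, (N_rho + Bdimx - 1)/Bdimx, 1)

  print *, 'threads per block :', tBlock%x, tBlock%y
  print *, 'blocks in grid    :', grid%x, grid%y
  print *, 'shared tile (elts):', (Bdimy + 2)*(Bdimx + 2)
end program launch_config
```

With this shape the shared tile `p_s` holds (128+2)*(1+2) = 390 elements. Because Bdimx=1, every thread satisfies both `js==1` and `js==Bdimx`, so each thread loads its own `j-1` and `j+1` halo values from global memory. The tile can therefore only provide reuse along the first index.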

I put a 1000-iteration loop around the kernel's computation to increase the amount of computation and the reuse of shared memory, but it does not help: the global-memory version is still faster than the shared-memory version.

In both cases (global and shared memory) I get the correct answer from my code; my problem is only the execution time.
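The global-memory version being compared against is not shown. A minimal sketch of what such a kernel could look like follows, assuming it applies the same stencil and reads `p` directly. The module name, the kernel name `inc_global`, and the definitions of `fp_kind`, `N_z` and `N_rho` are assumptions made for illustration.

```fortran
module inc_global_mod
  use cudafor
  implicit none
  ! Assumed definitions; the post does not show them.
  integer, parameter :: fp_kind = kind(1.0d0)
  integer, parameter :: N_z = 1024, N_rho = 512
contains
  attributes(global) subroutine inc_global(p1, m, p, ds, r, c)
    real(fp_kind), dimension(N_z,N_rho) :: m, p, p1
    real(fp_kind), dimension(1,N_rho)   :: r
    real(fp_kind), dimension(N_z,1)     :: c
    real(fp_kind), value :: ds
    integer :: i, j

    ! Global indices of the point handled by this thread
    i = (blockIdx%x-1)*blockDim%x + threadIdx%x
    j = (blockIdx%y-1)*blockDim%y + threadIdx%y

    ! Same five-point stencil as in inc, reading p from global memory
    if (1<i .and. i<N_z .and. 1<j .and. j<N_rho) then
      p1(i,j) = -m(i,j) + c(i,1)*((-4._fp_kind + 2._fp_kind/c(i,1))*p(i,j) &
              + p(i-1,j)                                                   &
              + p(i+1,j)                                                   &
              + (1._fp_kind - ds/(2._fp_kind*r(1,j)))*p(i,j-1)             &
              + (1._fp_kind + ds/(2._fp_kind*r(1,j)))*p(i,j+1))
    end if
  end subroutine inc_global
end module inc_global_mod
```

In this form each interior thread issues five reads of `p`. Consecutive threads along `i` read consecutive addresses, so the `p(i-1,j)`, `p(i,j)` and `p(i+1,j)` accesses are coalesced.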

Where is the problem? Is there any way to decrease the execution time of this code?

Best regards