shared memory in cuda fortran and increasing time of process

hello …

this is my code that i bring it below … my problem is that when i transfer phi0 matrix from global to shared memory i’ll get correct answer but the time of processing is going to increased that is not what i’m expect because as we know the time for accessing for shared memory is less than global memory …

i’ll be appreciate that if you tell me why and where is my problem …

this is my module code :

module simpleOps_m 
use cudafor 


attributes(global) subroutine inc(phip1,phim1,phi0,coef,rho0, ds,N_z,N_rho) 

implicit none 

! matrix size in (N_z=2001,N_rho=6001) 

real*8  :: phim1(N_z,N_rho),phi0(N_z,N_rho),coef(N_z,N_rho),rho0(N_z,N_rho) 
real*8  :: phip1(N_z,N_rho) 
real*8 , shared :: phi0_s(0:33,0:33) 

real*8, value :: ds 

integer , value :: N_z,N_rho 

integer :: i, j , is , js 

is = threadIdx%x 
js = threadIdx%y 

i = (blockIdx%x-1)*blockDim%x + is 
j = (blockIdx%y-1)*blockDim%y + js 

if( .AND. i.le.N_z .AND. .AND. j.le.N_rho) then 

if (is .eq.1 .and. then 
else if (is .eq. 32 .and. then 
end if 

if (js .eq. 1 .and. 
else if (js .eq. 32 .and. then 

end if 

if (1<i .and. i<N_z .and. 1<j .and. j<N_rho) then 

     phip1(i,j)=-phim1(i,j)+coef(i,j)*((-4.0d0+2.0d0/coef(i,j))*phi0_s(is,js) & 
                           +phi0_s(is-1,js) & !...%down    to center 
                           +phi0_s(is+1,js) & !...%up  to center 
          +(1.0d0-ds/(2.0d0*rho0(i,j)))*phi0_s(is,js-1) & !...%left  to center 

end if 

end subroutine inc 
end module simpleOps_m 


and i invoked it by this code :

call inc<<<grid,tBlock>>>(phip1_d,phim1_d,phi0_d,coef_d,rho0_d, ds,N_z,N_rho)

and my grid and block size is :

grid = dim3(ceiling(real(N_z)/32), ceiling(real(N_rho)/32), 1) 
tBlock = dim3(32,32,1)

best regard