Shared memory in CUDA Fortran and increased processing time

Hello,

My code is below. My problem is that when I move the phi0 matrix from global to shared memory, I still get the correct answer, but the processing time increases. That is not what I expected, since shared memory access should be faster than global memory access.

I would appreciate it if you could tell me why, and where my problem is.

This is my module code:

module simpleOps_m
use cudafor


attributes(global) subroutine inc(phip1,phim1,phi0,coef,rho0, ds,N_z,N_rho)

implicit none

! matrix sizes: N_z = 2001, N_rho = 6001

real*8  :: phim1(N_z,N_rho),phi0(N_z,N_rho),coef(N_z,N_rho),rho0(N_z,N_rho)
real*8  :: phip1(N_z,N_rho)
real*8 , shared :: phi0_s(0:33,0:33)

real*8, value :: ds

integer , value :: N_z,N_rho

integer :: i, j , is , js

is = threadIdx%x
js = threadIdx%y

i = (blockIdx%x-1)*blockDim%x + is
j = (blockIdx%y-1)*blockDim%y + js

if (i .le. N_z .and. j .le. N_rho) then

   ! load this thread's element of phi0 into the shared tile
   phi0_s(is,js) = phi0(i,j)

   ! load the halo cells along the block boundaries
   if (is .eq. 1  .and. i .gt. 1)     phi0_s(is-1,js) = phi0(i-1,j)
   if (is .eq. 32 .and. i .lt. N_z)   phi0_s(is+1,js) = phi0(i+1,j)
   if (js .eq. 1  .and. j .gt. 1)     phi0_s(is,js-1) = phi0(i,j-1)
   if (js .eq. 32 .and. j .lt. N_rho) phi0_s(is,js+1) = phi0(i,j+1)

end if

! make the tile visible to every thread in the block before reading it
call syncthreads()

if (1 < i .and. i < N_z .and. 1 < j .and. j < N_rho) then

   phip1(i,j)=-phim1(i,j)+coef(i,j)*((-4.0d0+2.0d0/coef(i,j))*phi0_s(is,js) &
                           +phi0_s(is-1,js)                      & !...%down  to center
                           +phi0_s(is+1,js)                      & !...%up    to center
          +(1.0d0-ds/(2.0d0*rho0(i,j)))*phi0_s(is,js-1)          & !...%left  to center
          +(1.0d0+ds/(2.0d0*rho0(i,j)))*phi0_s(is,js+1))           !...%right to center

end if

end subroutine inc
end module simpleOps_m


and I invoke it with this code:

call inc<<<grid,tBlock>>>(phip1_d,phim1_d,phi0_d,coef_d,rho0_d, ds,N_z,N_rho)

and my grid and block sizes are:

grid = dim3(ceiling(real(N_z)/32), ceiling(real(N_rho)/32), 1)
tBlock = dim3(32,32,1)
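For context, a minimal host-side sketch consistent with this launch might look as follows; the array initialization, the value of `ds`, and the program name are placeholders, not code from the original post (note that `(N_z+31)/32` is the integer-arithmetic equivalent of the `ceiling` call above):

```fortran
program main
  use cudafor
  use simpleOps_m
  implicit none
  integer, parameter :: N_z = 2001, N_rho = 6001
  real*8, device, allocatable :: phip1_d(:,:), phim1_d(:,:), phi0_d(:,:), &
                                 coef_d(:,:), rho0_d(:,:)
  real*8 :: ds
  type(dim3) :: grid, tBlock
  integer :: istat

  allocate(phip1_d(N_z,N_rho), phim1_d(N_z,N_rho), phi0_d(N_z,N_rho), &
           coef_d(N_z,N_rho), rho0_d(N_z,N_rho))
  ds = 1.0d-3                                   ! placeholder grid spacing

  tBlock = dim3(32, 32, 1)
  grid   = dim3((N_z+31)/32, (N_rho+31)/32, 1)  ! integer ceiling division

  call inc<<<grid,tBlock>>>(phip1_d, phim1_d, phi0_d, coef_d, rho0_d, ds, N_z, N_rho)
  istat = cudaDeviceSynchronize()   ! wait for the kernel before timing/copying
end program main
```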

Best regards,
Miri

Hi Miri,

Shared memory does have some overhead due to the set-up code. In this case, the overhead is greater than the benefit of the improved data access.

For shared memory to be a benefit here, you'll need to reuse the shared data more. For example, I put a 100-iteration loop around the kernel's computation, and in that case the shared memory version was about 4x faster.
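As a rough sketch, that loop variant could look like the fragment below inside the kernel, after the tile has been loaded and `syncthreads()` has been called; the iteration count, the `iter` variable (which would need an `integer :: iter` declaration), and the right-hand stencil term are assumptions for illustration, not code from the original post:

```fortran
! hypothetical benchmark variant: repeat the stencil update 100 times,
! so the one-time cost of filling phi0_s is amortized over many reads
do iter = 1, 100
   if (1 < i .and. i < N_z .and. 1 < j .and. j < N_rho) then
      phip1(i,j)=-phim1(i,j)+coef(i,j)*((-4.0d0+2.0d0/coef(i,j))*phi0_s(is,js) &
                  +phi0_s(is-1,js)+phi0_s(is+1,js)                             &
                  +(1.0d0-ds/(2.0d0*rho0(i,j)))*phi0_s(is,js-1)                &
                  +(1.0d0+ds/(2.0d0*rho0(i,j)))*phi0_s(is,js+1))
   end if
end do
```

Each pass re-reads the same shared tile, so the extra arithmetic costs far less than the equivalent global-memory reads would.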

- Mat