Using shared memory in my CUDA code does not decrease execution time

The most time-consuming part of my code is shown below.

When I use global memory, the execution time is lower than when I use shared memory.

attributes(global) subroutine inc(p1, m, p, ds, r, c)

  implicit none

  real(fp_kind), dimension(N_z,N_rho) :: m, p, p1
  real(fp_kind), dimension(1,N_rho)   :: r
  real(fp_kind), dimension(N_z,1)     :: c
  real(fp_kind), value                :: ds

  integer :: i, j, is, js, iter

  real(fp_kind), shared, dimension(0:(Bdimy+1),0:(Bdimx+1)) :: p_s

  is = threadIdx%x
  js = threadIdx%y

  i = (blockIdx%x-1)*blockDim%x + is
  j = (blockIdx%y-1)*blockDim%y + js

  ! transfer this block's tile of p to shared memory
  if (i >= 1 .and. i <= N_z .and. j >= 1 .and. j <= N_rho) then
     p_s(is,js) = p(i,j)
  end if

  ! edge threads also load the halo cells along the first dimension
  if (is == 1 .and. i > 1) then
     p_s(is-1,js) = p(i-1,j)
  else if (is == Bdimy .and. i < N_z) then
     p_s(is+1,js) = p(i+1,j)
  end if

  ! and along the second dimension
  if (js == 1 .and. j > 1) then
     p_s(is,js-1) = p(i,j-1)
  end if
  if (js == Bdimx .and. j < N_rho) then
     p_s(is,js+1) = p(i,j+1)
  end if

  do iter = 1, 100
     if (1 < i .and. i < N_z .and. 1 < j .and. j < N_rho) then
        p1(i,j) = -m(i,j) + c(i,1)*((-4._fp_kind + 2._fp_kind/c(i,1))*p_s(is,js) &
                     + p_s(is-1,js)                                       & ! down  to center
                     + p_s(is+1,js)                                       & ! up    to center
                     + (1._fp_kind - ds/(2._fp_kind*r(1,j)))*p_s(is,js-1) & ! left  to center
                     + (1._fp_kind + ds/(2._fp_kind*r(1,j)))*p_s(is,js+1))  ! right to center
     end if
  end do

  return
end subroutine inc

This kernel is launched with:

call inc<<<grid,tBlock>>>(p1_d, m_d, p_d, ds, r_d, c_d)

The threads per block in my code are (Bdimy=128, Bdimx=1).
I put a 1000-iteration loop around the kernel's computation to increase the amount of computation and the use of shared memory, but it does not help: global memory is still faster than shared memory.
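
For reference, grid and tBlock are set up for these dimensions roughly as below; the ceiling-style grid sizing shown here is illustrative, and my actual host code may differ:

type(dim3) :: grid, tBlock

tBlock = dim3(Bdimy, Bdimx, 1)                  ! 128 x 1 threads per block
grid   = dim3((N_z   + Bdimy - 1)/Bdimy, &      ! enough blocks to cover the N_z rows
              (N_rho + Bdimx - 1)/Bdimx, 1)     ! and the N_rho columns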

In both cases (global and shared memory) the code gives correct results; my problem is only the execution time.

Where is the problem? Is there any way to decrease the execution time of this code?

Best regards,

Hi ali,

My best guess, without a reproducible example, is that the automatic hardware caching is working well for your case, without the extra code overhead needed to stage data in shared memory. This might change if you used a 16x16 block size instead of 128x1, since with 128x1 the threads access most of the data contiguously (so global loads coalesce well and shared memory adds little reuse).
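
For example, a 16x16 configuration would be set up along these lines (illustrative only; the shared-memory array bounds and halo logic in the kernel would need to change to match):

tBlock = dim3(16, 16, 1)                           ! 16 x 16 threads per block
grid   = dim3((N_z + 15)/16, (N_rho + 15)/16, 1)
call inc<<<grid,tBlock>>>(p1_d, m_d, p_d, ds, r_d, c_d)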

Using a profiler such as PGPROF or NVPROF and comparing the two profiles will give you a more definitive answer as to the difference.
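
For example, from the command line (the executable names are placeholders for your two builds):

nvprof ./inc_global      # kernel timings for the global-memory version
nvprof ./inc_shared      # kernel timings for the shared-memory version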

Note that your code needs a call to “syncthreads” after you update the shared memory. Without it there’s potential for a race condition.
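
In the kernel above that means placing the barrier between the last write to p_s and the first read of it, something like:

! ... center and halo writes to p_s ...

call syncthreads()   ! make all writes to p_s visible to every thread in the block

do iter = 1,100
   ! ... stencil computation that reads p_s ...
end do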

-Mat