Shared memory in CUDA Fortran and increased processing time

Hello,

My code is below. The problem is that when I move the phi0 matrix from global into shared memory I still get the correct answer, but the processing time increases. That is not what I expected, since accessing shared memory should be faster than accessing global memory.

I would appreciate it if you could tell me why this happens and where my mistake is.

This is my module code:

module simpleOps_m
use cudafor

contains

attributes(global) subroutine inc(phip1,phim1,phi0,coef,rho0, ds,N_z,N_rho)

implicit none

! scalar arguments passed by value; declared before the arrays so they
! can be used as array bounds
real*8 , value  :: ds
integer , value :: N_z, N_rho      ! matrix size is N_z=2001, N_rho=6001

! device arrays
real*8 :: phim1(N_z,N_rho), phi0(N_z,N_rho), coef(N_z,N_rho), rho0(N_z,N_rho)
real*8 :: phip1(N_z,N_rho)

! 32x32 tile of phi0 plus a one-cell halo on each side
real*8 , shared :: phi0_s(0:33,0:33)

integer :: i, j, is, js


is = threadIdx%x
js = threadIdx%y

i = (blockIdx%x-1)*blockDim%x + is
j = (blockIdx%y-1)*blockDim%y + js

! load this thread's element of the 32x32 tile
if (i.le.N_z .and. j.le.N_rho) then
   phi0_s(is,js) = phi0(i,j)
end if

! load the halo cells in the z direction; j must also be checked so the
! partial blocks at the far edge of the grid do not read outside phi0
if (j.le.N_rho) then
   if (is.eq.1 .and. i.gt.1) then
      phi0_s(is-1,js) = phi0(i-1,j)
   else if (is.eq.32 .and. i.lt.N_z) then
      phi0_s(is+1,js) = phi0(i+1,j)
   end if
end if

! load the halo cells in the rho direction, guarding i the same way
if (i.le.N_z) then
   if (js.eq.1 .and. j.gt.1) then
      phi0_s(is,js-1) = phi0(i,j-1)
   else if (js.eq.32 .and. j.lt.N_rho) then
      phi0_s(is,js+1) = phi0(i,j+1)
   end if
end if

! make sure the whole tile, including the halos, is in shared memory
! before any thread reads its neighbours' values
call syncthreads()

! update the interior points using the shared tile
if (1<i .and. i<N_z .and. 1<j .and. j<N_rho) then

   phip1(i,j) = -phim1(i,j) + coef(i,j)*((-4.0d0+2.0d0/coef(i,j))*phi0_s(is,js) &
                + phi0_s(is-1,js)                              & !...%down  to center
                + phi0_s(is+1,js)                              & !...%up    to center
                + (1.0d0-ds/(2.0d0*rho0(i,j)))*phi0_s(is,js-1) & !...%left  to center
                + (1.0d0+ds/(2.0d0*rho0(i,j)))*phi0_s(is,js+1))  !...%right to center

end if

return
end subroutine inc
end module simpleOps_m


!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

and I invoke the kernel like this:

call inc<<<grid,tBlock>>>(phip1_d,phim1_d,phi0_d,coef_d,rho0_d, ds,N_z,N_rho)

and my grid and block dimensions are:

grid = dim3(ceiling(real(N_z)/32), ceiling(real(N_rho)/32), 1)
tBlock = dim3(32,32,1)
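
For completeness, a minimal host driver around this launch would look something like the sketch below (the parts that set ds and fill the host arrays are omitted):

program main
   use cudafor
   use simpleOps_m
   implicit none

   integer, parameter :: N_z = 2001, N_rho = 6001
   real*8 :: ds
   real*8, allocatable         :: phip1(:,:), phim1(:,:), phi0(:,:), coef(:,:), rho0(:,:)
   real*8, device, allocatable :: phip1_d(:,:), phim1_d(:,:), phi0_d(:,:), coef_d(:,:), rho0_d(:,:)
   type(dim3) :: grid, tBlock

   allocate(phip1(N_z,N_rho), phim1(N_z,N_rho), phi0(N_z,N_rho), coef(N_z,N_rho), rho0(N_z,N_rho))
   allocate(phip1_d(N_z,N_rho), phim1_d(N_z,N_rho), phi0_d(N_z,N_rho), coef_d(N_z,N_rho), rho0_d(N_z,N_rho))

   ! ... set ds and fill phim1, phi0, coef, rho0 on the host here ...

   ! host-to-device copies by array assignment
   phim1_d = phim1
   phi0_d  = phi0
   coef_d  = coef
   rho0_d  = rho0

   grid   = dim3(ceiling(real(N_z)/32), ceiling(real(N_rho)/32), 1)
   tBlock = dim3(32,32,1)
   call inc<<<grid,tBlock>>>(phip1_d,phim1_d,phi0_d,coef_d,rho0_d, ds,N_z,N_rho)

   ! copy the result back to the host
   phip1 = phip1_d

end program main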

Best regards,
MIRI

Hi Miri,

Shared memory does have some overhead due to the set-up code. In this case, the overhead is greater than the benefit of the improved data access.
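
For reference, I assume the global-memory-only version of the kernel (the one this is being compared against) reads phi0 directly for each neighbour, something like:

! global-memory version of the same update: every value comes straight
! from phi0, so there is no tile to fill and no synchronization
if (1<i .and. i<N_z .and. 1<j .and. j<N_rho) then
   phip1(i,j) = -phim1(i,j) + coef(i,j)*((-4.0d0+2.0d0/coef(i,j))*phi0(i,j) &
                + phi0(i-1,j) &
                + phi0(i+1,j) &
                + (1.0d0-ds/(2.0d0*rho0(i,j)))*phi0(i,j-1) &
                + (1.0d0+ds/(2.0d0*rho0(i,j)))*phi0(i,j+1))
end if

Each phi0 element is read at most five times per launch, and on recent GPUs the hardware caches already capture much of that reuse, so the shared tile saves only a handful of global loads while adding the halo loads and the barrier.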

For shared memory to pay off here, you'll need to reuse the data in shared memory more. For example, I put a 100-iteration loop around the kernel's computation, and the shared memory version was then about 4x faster.
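
Something along these lines (the loop and its counter "it" are additions of mine purely for timing; the repeated iterations compute the same value and are only there to amortize the one-time cost of filling phi0_s):

! with "integer :: it" added to the kernel's local declarations
if (1<i .and. i<N_z .and. 1<j .and. j<N_rho) then
   do it = 1, 100
      phip1(i,j) = -phim1(i,j) + coef(i,j)*((-4.0d0+2.0d0/coef(i,j))*phi0_s(is,js) &
                   + phi0_s(is-1,js) &
                   + phi0_s(is+1,js) &
                   + (1.0d0-ds/(2.0d0*rho0(i,j)))*phi0_s(is,js-1) &
                   + (1.0d0+ds/(2.0d0*rho0(i,j)))*phi0_s(is,js+1))
   end do
end if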

  • Mat