Shared memory in CUDA Fortran

Hello …

I'd appreciate it if you could tell me where my problem is …

Here is part of a simple code where I use shared memory to speed up processing, but I get the wrong answer.

It seems that phi0 doesn't shift in the array (phi0(is,js+1) is equal to phi0(is,js) in my code). For simplicity I used thread blocks of dimension (32,32).

When I use global memory my answer is correct, but if I move phi0 to shared memory my answer is wrong …

module simpleOps_m
use cudafor


attributes(global) subroutine inc(phip1,phim1,phi0,coef,rho0, ds,N_z,N_rho)

implicit none

! matrix size in (N_z=2016,N_rho=6016)

real*8 , device :: phim1(:,:),phi0(:,:),coef(:,:),rho0(:,:)
real*8 , intent(out) :: phip1(:,:)
real*8 , shared :: phi0_s(0:33,0:33)

real*8, value :: ds

integer , value :: N_z,N_rho

integer :: i, j , is , js , k , m

is = threadIdx%x
js = threadIdx%y

i = (blockIdx%x-1)*blockDim%x + is
j = (blockIdx%y-1)*blockDim%y + js

if (i>1 .AND. i<N_z .AND. j>1 .AND. j<N_rho) then

    call syncthreads()

    ! load the interior point, then the halo cells at the block edges
    phi0_s(is,js) = phi0(i,j)

    if (is .eq. 1) then
        phi0_s(is-1,js) = phi0(i-1,j)
    else if (is .eq. 32) then
        phi0_s(is+1,js) = phi0(i+1,j)
    end if

    if (js .eq. 1) then
        phi0_s(is,js-1) = phi0(i,j-1)
    else if (js .eq. 32) then
        phi0_s(is,js+1) = phi0(i,j+1)
    end if

end if

    call syncthreads()

if (1<i .and. i<N_z .and. 1<j .and. j<N_rho) then

phip1(i,j)=-phim1(i,j)+coef(i,j)*((-4.0d0+2.0d0/coef(i,j))*phi0_s(is,js) &
                          +phi0_s(is-1,js) &                !...%down  to center
                          +phi0_s(is+1,js) &                !...%up    to center
          +(1.0d0-ds/(2.0d0*rho0(i,j)))*phi0_s(is,js-1) &   !...%left  to center
          +(1.0d0+ds/(2.0d0*rho0(i,j)))*phi0_s(is,js+1))    !...%right to center

end if

end subroutine inc
end module simpleOps_m


My grid and block sizes are:

grid = dim3(ceiling(real(nx)/32), ceiling(real(ny)/32), 1)
tBlock = dim3(32,32,1)
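For the matrix extents given in the kernel comment (N_z=2016, N_rho=6016), and assuming nx and ny correspond to those extents, the launch configuration works out as follows (a quick Python sketch, just to check the arithmetic):

```python
import math

# Assumed from the kernel comment: nx = N_z = 2016, ny = N_rho = 6016
nx, ny = 2016, 6016
bx, by = 32, 32                 # 32x32 thread block, as in tBlock

grid_x = math.ceil(nx / bx)     # blocks along x
grid_y = math.ceil(ny / by)     # blocks along y
print(grid_x, grid_y)           # 63 188 -> the grid covers the matrix exactly
```

Both extents happen to be multiples of 32, so no partially filled blocks are launched.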

and I invoke the inc kernel with:

call inc<<<grid,tBlock>>>(phip1_d,phim1_d,phi0_d,coef_d,rho0_d, ds,N_z,N_rho)

Best regards, @@miri@@

Hi @@miri@@,

Shouldn’t the first if statement really be something like the following so the edges of the array are loaded into shared memory as well? Then add the additional conditionals for bounds checking the phi0 array.

if (i.ge.1 .AND. i.le.N_z .AND. j.ge.1 .AND. j.le.N_rho) then

    phi0_s(is,js) = phi0(i,j)

    if (is .eq. 1 .and. i .gt. 1) then
        phi0_s(is-1,js) = phi0(i-1,j)
    else if (is .eq. 32 .and. i .lt. N_z) then
        phi0_s(is+1,js) = phi0(i+1,j)
    end if

    if (js .eq. 1 .and. j .gt. 1) then
        phi0_s(is,js-1) = phi0(i,j-1)
    else if (js .eq. 32 .and. j .lt. N_rho) then
        phi0_s(is,js+1) = phi0(i,j+1)
    end if

end if

call syncthreads()
  • Mat

I made the changes, but unfortunately it still doesn't work …

Hi @@miri@@,

Do you get any error when launching your kernel?

  call inc<<<grid,tBlock>>>(phip1_d,phim1_d,phi0_d,coef_d,rho0_d, ds,N_z,N_rho)

  ierrSync = cudaGetLastError()
  ierrAsync = cudaDeviceSynchronize()
  if (ierrSync /= cudaSuccess) write(*,*) &
   "Sync kernel error:", cudaGetErrorString(ierrSync)
  if (ierrAsync /= cudaSuccess) write(*,*) &
   "Async kernel error:", cudaGetErrorString(ierrAsync)

I wrote up an example and got the following error:

Sync kernel error:
too many resources requested for launch

The problem is that with 1024 threads per block and 9248 bytes of shared memory per block, the kernel requests more resources (threads, registers, and shared memory combined) than are available for a single block, so the launch fails.

Can you try using a 32x16 thread block size?
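As a quick check, the shared-array footprint for each block size can be computed from the phi0_s bounds (a Python sketch; the 32x16 array shape is an assumed (0:33,0:17) layout with a one-cell halo):

```python
# phi0_s(0:33,0:33) holds 34 x 34 double-precision (real*8) values
shared_bytes_32x32 = 34 * 34 * 8
print(shared_bytes_32x32)        # 9248 bytes per block

# A 32x16 block with a one-cell halo would need phi0_s(0:33,0:17),
# i.e. 34 x 18 real*8 values -- and half as many threads per block
shared_bytes_32x16 = 34 * 18 * 8
print(shared_bytes_32x16)        # 4896 bytes per block
```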

  • Mat

Thanks for your reply, Mat …

Now it works correctly …

I have another question: why did the processing time increase compared with using global memory for phi0?