shared memory in cuda fortran

hello …

i’ll be appreciate if you tell me where is my problem …

There is part of my simple code that I use shared memory to make process time faster but I get wrong answer.

It seems that phi0 don’t shift in array(phi0(is,js+1) is equal to phi0(is,js) for my code). For simplicity I used block threads in dimension of (32,32).

when i using global memory my answer is going to be correct and if i transfer phi0 to shared memory my answer is going to be wrong …

!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
module simpleOps_m
use cudafor

contains

attributes(global) subroutine inc(phip1,phim1,phi0,coef,rho0, ds,N_z,N_rho)

implicit none

! matrix size in (N_z=2016,N_rho=6016)

real*8 , device :: phim1(:,:),phi0(:,:),coef(:,:),rho0(:,:)
real*8 , intent(out) :: phip1(:,:)
real*8 , shared :: phi0_s(0:33,0:33)

real*8, value :: ds

integer , value :: N_z,N_rho

integer :: i, j , is , js , k , m



is = threadIdx%x
js = threadIdx%y

i = (blockIdx%x-1)*blockDim%x + is
j = (blockIdx%y-1)*blockDim%y + js

if(i>1 .AND. i<N_z .AND. j>1 .AND. j<N_rho) then

phi0_s(is,js)=phi0(i,j)

    call syncthreads()

if (is .eq. 1 ) then
phi0_s(is-1,js)=phi0(i-1,j)
else if (is .eq. 32)then
phi0_s(is+1,js)=phi0(i+1,j)
end if
if (js .eq. 1 )then
phi0_s(is,js-1)=phi0(i,j-1)
else if (js .eq. 32) then
phi0_s(is,js+1)=phi0(i,j+1)

end if

end if

    call syncthreads()

if (1<i .and. i<N_z .and. 1<j .and. j<N_rho) then


phip1(i,j)=-phim1(i,j)+coef(i,j)*((-4.0d0+2.0d0/coef(i,j))*phi0_s(is,js) &
                          +phi0_s(is-1,js) & !...%down    to center
                          +phi0_s(is+1,js) & !...%up  to center
          +(1.0d0-ds/(2.0d0*rho0(i,j)))*phi0_s(is,js-1) & !...%left  to center
          +(1.0d0+ds/(2.0d0*rho0(i,j)))*phi0_s(is,js+1))

end if

end subroutine inc
end module simpleOps_m



!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

my grid and block size is :

grid = dim3(ceiling(real(nx)/32), ceiling(real(ny)/32), 1) 
              tBlock = dim3(32,32,1)

and i invoked function inc by this code :


call inc<<<grid,tBlock>>>(phip1_d,phim1_d,phi0_d,coef_d,rho0_d, ds,N_z,N_rho)






best regard @@miri@@

Hi @@miri@@,

Shouldn’t the first if statement really be something like the following so the edges of the array are loaded into shared memory as well? Then add the additional conditionals for bounds checking the phi0 array.

if(i.ge.1 .AND. i.le.N_z .AND. j.ge1 .AND. j.le.N_rho) then 
phi0_s(is,js)=phi0(i,j) 
endif

if (is .eq. 1 .and i.gt.1) then 
phi0_s(is-1,js)=phi0(i-1,j) 
else if (is .eq. 32 .and. i.lt.N_z) then 
phi0_s(is+1,js)=phi0(i+1,j) 
end if 

if (js .eq. 1 .and. js.gt.1)then 
phi0_s(is,js-1)=phi0(i,j-1) 
else if (js .eq. 32 .and. js.lt.N_rho) then 
phi0_s(is,js+1)=phi0(i,j+1) 

call syncthreads()
  • Mat

i do changes but it doesn’t work unfortunatly …

Hi @@miri@@,

Do you get any error when launching your kernel?

  call inc<<<grid,tBlock>>>(phip1_d,phim1_d,phi0_d,coef_d,rho0_d, ds,N_z,N_rho)

  ierrSync = cudaGetLastError()
  ierrAsync = cudaDeviceSynchronize()
  if (ierrSync /= cudaSuccess) write(*,*) &
   "Sync kernel error:", cudaGetErrorString(ierrSync)
  if (ierrAsync /= cudaSuccess) write(*,*) &
   "Async kernel error:", cudaGetErrorString(ierrAsync)

I wrote up an example and got the following error:

Sync kernel error:
too many resources requested for launch

The problem being that with 1024 threads each using 9248 bytes of shared memory, the program uses about double the maximum amount of shared memory available per block.

Can you try using a 32x16 thread block size?

  • Mat

thank for your reply Mat …

now it works correctly …

i have another question that why the time of processing increased to campare with when i use global memory for phi0 ?