Shared memory in CUDA Fortran

Hello …

I'd appreciate it if you could tell me where my problem is …

Here is part of a simple code where I use shared memory to speed up processing, but I get the wrong answer.

It seems that phi0 doesn't shift in the array (phi0(is,js+1) is equal to phi0(is,js) in my code). For simplicity I used thread blocks of dimension (32,32).

When I use global memory my answer is correct, but if I move phi0 to shared memory my answer is wrong …

module simpleOps_m
use cudafor


attributes(global) subroutine inc(phip1,phim1,phi0,coef,rho0, ds,N_z,N_rho)

implicit none

! matrix size in (N_z=2016,N_rho=6016)

real*8 , device :: phim1(:,:),phi0(:,:),coef(:,:),rho0(:,:)
real*8 , intent(out) :: phip1(:,:)
real*8 , shared :: phi0_s(0:33,0:33)

real*8, value :: ds

integer , value :: N_z,N_rho

integer :: i, j , is , js , k , m

is = threadIdx%x
js = threadIdx%y

i = (blockIdx%x-1)*blockDim%x + is
j = (blockIdx%y-1)*blockDim%y + js

if (i>1 .AND. i<N_z .AND. j>1 .AND. j<N_rho) then

    call syncthreads()

    ! load the interior point, then the halo cells at the block edges
    phi0_s(is,js) = phi0(i,j)

    if (is .eq. 1) then
        phi0_s(is-1,js) = phi0(i-1,j)
    else if (is .eq. 32) then
        phi0_s(is+1,js) = phi0(i+1,j)
    end if

    if (js .eq. 1) then
        phi0_s(is,js-1) = phi0(i,j-1)
    else if (js .eq. 32) then
        phi0_s(is,js+1) = phi0(i,j+1)
    end if

end if

    call syncthreads()

if (1<i .and. i<N_z .and. 1<j .and. j<N_rho) then

phip1(i,j)=-phim1(i,j)+coef(i,j)*((-4.0d0+2.0d0/coef(i,j))*phi0_s(is,js) &
                          +phi0_s(is-1,js) &                !...%down  to center
                          +phi0_s(is+1,js) &                !...%up    to center
          +(1.0d0-ds/(2.0d0*rho0(i,j)))*phi0_s(is,js-1) &   !...%left  to center
          +(1.0d0+ds/(2.0d0*rho0(i,j)))*phi0_s(is,js+1))    !...%right to center

end if

end subroutine inc
end module simpleOps_m


My grid and block sizes are:

grid = dim3(ceiling(real(nx)/32), ceiling(real(ny)/32), 1)
tBlock = dim3(32,32,1)
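For the matrix extents given in the kernel comment (N_z=2016, N_rho=6016), and assuming nx and ny correspond to those extents, the launch configuration works out as follows (a quick Python sketch, just to check the arithmetic):

```python
import math

# Assumed from the kernel comment: nx = N_z = 2016, ny = N_rho = 6016
nx, ny = 2016, 6016
bx, by = 32, 32                 # 32x32 thread block, as in tBlock

grid_x = math.ceil(nx / bx)     # blocks along x
grid_y = math.ceil(ny / by)     # blocks along y
print(grid_x, grid_y)           # 63 188 -> the grid covers the matrix exactly
```

Both extents happen to be multiples of 32, so no partially filled blocks are launched.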

and I invoke the inc kernel with:

call inc<<<grid,tBlock>>>(phip1_d,phim1_d,phi0_d,coef_d,rho0_d, ds,N_z,N_rho)

Best regards, @@miri@@

Hi @@miri@@,

Shouldn’t the first if statement really be something like the following so the edges of the array are loaded into shared memory as well? Then add the additional conditionals for bounds checking the phi0 array.

if (i.ge.1 .AND. i.le.N_z .AND. j.ge.1 .AND. j.le.N_rho) then

    phi0_s(is,js) = phi0(i,j)

    if (is .eq. 1 .and. i .gt. 1) then
        phi0_s(is-1,js) = phi0(i-1,j)
    else if (is .eq. 32 .and. i .lt. N_z) then
        phi0_s(is+1,js) = phi0(i+1,j)
    end if

    if (js .eq. 1 .and. j .gt. 1) then
        phi0_s(is,js-1) = phi0(i,j-1)
    else if (js .eq. 32 .and. j .lt. N_rho) then
        phi0_s(is,js+1) = phi0(i,j+1)
    end if

end if

call syncthreads()
  • Mat

I made the changes, but unfortunately it still doesn't work …

Hi @@miri@@,

Do you get any error when launching your kernel?

  call inc<<<grid,tBlock>>>(phip1_d,phim1_d,phi0_d,coef_d,rho0_d, ds,N_z,N_rho)

  ierrSync = cudaGetLastError()
  ierrAsync = cudaDeviceSynchronize()
  if (ierrSync /= cudaSuccess) write(*,*) &
   "Sync kernel error:", cudaGetErrorString(ierrSync)
  if (ierrAsync /= cudaSuccess) write(*,*) &
   "Async kernel error:", cudaGetErrorString(ierrAsync)

I wrote up an example and got the following error:

Sync kernel error:
too many resources requested for launch

The problem is that with 1024 threads per block and 9248 bytes of shared memory per block, the kernel requests more resources (threads, registers, and shared memory combined) than are available for a single block, so the launch fails.

Can you try using a 32x16 thread block size?
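As a quick check, the shared-array footprint for each block size can be computed from the phi0_s bounds (a Python sketch; the 32x16 array shape is an assumed (0:33,0:17) layout with a one-cell halo):

```python
# phi0_s(0:33,0:33) holds 34 x 34 double-precision (real*8) values
shared_bytes_32x32 = 34 * 34 * 8
print(shared_bytes_32x32)        # 9248 bytes per block

# A 32x16 block with a one-cell halo would need phi0_s(0:33,0:17),
# i.e. 34 x 18 real*8 values -- and half as many threads per block
shared_bytes_32x16 = 34 * 18 * 8
print(shared_bytes_32x16)        # 4896 bytes per block
```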

  • Mat

Thanks for your reply, Mat …

Now it works correctly …

I have another question: why did the processing time increase compared with using global memory for phi0?