How could I speed up this program?

the following is some part of a serial CUDA Fortran code, could someone help to speed up this program? Thanks a lot

module sor_rb_mod
use cudafor
contains
!-------------------------- Red block --------------------------------
attributes(global) subroutine red_kernel(nx,ny,pdev,hdev,dx,dy,omega)
implicit none
real,shared :: pdev(0:nx,0:ny),hdev(0:nx,0:ny)
real, value:: dx,dy,omega
integer, value :: nx, ny
integer :: i, j
real:: a,b,c,c1,c3,d,e,f,pd
i = (blockidx%x-1) * 16 + threadidx%x
j = (blockidx%y-1) * 16 + threadidx%y
if (mod((i+j),2)==0) then
c1=0.75hdev(i,j)hdev(i,j)(hdev(i+1,j)-hdev(i-1,j))
c3=hdev(i,j)**3
a=(c1+c3)dydy
b=(-c1+c3)dydy
c=c3
dxdx
d=c
e=2.0
c3*(dxdx+dydy)
f=6.0*(hdev(i+1,j)-hdev(i-1,j))dxdydy/2.0
pd=(a
pdev(i+1,j)+bpdev(i-1,j)+cpdev(i,j+1)+dpdev(i,j-1)-f)/e-pdev(i,j)
pdev(i,j)=pdev(i,j)+omega
pd
end if
end subroutine red_kernel
!-------------------------- Black block -------------------------------
attributes(global) subroutine black_kernel(nx,ny,pdev,hdev,dx,dy,omega)
implicit none
real,shared :: pdev(0:nx,0:ny),hdev(0:nx,0:ny)
real, value:: dx,dy,omega
integer, value :: nx, ny
integer :: i, j
real:: a,b,c,c1,c3,d,e,f,pd
i = (blockidx%x-1) * 16 + threadidx%x
j = (blockidx%y-1) * 16 + threadidx%y
if (mod((i+j),2)==1) then
c1=0.75hdev(i,j)hdev(i,j)(hdev(i+1,j)-hdev(i-1,j))
c3=hdev(i,j)**3
a=(c1+c3)dydy
b=(-c1+c3)dydy
c=c3
dxdx
d=c
e=2.0
c3*(dxdx+dydy)
f=6.0*(hdev(i+1,j)-hdev(i-1,j))dxdydy/2.0
pd=(a
pdev(i+1,j)+bpdev(i-1,j)+cpdev(i,j+1)+dpdev(i,j-1)-f)/e-pdev(i,j)
pdev(i,j)=pdev(i,j)+omega
pd
end if
end subroutine black_kernel
!-------------------------- Sor GPU ------------------------------------
subroutine sor2(nx,ny,p,h,dx,dy,omega,pave,tim)
implicit none
integer:: nx,ny
real:: p(0:nx,0:ny),h(0:nx,0:ny),dx,dy,omega,pave,tim
real, device, allocatable, dimension(:,:) :: pdev,hdev
real,allocatable:: p1(:,:)
type(dim3) :: dimGrid, dimBlock
integer :: i,j,r,iter,t3,t4,tr
nx=nx
allocate( pdev(0:nx,0:ny),hdev(0:nx,0:ny),p1(0:nx,0:ny))
hdev = h
p=0.0
pdev=p
dimGrid = dim3( nx/16, ny/16, 1 )
dimBlock = dim3( 16, 16, 1 )
!---------------------- GPU ---------------------------------------
call system_clock(t3,tr)
do iter=1,800
call red_kernel<<<dimGrid,dimBlock>>>(nx,ny,pdev,hdev,dx,dy,omega)
call black_kernel<<<dimGrid,dimBlock>>>(nx,ny,pdev,hdev,dx,dy,omega)
end do
r = cudathreadsynchronize()
!---------------------- End GPU ------------------------------------
call system_clock(t4,tr)
p1=pdev
pave=sum(p1)/real(nx+1)/real(ny+1)
write(*,’(a,3(1x,f12.3))’) ‘omega/pave/elapsed_time =’,omega,pave,real(t4-t3)/real(tr)
open(1,file=‘p_gpu.dat’)
do i=0,nx
write(1,’(1000(1x,f10.4))’) (p1(i,j), j=0,ny)
end do
tim=real(t4-t3)/real(tr)
deallocate( pdev, hdev)
end subroutine sor2
end module sor_rb_mod

Hi addison827,

You could probably speed things up a bit by using a local 18x18 shared array containing the values from of hdev and pdev arrays. Although you declared hdev and pdev as shared, they are really global arrays here and the shared attribute has no effect. Shared can only effects local variables which can then be stored in a SM’s local memory and shared by all threads in a block.

Note that I do see a few potential bugs in the program.

  1. You declare pdev and hdev as having a range of 0:nx and 0:ny. However, I think this should be 0:nx+1 and 0:ny+1 to account for the case when i equals nx and you’re accessing hdev(i+1,j) and when j equals ny and you’re accessing pdev(i,j+1).

  2. This code is problematic. Do you see why?

pd=(a*pdev(i+1,j)+b*pdev(i-1,j)+c*pdev(i,j+1)+d*pdev(i,j-1)-f)/e-pdev(i,j)
pdev(i,j)=pdev(i,j)+omega*pd
  • Mat