questions from a test code

[1]
Actually, for the code below I was not supposed to add acc derectives outside.And the informational message also told me that"PGF90-W-0155-Accelerator region ignored;" However, I did get a speed up without any error compared with nothing added.(from 6.84 [GFlops] to 8.89 [GFlops] ). How come?

!$acc region
do icnt=1, 999999
if( mod(icnt,100) == 0 ) print 1000, icnt, time1+dt
call cpu_diff3d(f,fn,nx,ny,nz,ce,cw,cn,cs,ct,cb,cc)
flops = flops + dble(nxnynz)13.0
f = fn
time1 = time1 + dt
if( time1+0.5
dt >= 0.1 ) exit
end do
!$acc end region
[2]
Well, am I allowed to go a little further? For this calculation, I`ve wrote the code with Gpu kernel and just acc derective both. While the speed is quite different. As I said above, the acc version got a 8.89 [GFlops] right now with a probable mistake. On the other hand, the Gpu version got a 26.57 [GFlops] easily.(the copyin part costs 3.99s while the d_f=f and d_fn=f only costs Time 1.51E-002s) Is that regular? Or does it mean that I still have a lot of rising space for my acc code. Maybe I should paste my test code below even though I know somepart looks strange. Please forgive me since I am just for test.

program test
use openacc
implicit none
integer, parameter :: nx = 128
integer, parameter :: ny = nx
integer, parameter :: nz = nx
integer, parameter :: n = nxnynz
integer :: i, j, k, stat, icnt
real(8) :: clock_start, clock_finish, &
time1, Lx, Ly, Lz, &
dx, dy, dz, dt, &
kx, ky, kz, pi, &
x, y, z, &
kappa, flops, ferr, f0, &
ce, cw, cn, cs, ct, cb, cc, &
ax, ay, az
real(8), dimension(nx,ny,nz) :: f, fn

time1 = 0.0
ferr = 0.0
pi = 4.0atan(1.0)
Lx = 1.0
Ly = 1.0
Lz = 1.0
dx = Lx/dble(nx)
dy = Ly/dble(ny)
dz = Lz/dble(nz)
kx = 2.0
pi
ky = kx
kz = kx
kappa = 0.1
flops = 0.0
dt = 0.1dxdx/kappa
ce = kappadt/(dxdx)
cw = kappadt/(dxdx)
cn = kappadt/(dydy)
cs = kappadt/(dydy)
ct = kappadt/(dzdz)
cb = kappadt/(dzdz)
cc = 1.0 - (ce + cw + cn + cs + ct + cb)
call cpu_time(clock_start)
!$acc data copyin(f,fn),create(x,y,z,ce,cw,cn,cs,ct,cb,cc)
!$acc region
do k=1, nz
do j=1, ny
do i=1, nx
x = dx*(dble(i)-0.5)
y = dy*(dble(j)-0.5)
z = dz*(dble(k)-0.5)
f(i,j,k) = 0.125*(1.0 - cos(kxx))(1.0 - cos(kyy))(1.0 - cos(kzz))
end do
end do
end do
!$acc end region
call cpu_time(clock_finish)
write(
,*) “1”, (clock_finish - clock_start)

call cpu_time(clock_start)
!!$acc region
do icnt=1, 999999
if( mod(icnt,100) == 0 ) print 1000, icnt, time1+dt
call cpu_diff3d(f,fn,nx,ny,nz,ce,cw,cn,cs,ct,cb,cc)
flops = flops + dble(nxnynz)13.0
f = fn
time1 = time1 + dt
if( time1+0.5
dt >= 0.1 ) exit
end do
!!$acc end region
call cpu_time(clock_finish)
print 1002, (clock_finish - clock_start)
print 1003, flops/(clock_finish - clock_start)*1.0e-9

ax = exp(-kappatime1(kxkx))
ay = exp(-kappa
time1*(kyky))
az = exp(-kappa
time1*(kzkz))
call cpu_time(clock_start)
!$acc region
do k=1, nz
do j=1, ny
do i=1, nx
x = dx
(dble(i)-0.5)
y = dy*(dble(j)-0.5)
z = dz*(dble(k)-0.5)
f0 = 0.125*(1.0 - axcos(kxx)) &
(1.0 - aycos(kyy)) &
(1.0 - azcos(kz
z))
ferr = ferr + (f(i,j,k) - f0)(f(i,j,k) - f0)
end do
end do
end do
!$acc end region
!$acc end data
call cpu_time(clock_finish)
write(
,) “2”, (clock_finish - clock_start)
ferr = sqrt(ferr/dble(nx
ny*nz))
print 1004, nx, ny, nz, ferr

1000 format(" ", "time(“i4”)=“f7.5)
1001 format(” “, “Elapsed Time= “1pe9.3” [sec] by GPU”)
1002 format(” “, “Elapsed Time= “1pe9.3” [sec] by CPU”)
1003 format(” “, “Performance=“f6.2” [GFlops]”)
1004 format(” ", "Error[“i4”][“i4”][“i4”]= "1pe12.6)

end


subroutine cpu_diff3d(f,fn,nx,ny,nz,ce,cw,cn,cs,ct,cb,cc)
use openacc
implicit none
integer :: nx, ny, nz
integer :: i, j, k, ie, iw, jn, js, kt, kb
real(8) :: ce, cw, cn, cs, ct, cb, cc
real(8), dimension(nx,ny,nz) :: f, fn

!$acc region
do k=1, nz
do j=1, ny
do i=1, nx
ie = i + 1
iw = i - 1
jn = j + 1
js = j - 1
kt = k + 1
kb = k - 1
if( i == nx ) ie = i
if( i == 1 ) iw = i
if( j == ny ) jn = j
if( j == 1 ) js = j
if( k == nz ) kt = k
if( k == 1 ) kb = k
fn(i,j,k) = cc*f(i,j,k) &

  • cef(ie,j,k) + cwf(iw,j,k) &
  • cnf(i,jn,k) + csf(i,js,k) &
  • ctf(i,j,kt) + cbf(i,j,kb)
    end do
    end do
    end do
    !$acc end region

return
end subroutine


[3]
For the data swap code from my Gpu kernel version, it is possible to use the pointor just like the C language? How would it be?
attributes(global) subroutine gpu_swap(d_f,d_fn,nx,ny,nz)
implicit none
integer, value :: nx, ny, nz
integer :: i, j, k
real(8), dimension(nx,ny,nz), device :: d_f, d_fn

i = threadidx%x
j = blockidx%x
k = blockidx%y
if( i < nx+1 .and. j < ny+1 .and. k < nz+1 ) then
d_f(i,j,k) = d_fn(i,j,k)
end if
return
end subroutine

#1) This section of code wont accelerate due to the function calls to cpu_diff3d, the print statement, the exit call. You could inline cpu_diff3d but can’t with the other two. It still speeds up since you have an accelerator region in cpu_diff3d and have f and fn in a data region so that they aren’t copied back and forth.

#2a) For your code, you’re close but a have a few small errors.

For the data region, remove the scalars. In particular, by putting “x”, “y” and “z” in a create statement, you’ve made them global and every thread will be using the same variable. This is giving you wrong answers.

Also, you’re mixing the PGI Accelerator Model and OpenACC syntax. It’s fine, but might get confusing and reduce your portability. I’d stick to one or the other.

It looks to me that “f” and “fn” are only used in compute regions, hence you can use the “create” clause instead of “copyin”.

Here’s the updated source:

program test
  use openacc
  implicit none
  integer, parameter :: nx = 128
  integer, parameter :: ny = nx
  integer, parameter :: nz = nx
  integer, parameter :: n = nx*ny*nz
  integer :: i, j, k, stat, icnt
  real(8) :: clock_start, clock_finish, &
       time1, Lx, Ly, Lz, &
       dx, dy, dz, dt, &
       kx, ky, kz, pi, &
       x, y, z, &
       kappa, flops, ferr, f0, &
       ce, cw, cn, cs, ct, cb, cc, &
       ax, ay, az
  real(8), dimension(nx,ny,nz) :: f, fn

  time1 = 0.0
  ferr = 0.0
  pi = 4.0*atan(1.0)
  Lx = 1.0
  Ly = 1.0
  Lz = 1.0
  dx = Lx/dble(nx)
  dy = Ly/dble(ny)
  dz = Lz/dble(nz)
  kx = 2.0*pi
  ky = kx
  kz = kx
  kappa = 0.1
  flops = 0.0
  dt = 0.1*dx*dx/kappa
  ce = kappa*dt/(dx*dx)
  cw = kappa*dt/(dx*dx)
  cn = kappa*dt/(dy*dy)
  cs = kappa*dt/(dy*dy)
  ct = kappa*dt/(dz*dz)
  cb = kappa*dt/(dz*dz)
  cc = 1.0 - (ce + cw + cn + cs + ct + cb)
  call cpu_time(clock_start)

  !$acc data create(f,fn)
  !$acc kernels
  do k=1, nz
     do j=1, ny
        do i=1, nx
           x = dx*(dble(i)-0.5)
           y = dy*(dble(j)-0.5)
           z = dz*(dble(k)-0.5)
           f(i,j,k) = 0.125*(1.0 - cos(kx*x))*(1.0 - cos(ky*y))*(1.0 - cos(kz*z))
        end do
     end do
  end do
  !$acc end kernels

  call cpu_time(clock_finish)
  write(*,*) "1", (clock_finish - clock_start)

  call cpu_time(clock_start)

  do icnt=1, 999999
     if( mod(icnt,100) == 0 ) print 1000, icnt, time1+dt
     call cpu_diff3d(f,fn,nx,ny,nz,ce,cw,cn,cs,ct,cb,cc)
     flops = flops + dble(nx*ny*nz)*13.0
     f = fn
     time1 = time1 + dt
     if( time1+0.5*dt >= 0.1 ) exit
  end do

  call cpu_time(clock_finish)
  print 1002, (clock_finish - clock_start)
  print 1003, flops/(clock_finish - clock_start)*1.0e-9

  ax = exp(-kappa*time1*(kx*kx))
  ay = exp(-kappa*time1*(ky*ky))
  az = exp(-kappa*time1*(kz*kz))
  call cpu_time(clock_start)
  !$acc kernels
  do k=1, nz
     do j=1, ny
        do i=1, nx
           x = dx*(dble(i)-0.5)
           y = dy*(dble(j)-0.5)
           z = dz*(dble(k)-0.5)
           f0 = 0.125*(1.0 - ax*cos(kx*x)) &
                *(1.0 - ay*cos(ky*y)) &
                *(1.0 - az*cos(kz*z))
           ferr = ferr + (f(i,j,k) - f0)*(f(i,j,k) - f0)
        end do
     end do
  end do
  !$acc end kernels
  !$acc end data
  call cpu_time(clock_finish)
  write(*,*) "2", (clock_finish - clock_start)
  ferr = sqrt(ferr/dble(nx*ny*nz))
  print 1004, nx, ny, nz, ferr

1000 format(" ", "time("i4")="f7.5)
1001 format(" ", "Elapsed Time= "1pe9.3" [sec] by GPU")
1002 format(" ", "Elapsed Time= "1pe9.3" [sec] by CPU")
1003 format(" ", "Performance="f6.2" [GFlops]")
1004 format(" ", "Error["i4"]["i4"]["i4"]= "1pe12.6)

end program test


subroutine cpu_diff3d(f,fn,nx,ny,nz,ce,cw,cn,cs,ct,cb,cc)
  use openacc
  implicit none
  integer :: nx, ny, nz
  integer :: i, j, k, ie, iw, jn, js, kt, kb
  real(8) :: ce, cw, cn, cs, ct, cb, cc
  real(8), dimension(nx,ny,nz) :: f, fn

  !$acc kernels pcopy(f,fn)
  do k=1, nz
     do j=1, ny
        do i=1, nx
           ie = i + 1
           iw = i - 1
           jn = j + 1
           js = j - 1
           kt = k + 1
           kb = k - 1
           if( i == nx ) ie = i
           if( i == 1 ) iw = i
           if( j == ny ) jn = j
           if( j == 1 ) js = j
           if( k == nz ) kt = k
           if( k == 1 ) kb = k
           fn(i,j,k) = cc*f(i,j,k) &
                + ce*f(ie,j,k) + cw*f(iw,j,k) &
                + cn*f(i,jn,k) + cs*f(i,js,k) &
                + ct*f(i,j,kt) + cb*f(i,j,kb)
        end do
     end do
  end do
  !$acc end kernels

  return
end subroutine cpu_diff3d

#2b) For the performance questions, I’d need to see the CUDA Fortran version. OpenACC will typically be a bit slower then a well written CUDA Fortran program, but not by this much. Are you verifying correct results? Could the CUDA Fortran program performance be inflated due to differences in the algorithm?

#3) One really nice thing about the OpenACC “present” clause is that it does a run time look up for the memory. Hence, to swap pointers you’d just call cpu_diff3d and flip flop the “f” and “fn” arguments.

In CUDA Fortran, just like regular Fortran, you’d need to use “pointer” and “target” to do this. We did just add pointer support to CUDA Fortran in the 12.8 release as part of our support for texture memory (see: Account Login | PGI). Though, I don’t have an example off hand how to do this.

  • Mat

Dear Mat,

Hi, I have solved the question 2 since I found a mistake of myself. Then, as you said, I got a speed up near to the GPU kernel code.

Thank you very much. I appreciate your kind help.