OpenACC FORTRAN pointer how-to question

I’ve been trying to get a program working that uses pointers to allocatable arrays:

module Vars
real(8), pointer :: pA(:)
real(8), allocatable, target :: A(:)

!$acc declare create(pA,A)
end module Vars

allocate(A(3))
pA => A

!$acc parallel

A(3) = 3.0
pA(1) = 1.0

!$acc end parallel
!$acc update host(A)
!-- A comes back as { 0.0, 0.0, 3.0 }

What needs to be done to have the pA in device memory reference the device copy of the data that the host pointer targets?

(Let me know if you need a more concrete example.)

There are two ways to fix this.

First, remove “pA” from the declare create and add it to a “present” clause instead. Since A and pA point to the same host address, when the present check is performed, pA gets associated with A’s device copy.

% cat test.f90
module Vars
real(8), pointer :: pA(:)
real(8), allocatable, target :: A(:)

!$acc declare create(A)
end module Vars

program foo

use Vars
allocate(A(3))
pA => A

!$acc serial present(pA)
pA(3) = 3.0
pA(1) = 1.0
!$acc end serial

!$acc update host(A)
print *, A
deallocate(A)
end program foo

% nvfortran -acc test.f90 ; a.out
    1.000000000000000         0.000000000000000         3.000000000000000

The problem with this solution arises if you need pA in a declare create so that it can be accessed directly from within a device routine. In that case, keep pA in the declare create, then call acc_attach to update the device copy of pA so that it points to the device copy of A.

% cat test2.F90
module Vars
real(8), pointer :: pA(:)
real(8), allocatable, target :: A(:)

!$acc declare create(A,pA)

contains

subroutine setVal(idx,val)
!$acc routine seq
integer, value :: idx
real(8), value :: val
pA(idx)=val
end subroutine setVal

end module Vars

program foo

use Vars
#ifdef _OPENACC
use openacc
#endif
allocate(A(3))
pA => A

#ifdef _OPENACC
call acc_attach(pA)
#endif

!$acc serial present(pA)
call setVal(3,3.0_8)
call setVal(1,1.0_8)
!$acc end serial

!$acc update host(A)
print *, A
deallocate(A)
end program foo

% nvfortran -acc test2.F90 ; a.out
    1.000000000000000         0.000000000000000         3.000000000000000

Hope this helps,
Mat

This is very helpful. Solved another issue for my OpenACC port.

I only have the OpenACC API. Can you recommend a practical reference or a book to help me learn the programming nuances? I don’t think I would have figured out the pointer issue without your help.

Thanks!

There are two OpenACC books (see: https://www.openacc.org/resources).

I wrote the Data Management chapter (#5) in Parallel Programming with OpenACC. Most of the examples are written in C, though it might still be helpful for understanding some of the concepts. The examples from the book are available at no cost at: https://github.com/rmfarber/ParallelProgrammingWithOpenACC/tree/master/Chapter05

OpenACC for Programmers: Concepts and Strategies is newer and probably what I’d recommend you get first. It’s designed more as a textbook for classrooms.

The resources link does have several online classes as well.

There are also several GPU Bootcamps (GPU Bootcamp | NVIDIA Developer) throughout the year, and if you can put together a team, the GPU Hackathons (https://gpuhackathons.org/) are very good (I mentor 4-6 of them a year).

It looks like you can also remove both the declare line and the update line; the compiler will copy in the data the kernel needs.

Here is the code:

module Vars
real(8), pointer :: pA(:)
real(8), allocatable, target :: A(:)
integer n

 !!$acc declare create(A)
 end module Vars

 program foo

 use Vars

 n=3

 allocate(A(n))
 pA => A

 !!$acc serial present(pA)
 !$acc serial 
   A(1)=1
   pA(2)=2
   pA(3)=3
 !$acc end serial

 !!$acc update host(A)
 !!$acc kernels
 ! do i=1, n
 !    pA(i) = i
 ! end do
 !!$acc end kernels

 print *, A
 deallocate(A)
 end program foo

nvfortran -acc -o present *present.f90
(rapids) root@nwzvenmy9r:/notebooks/ParallelProgrammingWithOpenACC/Chapter13/example_openacc11/TP# ./present
1.000000000000000 2.000000000000000 3.000000000000000

Yes, the compiler will do an implicit copy of the data. However, this is bad for performance, since the copy is done every time the kernel is launched. It’s not an issue here, but a real code can end up spending most of its time copying data.

Ideally, you want to copy the data once at the beginning of the program and once at the end, and have all computation on the data performed on the device. Data movement is one of the biggest performance bottlenecks in GPU programming, so it’s best to minimize it as much as possible.
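
As a rough illustration of the "copy once in, copy once out" pattern, here is a minimal sketch using a structured data region. It is not taken from the code earlier in this thread: the program name data_region and the variables nsteps, step and expected are hypothetical, and a module-level declare create with a single update device / update host pair would be another way to get the same effect.

```fortran
program data_region
  implicit none
  integer, parameter :: n = 3, nsteps = 100   ! hypothetical sizes, for illustration only
  real(8), allocatable :: A(:)
  real(8) :: expected(n)
  integer :: i, step

  allocate(A(n))
  A = 0.0_8

  ! A is copied to the device once on entry to the data region
  ! and copied back to the host once at "end data".
  !$acc data copy(A)
  do step = 1, nsteps
     ! present(A) states that A is already on the device,
     ! so no implicit copy is needed for each kernel launch.
     !$acc parallel loop present(A)
     do i = 1, n
        A(i) = A(i) + real(i, 8)
     end do
  end do
  !$acc end data

  ! Each element has been incremented by i a total of nsteps times.
  expected = [(real(nsteps * i, 8), i = 1, n)]
  if (any(abs(A - expected) > 1.0d-12)) error stop "unexpected result"

  print *, A
  deallocate(A)
end program data_region
```

Built with something like nvfortran -acc, the 100 kernel launches share a single copy-in and a single copy-out. Without OpenACC enabled, the directives are plain comments and the program still runs on the host, giving 100, 200 and 300 either way.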