OpenACC routine directive

Hello,

Thank you for your previous replies. I could get pretty good speed-ups in my MPI-GPU hybrid code. It is very promising. Moving forward, I’m thinking of using async and routine directives to hide memory latency.

Based on documentation and other forum discussions, I wrote the following code structure, but I’m a bit confused about how to use ACC routine directives in Fortran. Could you comment on this reproducing example? I declared the arrays to be used in the device subroutine AAA and used the default(present) clause on the outer loop to indicate that they are already on the device.

program routine_example
use openacc
implicit none

integer,parameter :: &
    NPmax=100000, &
    NPNmax = 300, &
    NSD = 3       &
    TIMESTEPS = 10000 
integer,allocatable :: npnl(:), pair(:,:)
real(8),allocatable :: mass(:), at(:,:), p(:), rho(:), dwdx(:,:,:)
!$acc declare create(npnl,pair,mass,at,p,rho,dwdx)

integer :: ii,i,err

!$acc routine(AAA)

allocate(npnl(NPmax), pair(NPNmax,NPmax), mass(NPmax), at(NSD,NPmax), &
    p(NPmax), rho(NPmax), dwdx(NSD,NPNmax,NPmax), stat=err)
if(err/=0) then 
    print '(A)', '  DYNAMIC ALLOCATION ERROR  '
end if

!$acc data copy(npnl,pair,mass,at,p,rho,dwdx)

do ii=1,TIMESTEPS

! ... other parallel constructs for async ...

!$acc parallel loop independent gang vector default(present) async(7)
do i=1,NPmax
call AAA
!call BBB 
!call CCC 
end do

! ... other parallel constructs for async ...

!$acc wait

end do 

!$acc end data

end program

subroutine AAA
!$acc routine
use openacc
implicit none
integer :: i,j,k
real(8) :: sr,vr(3)

!$acc loop seq private(j,sr,vr)
do k=1,npnl(i)

	j = pair(k,i)

        sr = (p(i)+p(j))/(rho(i)*rho(j))
        vr = sr*dwdx(:,k,i)

        at(:,i) = at(:,i) + mass(j)*vr
end do
end subroutine

Thank you

Hi Yongsuk,

The example has multiple Fortran errors which I’ll ignore except in the context of your question regarding “routine”.

How you’re using “routine” itself is correct. The directive needs to be visible from both the callee and caller so the compiler knows to create the device routine and to know a device version is available to be called. Since you’re using F77-style calling conventions, you are correctly using it twice. If you had an interface (either explicit or implicit if defined in a module), then it only needs to be added once in the interface itself.
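A minimal sketch of the module case, assuming hypothetical names (`kernels_mod`, `scale_elem`) that are not from the original post. Because the routine lives in a module, the caller gets an explicit interface through `use`, so the “routine” directive appears only once:

```fortran
module kernels_mod
  implicit none
contains
  ! Hypothetical device routine; the directive appears only here.
  subroutine scale_elem(x, factor)
    !$acc routine seq
    real(8), intent(inout) :: x
    real(8), intent(in)    :: factor
    x = x * factor
  end subroutine scale_elem
end module kernels_mod

program routine_in_module
  use kernels_mod   ! the explicit interface carries the routine information
  implicit none
  integer, parameter :: n = 1000
  real(8) :: a(n)
  integer :: i

  a = 1.0d0

  ! No second "!$acc routine(scale_elem)" is needed in the caller.
  !$acc parallel loop copy(a)
  do i = 1, n
    call scale_elem(a(i), 2.0d0)
  end do

  print *, 'a(1) =', a(1), ' a(n) =', a(n)
end program routine_in_module
```

Without an OpenACC compiler the directives are treated as comments, so the same source also builds and runs on the host.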

Since “AAA” does have a parallel loop, it may be beneficial to declare “AAA” as a “routine vector” and then use “acc loop vector” instead of “seq”. You would then also remove the “vector” from the outer “parallel loop”. Basically pushing the vector level parallel dimension into the routine.

Granted, this may or may not be beneficial to performance. It largely depends on the size of the loops. For example, if “npnl(i)” is less than 32, you would want it to run sequentially. It’s easy to experiment, though, so it’s worth a try.
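A rough sketch of what the “routine vector” variant might look like, under some assumptions of mine: the module name `sph_mod`, the routine name `accel`, the small array sizes, and the made-up neighbor list are all illustrative. The arrays and the particle index “i” are passed as arguments, and the update of at(:,i) is rewritten as three scalar reductions so that vector lanes do not race on the same elements:

```fortran
module sph_mod
  implicit none
contains
  ! Hypothetical rewrite of AAA: arrays and the index i are arguments,
  ! and the neighbor loop is spread across vector lanes.
  subroutine accel(i, npnl, pair, mass, p, rho, dwdx, at)
    !$acc routine vector
    integer, intent(in)    :: i
    integer, intent(in)    :: npnl(:), pair(:,:)
    real(8), intent(in)    :: mass(:), p(:), rho(:), dwdx(:,:,:)
    real(8), intent(inout) :: at(:,:)
    integer :: j, k
    real(8) :: sr, ax, ay, az

    ax = 0.0d0; ay = 0.0d0; az = 0.0d0
    ! Scalar reductions avoid a race on at(:,i) between vector lanes.
    !$acc loop vector private(j, sr) reduction(+:ax, ay, az)
    do k = 1, npnl(i)
      j  = pair(k, i)
      sr = (p(i) + p(j)) / (rho(i) * rho(j))
      ax = ax + mass(j) * sr * dwdx(1, k, i)
      ay = ay + mass(j) * sr * dwdx(2, k, i)
      az = az + mass(j) * sr * dwdx(3, k, i)
    end do
    at(1, i) = at(1, i) + ax
    at(2, i) = at(2, i) + ay
    at(3, i) = at(3, i) + az
  end subroutine accel
end module sph_mod

program routine_vector_example
  use sph_mod
  implicit none
  integer, parameter :: np = 1000, npn = 64   ! small illustrative sizes
  integer, allocatable :: npnl(:), pair(:,:)
  real(8), allocatable :: mass(:), p(:), rho(:), dwdx(:,:,:), at(:,:)
  integer :: i, k

  allocate(npnl(np), pair(npn,np), mass(np), p(np), rho(np), &
           dwdx(3,npn,np), at(3,np))
  npnl = npn; mass = 1.0d0; p = 1.0d0; rho = 1.0d0; dwdx = 1.0d0; at = 0.0d0
  do i = 1, np
    do k = 1, npn
      pair(k, i) = mod(i + k - 1, np) + 1   ! made-up neighbor list
    end do
  end do

  !$acc data copyin(npnl, pair, mass, p, rho, dwdx) copy(at)
  ! Only "gang" on the outer loop; the vector level lives in the routine.
  !$acc parallel loop gang default(present)
  do i = 1, np
    call accel(i, npnl, pair, mass, p, rho, dwdx, at)
  end do
  !$acc end data

  print *, 'at(:,1) =', at(:,1)
end program routine_vector_example
```

Whether this beats the sequential inner loop depends on the typical value of npnl(i), so it is worth timing both versions.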

Note that calling routines on the device can have a negative impact on performance. It takes about 150 registers to perform a call, thus lowering the occupancy. Also in order to support reductions inside of “routine vector” and avoid syncing threads, the vector length is lowered to 32. Instead, you’ll want to see if the routine can be inlined (via the -Minline flag).

One error that I see is that the arrays being accessed in “AAA” are local to the main program. They can’t actually be accessed as you have it written. You have three options here:

  1. Pass the arrays as arguments
  2. Put them in a module
  3. Use a contained routine

I don’t recommend using a contained subroutine. How they work is that a hidden pointer to the parent’s stack is passed to the subroutine. This allows the child routine to directly access the parent’s local variables without having to pass them as arguments. However, this address would be to the host stack, not the device, so unless you’re using “-gpu=unified” on a Grace-Hopper system, the host stack is not accessible.

Your use of “declare create” is correct and will work the same if you use a module. “declare” is a data region that has the same scope and lifetime as the scoping unit in which it’s declared.

The one issue is that you then put these variables in another “data copy” directive. This doesn’t necessarily hurt, but it’s not doing what you expect. “copy” is really a “present_or_copy”, meaning if the data is already present on the device, which it is here, no copy is performed. Basically it’s a no-op.

Instead, you’ll want to use an “update device” directive after the arrays are initialized, to sync the device and host copies.
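A minimal sketch of the “declare create” plus “update device” pattern, assuming a hypothetical module `fields_mod` with just two of the arrays and placeholder initial values:

```fortran
module fields_mod
  implicit none
  real(8), allocatable :: rho(:), p(:)
  !$acc declare create(rho, p)
end module fields_mod

program update_device_example
  use fields_mod
  implicit none
  integer, parameter :: n = 1000
  integer :: i

  ! With "declare create" on allocatables, the device copies are
  ! allocated together with the host arrays.
  allocate(rho(n), p(n))

  ! Host-side initialization (placeholder values); the device copies
  ! are still uninitialized at this point.
  rho = 1000.0d0
  p   = 0.0d0

  ! Sync the device copies with the initialized host data.
  !$acc update device(rho, p)

  !$acc parallel loop default(present)
  do i = 1, n
    p(i) = 2.0d0 * rho(i)
  end do

  ! Bring the result back before using it on the host.
  !$acc update self(p)
  print *, 'p(1) =', p(1)

  deallocate(rho, p)
end program update_device_example
```

No “data copy” region is involved here; the “update” directives are the only points where data moves between host and device.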

One subtlety is that given these are allocatables, the device array data is implicitly allocated at the same time the host array is allocated. However, if you’re using raw C pointers, then the “data create/copy” is needed to create the data since the “declare” would only be for the pointer itself. Though you’d want to be sure to add the bounds info in the create. Hopefully this doesn’t cause confusion, but I wanted to explain a bit further why you don’t need the data copy here. Fortran is a higher-level language than C, so the compiler can do more things implicitly.

Another error in your program is that the “i” in AAA is local and uninitialized. Be sure to pass in the caller’s “i” since I presume that’s what you intended.

-Mat

Hi Mat,

Thank you for your detailed answer. It helped me find the errors in my code.

"Since “AAA” does have a parallel loop, it may be beneficial to declare “AAA” as a “routine vector” and then use “acc loop vector” instead of “seq”. You would then also remove the “vector” from the outer “parallel loop”. Basically pushing the vector level parallel dimension into the routine.

Granted, this may or may not be beneficial to performance. It largely depends on the size of the loops. For example, if “npnl(i)” is less than 32, you would want it to run sequentially. It’s easy to experiment, though, so it’s worth a try."

→ Thanks for this comment. I tested npnl(i) sizes ranging from 1 to 300 and found that there is a tradeoff with another subroutine. However, since AAA is the most time-consuming subroutine, it is beneficial to parallelize the inner loop.

“Note that calling routines on the device can have a negative impact on performance. It takes about 150 registers to perform a call, thus lowering the occupancy. Also in order to support reductions inside of “routine vector” and avoid syncing threads, the vector length is lowered to 32. Instead, you’ll want to see if the routine can be inlined (via the -Minline flag).”

→ The -Minline flag helped speed up the simulation.

"The one issue is that you then put these variables in another “data copy” directive. This doesn’t necessarily hurt, but it’s not doing what you expect. “copy” is really a “present_or_copy”, meaning if the data is already present on the device, which it is here, no copy is performed. Basically it’s a no-op.

Instead, you’ll want to use an “update device” directive after the arrays are initialized, to sync the device and host copies."

→ This was the error in my code. Thanks!

“One subtlety is that given these are allocatables, the device array data is implicitly allocated at the same time the host array is allocated. However, if you’re using raw C pointers, then the “data create/copy” is needed to create the data since the “declare” would only be for the pointer itself. Though you’d want to be sure to add the bounds info in the create. Hopefully this doesn’t cause confusion, but I wanted to explain a bit further why you don’t need the data copy here. Fortran is a higher-level language than C, so the compiler can do more things implicitly.”

→ This is what I intended the Fortran code to do. I had thought that “declare” only created something like a pointer to the device data.

Thank you again,
Yongsuk
