OpenACC routine directive

yongsuk · March 27, 2024, 6:10am

Hello,

Thank you for your previous replies. I could get pretty good speed-ups in my MPI-GPU hybrid code. It is very promising. Moving forward, I’m thinking of using async and routine directives to hide memory latency.

Based on the documentation and other forum discussions, I wrote the following code structure, but I’m a bit confused about how to use ACC routine directives in Fortran. Could you comment on this reproducer? I declared the arrays to be used in the device subroutine AAA and used the default(present) clause on the outer loop to indicate that they are already on the device.

program routine_example
use openacc
implicit none

integer,parameter :: &
    NPmax=100000, &
    NPNmax = 300, &
    NSD = 3, &
    TIMESTEPS = 10000
integer,allocatable :: npnl(:), pair(:,:)
real(8),allocatable :: mass(:), at(:,:), p(:), rho(:), dwdx(:,:,:)
!$acc declare create(npnl,pair,mass,at,p,rho,dwdx)

integer :: ii,i,err

!$acc routine(AAA)

allocate(npnl(NPmax), pair(NPNmax,NPmax), mass(NPmax), at(NSD,NPmax), &
    p(NPmax), rho(NPmax), dwdx(NSD,NPNmax,NPmax), stat=err)
if(err/=0) then 
    print '(A)', '  DYNAMIC ALLOCATION ERROR  '
end if

!$acc data copy(npnl,pair,mass,at,p,rho,dwdx)

do ii=1,TIMESTEPS

! ... other parallel constructs for async ...

!$acc parallel loop independent gang vector default(present) async(7)
do i=1,NPmax
call AAA
!call BBB 
!call CCC 
end do

! ... other parallel constructs for async ...

!$acc wait

end do 

!$acc end data

end program

subroutine AAA
!$acc routine
use openacc
implicit none
integer :: i,j,k
real(8) :: sr,vr(3)

!$acc loop seq private(j,sr,vr)
do k=1,npnl(i)

        j = pair(k,i)

        sr = (p(i)+p(j))/(rho(i)*rho(j))
        vr = sr*dwdx(:,k,i)

        at(:,i) = at(:,i) + mass(j)*vr
end do
end subroutine

Thank you

MatColgrove · March 27, 2024, 4:15pm

Hi Yongsuk,

The example has multiple Fortran errors which I’ll ignore except in the context of your question regarding “routine”.

How you’re using “routine” itself is correct. The directive needs to be visible from both the callee and caller so the compiler knows to create the device routine and to know a device version is available to be called. Since you’re using F77-style calling conventions, you are correctly using it twice. If you had an interface (either explicit or implicit if defined in a module), then it only needs to be added once in the interface itself.

Since “AAA” does have a parallel loop, it may be beneficial to declare “AAA” as a “routine vector” and then use “acc loop vector” instead of “seq”. You would then also remove the “vector” from the outer “parallel loop”. Basically pushing the vector level parallel dimension into the routine.

Granted, this may or may not be beneficial to performance. It largely depends on the size of the loops. For example, if “npnl(i)” is less than 32, you would want it to run sequentially. It’s easy to experiment, though, so it’s worth a try.
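As a rough illustration of pushing the vector level into the routine, here is a minimal sketch. The module name “forces”, the routine “accumulate”, and the array sizes are hypothetical and not taken from the original code. It only shows the shape of the directives: “routine vector” on the callee, “loop vector” on the inner loop, and “gang” only on the caller’s loop.

```fortran
module forces
  implicit none
contains
  ! Hypothetical routine: sums column i of w into total(i).
  subroutine accumulate(i, m, n, w, total)
    !$acc routine vector
    integer, intent(in)    :: i, m, n
    real(8), intent(in)    :: w(m, n)
    real(8), intent(inout) :: total(n)
    integer :: k
    real(8) :: s

    s = 0.0d0
    ! The inner loop carries the vector parallelism.
    !$acc loop vector reduction(+:s)
    do k = 1, m
      s = s + w(k, i)
    end do
    total(i) = s
  end subroutine accumulate
end module forces

program routine_vector_sketch
  use forces
  implicit none
  integer, parameter :: n = 1000, m = 64   ! illustrative sizes only
  real(8), allocatable :: w(:,:), total(:)
  integer :: i

  allocate(w(m, n), total(n))
  w = 1.0d0
  total = 0.0d0

  ! Only "gang" here; the vector level is used inside accumulate.
  !$acc parallel loop gang copyin(w) copy(total)
  do i = 1, n
    call accumulate(i, m, n, w, total)
  end do

  print *, total(1), total(n)   ! each entry is the sum of m ones, i.e. 64
end program routine_vector_sketch
```

Whether this beats the “seq” version depends on the inner trip count, as noted above, so it is worth timing both variants.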

Note that calling routines on the device can have a negative impact on performance. It takes about 150 registers to perform a call, thus lowering the occupancy. Also in order to support reductions inside of “routine vector” and avoid syncing threads, the vector length is lowered to 32. Instead, you’ll want to see if the routine can be inlined (via the -Minline flag).

One error that I see is that the arrays being accessed in “AAA” are local to the main program. They can’t actually be accessed as you have it written. You have three options here:

1. Pass the arrays as arguments
2. Put them in a module
3. Use a contained routine

I don’t recommend using a contained subroutine. How they work is that a hidden pointer to the parent’s stack is passed to the subroutine. This allows the child routine to directly access the parent’s local variables without having to pass them as arguments. However, this address would be to the host stack, not the device, so unless you’re using “-gpu=unified” on a Grace-Hopper system, the host stack is not accessible.

Your use of “declare create” is correct and will work the same if you use a module. “declare” is a data region that has the same scope and lifetime as the scoping unit in which it’s declared.

The one issue is that you then put these variables in another “data copy” directive. This doesn’t necessarily hurt, but it’s not doing what you expect. “copy” is really a “present_or_copy”, meaning if the data is already present on the device, which it is here, no copy is performed. Basically it’s a no-op.

Instead, you’ll want to use an “update device” directive after the arrays are initialized, to sync the device and host copies.
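A minimal sketch of the “declare create” plus “update device” pattern, assuming the arrays live in a module. The module name “particle_data”, the array size, and the initial values are made up for illustration.

```fortran
module particle_data
  implicit none
  real(8), allocatable :: mass(:), rho(:)
  ! Device copies follow the host allocation of these arrays.
  !$acc declare create(mass, rho)
end module particle_data

program update_device_sketch
  use particle_data
  implicit none
  integer, parameter :: n = 1000   ! illustrative size only
  integer :: i

  allocate(mass(n), rho(n))        ! device arrays are allocated here as well

  ! Initialize on the host, then push the values to the device.
  mass = 2.0d0
  rho  = 0.0d0
  !$acc update device(mass, rho)

  !$acc parallel loop default(present)
  do i = 1, n
    rho(i) = 3.0d0 * mass(i)
  end do

  ! Bring the result back before using it on the host.
  !$acc update self(rho)
  print *, rho(1), rho(n)          ! each entry is 3 * 2 = 6

  deallocate(mass, rho)
end program update_device_sketch
```

Note there is no enclosing “data copy” region here; the “update” directives do the host/device synchronization explicitly.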

One subtlety is that given these are allocatables, the device array data is implicitly allocated at the same time the host array is allocated. However, if you’re using raw C pointers, then the “data create/copy” is needed to create the data since the “declare” would only be for the pointer itself. You’d also want to be sure to add the bounds info in the create. Hopefully this doesn’t cause confusion, but I wanted to explain a bit further why you don’t need the data copy here. Fortran is a higher-level language than C, so the compiler can do more things implicitly.

Another error in your program is that the “i” in AAA is local and uninitialized. Be sure to pass in the caller’s “i” since I presume that’s what you intended.
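Putting the points above together, one possible corrected layout is sketched below. It assumes the module option, with “AAA” moved into the module so the “routine” directive is needed only once, and the caller’s “i” passed as an argument. The module name “sph_data” and the tiny sizes and initial values in the driver are hypothetical, chosen only so the result is easy to check.

```fortran
module sph_data
  implicit none
  integer, allocatable :: npnl(:), pair(:,:)
  real(8), allocatable :: mass(:), at(:,:), p(:), rho(:), dwdx(:,:,:)
  !$acc declare create(npnl, pair, mass, at, p, rho, dwdx)
contains
  subroutine AAA(i)
    !$acc routine seq
    integer, intent(in) :: i      ! particle index from the caller's loop
    integer :: j, k
    real(8) :: sr

    do k = 1, npnl(i)
      j  = pair(k, i)
      sr = (p(i) + p(j)) / (rho(i) * rho(j))
      at(:, i) = at(:, i) + mass(j) * sr * dwdx(:, k, i)
    end do
  end subroutine AAA
end module sph_data

program aaa_sketch
  use sph_data
  implicit none
  integer, parameter :: np = 4, npn = 2, nsd = 3   ! illustrative sizes only
  integer :: i

  allocate(npnl(np), pair(npn, np), mass(np), at(nsd, np), &
           p(np), rho(np), dwdx(nsd, npn, np))

  ! Made-up initial values for the sketch.
  npnl = npn
  pair = 1
  mass = 1.0d0
  at   = 0.0d0
  p    = 1.0d0
  rho  = 1.0d0
  dwdx = 0.5d0
  !$acc update device(npnl, pair, mass, at, p, rho, dwdx)

  !$acc parallel loop gang vector default(present)
  do i = 1, np
    call AAA(i)
  end do

  !$acc update self(at)
  print *, at(:, 1)   ! two neighbors, each adding 1 * 2 * 0.5, so each component is 2
end program aaa_sketch
```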

-Mat

yongsuk · March 27, 2024, 9:35pm

Hi Mat,

Thank you for the detailed answer. It helped me find the error in my code.

"Since “AAA” does have a parallel loop, it may be beneficial to declare “AAA” as a “routine vector” and then use “acc loop vector” instead of “seq”. You would then also remove the “vector” from the outer “parallel loop”. Basically pushing the vector level parallel dimension into the routine.

Granted, this may or may not be beneficial to performance. It largely depends on the size of the loops. For example, if “npnl(i)” is less than 32, you would want it to run sequentially. It’s easy to experiment, though, so it’s worth a try."

→ Thanks for this comment. I tested with npnl(i) ranging from 1 to 300 and found that there is a tradeoff with another subroutine. However, as AAA is the most time-consuming subroutine, it is beneficial to parallelize the inner loop.

“Note that calling routines on the device can have a negative impact on performance. It takes about 150 registers to perform a call, thus lowering the occupancy. Also in order to support reductions inside of “routine vector” and avoid syncing threads, the vector length is lowered to 32. Instead, you’ll want to see if the routine can be inlined (via the -Minline flag).”

→ The -Minline flag helped speed up the simulation.

"The one issue is that you then put these variables in another “data copy” directive. This doesn’t necessarily hurt, but it’s not doing what you expect. “copy” is really a “present_or_copy”, meaning if the data is already present on the device, which it is here, no copy is performed. Basically it’s a no-op.

Instead, you’ll want to use an “update device” directive after the arrays are initialized, to sync the device and host copies."

→ This was the error in my code. Thanks!

“One subtlety is that given these are allocatables, the device array data is implicitly allocated at the same time the host array is allocated. However, if you’re using raw C pointers, then the “data create/copy” is needed to create the data since the “declare” would only be for the pointer itself. You’d also want to be sure to add the bounds info in the create. Hopefully this doesn’t cause confusion, but I wanted to explain a bit further why you don’t need the data copy here. Fortran is a higher-level language than C, so the compiler can do more things implicitly.”

→ This is what I intended the Fortran code to do. I thought “declare” would only act as something like a pointer to the device data.

Thank you again,
Yongsuk

