OpenACC: Declare data region in another file

marshall.ward · July 12, 2024, 4:33pm

I am running a Fortran model with a large number of modules and functions. It has a derived type (a grid type, G) containing several arrays, which is used widely across the model. Although we don’t have any global variables, I would like G to be copied onto our GPU early and used across many different functions.

I am trying to slowly migrate our loops to the GPU with OpenACC. If I select a loop containing G, such as this:

    do j=js,je ; do I=Isq,Ieq
      diffu(I,j,k) = ((G%IdyCu(I,j)*(CS%dy2h(i,j)*str_xx(i,j) - CS%dy2h(i+1,j)*str_xx(i+1,j)) + &
                       G%IdxCu(I,j)*(CS%dx2q(I,J-1)*str_xy(I,J-1) - CS%dx2q(I,J)*str_xy(I,J))) * &
                     G%IareaCu(I,j)) / (h_u(I,j) + h_neglect)
    enddo ; enddo

then I get an error like this (running through ncu):

FATAL ERROR: variable in data clause is partially present on the device: name=g

I can get around this by locally copying G at the beginning of the function:

!$acc enter data copyin(G, CS)

then use

!$acc kernels present(G, CS)
...

(Performance is poor, but that’s not my concern at the moment.) Although G is an argument to the function, it and its contents are essentially static across the run.

I would like to widen this operation so that multiple functions can see this copy of G. I have tried something like this, which occurs in a different file:

! NOTE: A function in another file is calling this function
! horizontal_viscosity() contains the loop from above.

!$acc enter data copyin(G)
call horizontal_viscosity(..., G, ...)
!$acc exit data delete(G)

but this does not seem to have any actual association. Instead, I get the following error:

FATAL ERROR: data in PRESENT clause was not found on device 1: name=g host:0x1709ea0

At this point, I’m unsure how to proceed, or if it is even possible.

Can I create data regions outside of functions, so that the loops inside of functions can see this data?

MatColgrove · July 12, 2024, 10:21pm

Yes. When you add a variable to a data clause, it gets added to the “present table”. Then when looked-up in the function, the compiler checks the host address in the present table to find the associated device address.

A “partially present” error means that the host address was found, but the size is different.
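As a rough illustration of the present-table lookup described above, here is a minimal sketch using a plain array rather than a derived type. The module, routine, and variable names are hypothetical, and it assumes an OpenACC compiler (for example `nvfortran -acc`). The data is placed on the device in the main program and found by host address inside a routine from a separate module:

```fortran
! Hypothetical names throughout; assumes an OpenACC compiler (e.g. nvfortran -acc).
module scale_mod
  implicit none
contains
  subroutine scale_field(a, n)
    integer, intent(in) :: n
    real, intent(inout) :: a(n)
    integer :: i
    ! present(a) looks up the host address of a in the present table;
    ! no data is transferred by this directive.
    !$acc kernels present(a)
    do i = 1, n
      a(i) = 2.0 * a(i)
    enddo
    !$acc end kernels
  end subroutine scale_field
end module scale_mod

program present_demo
  use openacc, only: acc_is_present
  use scale_mod, only: scale_field
  implicit none
  integer, parameter :: n = 8
  real, allocatable :: a(:)

  allocate(a(n))
  a(:) = 1.0

  ! Unstructured data region: the entry stays in the present table until
  ! the matching exit data, whichever routines run in between.
  !$acc enter data copyin(a)
  print *, 'present after enter data: ', acc_is_present(a)

  call scale_field(a, n)

  ! copyout transfers the device values back and removes the entry.
  !$acc exit data copyout(a)
  print *, 'present after exit data:  ', acc_is_present(a)
  print *, 'a(1) = ', a(1)
end program present_demo
```

The same lookup applies to a derived type such as G, but only for the memory that was actually placed on the device. That is where the shallow versus deep copy distinction comes in.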

Without a reproducing example, I can’t be sure why it’s happening here. But since “G” and “CS” are user-defined types, you do need to do a deep copy of the type. Only shallow copies are done by default, so only the fixed-size data members are actually copied. Allocatable array members need to be copied separately.

One possibility is that the compiler is having to implicitly copy “G%IdyCu”. This would overlap the address from the earlier copyin of G. Within the same scoping unit, the compiler may be able to make the association, but cross-function may not.

For deep copies, you have a few options. You can have the compiler do this for you by adding the flag “-gpu=deepcopy”, so that “acc enter data copyin(G,CS)” copies the entire type to the device. The caveat is that you have less fine-grained control over which arrays are copied and when, and it can cause a bit more overhead, though this is typically only an issue if you have large types or only want part of the type copied.

For a manual deep copy, you’d do something like:

!$acc enter data copyin(G, CS)
!$acc enter data copyin(G%IdyCu, G%IareaCu, ..rest of the arrays)
!$acc enter data copyin(CS%dy2h, CS%dx2q, etc.)

So long as “G” is first, the member arrays will get copied to the device and then “attached” to “G”. “attach” creates the association between the parent, G, and the arrays, and fills in the correct device pointers.
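To make the ordering concrete, here is a small self-contained sketch of a manual deep copy done in the main program, with the compute loop living in a separate module. The type, its members (`n`, `Idx`, `Idy`), and the routine name are hypothetical stand-ins, not the real grid type, and it assumes an OpenACC compiler such as nvfortran:

```fortran
! Hypothetical type and routine names; assumes an OpenACC compiler (e.g. nvfortran -acc).
module grid_mod
  implicit none
  type :: grid_type
    integer :: n
    real, allocatable :: Idx(:), Idy(:)
  end type grid_type
contains
  subroutine apply_metrics(G, u)
    type(grid_type), intent(in) :: G
    real, intent(inout) :: u(:)
    integer :: i
    ! G and its members are expected to be on the device already;
    ! u is copied in and out around this region.
    !$acc kernels present(G) copy(u)
    do i = 1, G%n
      u(i) = u(i) * G%Idx(i) * G%Idy(i)
    enddo
    !$acc end kernels
  end subroutine apply_metrics
end module grid_mod

program deepcopy_demo
  use grid_mod, only: grid_type, apply_metrics
  implicit none
  type(grid_type) :: G
  real, allocatable :: u(:)

  G%n = 4
  allocate(G%Idx(G%n), G%Idy(G%n), u(G%n))
  G%Idx(:) = 0.5
  G%Idy(:) = 2.0
  u(:) = 3.0

  ! Parent first (shallow copy), then the members, which get attached to G.
  !$acc enter data copyin(G)
  !$acc enter data copyin(G%Idx, G%Idy)

  call apply_metrics(G, u)

  ! Tear down in the reverse order: members first, then the parent.
  !$acc exit data delete(G%Idx, G%Idy)
  !$acc exit data delete(G)

  ! 3.0 * 0.5 * 2.0 leaves u unchanged if the device saw the right data.
  if (any(abs(u - 3.0) > 1.0e-6)) error stop 'unexpected result'
  print *, 'u = ', u
end program deepcopy_demo
```

Only the members that are actually touched on the device need to appear in the second `enter data`, which is the fine-grained control that the automatic deep copy gives up.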

You may find this article about manual deep copies useful, starting towards the bottom of page 4: https://developer.download.nvidia.com/assets/pgi-legacy-support/Deep-Copy-Support-in-OpenACC_PGI.pdf

This article describes “true” deep-copy, i.e. the “-gpu=deepcopy” flag. https://developer.download.nvidia.com/assets/pgi-legacy-support/True-OpenACC-Deep-Copy-Beta_PGI.pdf

Both articles are a bit old, so some of the flags have changed (like -ta=tesla:deepcopy is now -gpu=deepcopy) and this was before PGI was rebranded to NVHPC, but the concepts are the same.

Let me know if you have questions. Though if adding deep copy doesn’t fix the issue, if you could provide a minimal reproducing example, that may help me understand what’s going on.

-Mat

marshall.ward · July 13, 2024, 3:41pm

Thank you Mat, there is a lot of useful information here, particularly the deepcopy explanation.

My first observation is that the pointer “flattening” trick described in the first PDF document appears to work for me. I can replace all of the arrays inside derived types with pointers to arrays, and everything seems to work.

Second is that I do not appear to need copyin statements for the arrays within the function. For example, this works fine:

subroutine horizontal_viscosity(..., G, CS, ...)
...
!$acc enter data copyin(G, CS)
!$acc kernels
do j = js,je ; do i = is,ie
  ...
enddo ; enddo
!$acc end kernels

I’m not sure why a deepcopy of the contents of G and CS was not required, but perhaps the state of OpenACC has improved. (I am on nvfortran 22.5.)

However, I wasn’t able to make any progress on getting the loop to find G if the copy happens in another file. I tried copying every allocatable in G (all 40!) but it did not seem to matter.

I apologize for not putting together a reproducible example. I will not be available for the next couple of weeks, but I will append one when I return. (In other words, I would be most appreciative if you kept this open for a bit 😅).

marshall.ward · August 2, 2024, 5:44pm

Thanks for your patience. I have put together a much smaller example which demonstrates the problem.

In this case, the loop uses a derived type G which contains two allocatable arrays. Both appear in the loop.

I was able to confirm the following:

!$acc kernels triggered the partial presence error
Local !$acc enter data copyin(G) resolved the problem, although this is not what I want. (I don’t want to copy on every call.)
!$acc kernels present(G) only worked if I also did a manual deepcopy in the main() function:

!$acc enter data copyin(G)
!$acc enter data copyin(G, G%Idx, G%Idy)

At least in my idealized case, your deepcopy suggestion seems to resolve my problem.

I have not yet been able to emulate this success in the production code. But this is a much more complex type, and I could have easily overlooked one of the fields.

I’ll link it here in case it is helpful to the discussion.

https://github.com/NOAA-GFDL/MOM6/blob/e30a6e7118171737404bb79265f9d82058e0593a/src/core/MOM_grid.F90#L26-L199


      
          type, public :: ocean_grid_type
            type(MOM_domain_type), pointer :: Domain => NULL() !< Ocean model domain
            type(MOM_domain_type), pointer :: Domain_aux => NULL() !< A non-symmetric auxiliary domain type.
            type(hor_index_type) :: HI !< Horizontal index ranges
            type(hor_index_type) :: HId2 !< Horizontal index ranges for level-2-downsampling
          
            integer :: isc !< The start i-index of cell centers within the computational domain
            integer :: iec !< The end i-index of cell centers within the computational domain
            integer :: jsc !< The start j-index of cell centers within the computational domain
            integer :: jec !< The end j-index of cell centers within the computational domain
          
            integer :: isd !< The start i-index of cell centers within the data domain
            integer :: ied !< The end i-index of cell centers within the data domain
            integer :: jsd !< The start j-index of cell centers within the data domain
            integer :: jed !< The end j-index of cell centers within the data domain
          
            integer :: isg !< The start i-index of cell centers within the global domain
            integer :: ieg !< The end i-index of cell centers within the global domain
            integer :: jsg !< The start j-index of cell centers within the global domain
            integer :: jeg !< The end j-index of cell centers within the global domain


I think that what I need to do next is doublecheck this deepcopy and ensure that G is being correctly transferred to the device memory.

Thank you again for your help.

MatColgrove · August 2, 2024, 8:01pm

For the production code, I see that you have pointers in there. Are these allocated or assigned?

If they’re allocated then this method works.

If they’re assigned, then add the “copyin” after the assignment if the target isn’t on the device. If the target is already on the device, then you want to use “attach” instead.

“attach” basically fills in the device pointer in the type to the device address of the target but doesn’t allocate new memory.

There’s also “exit data detach”, which removes the pointer assignment but doesn’t deallocate it.
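A minimal sketch of the “attach” case, assuming a pointer member that is assigned (not allocated) to a target which is already on the device. The type `view_type`, the member `p`, and the routine name are made up for illustration, and an OpenACC compiler such as nvfortran is assumed:

```fortran
! Hypothetical names; assumes an OpenACC compiler (e.g. nvfortran -acc).
module view_mod
  implicit none
  type :: view_type
    real, pointer :: p(:) => null()
  end type view_type
contains
  subroutine add_one(V, n)
    type(view_type), intent(in) :: V
    integer, intent(in) :: n
    integer :: i
    !$acc kernels present(V)
    do i = 1, n
      V%p(i) = V%p(i) + 1.0
    enddo
    !$acc end kernels
  end subroutine add_one
end module view_mod

program attach_demo
  use view_mod, only: view_type, add_one
  implicit none
  integer, parameter :: n = 4
  real, allocatable, target :: buf(:)
  type(view_type) :: V

  allocate(buf(n))
  buf(:) = 1.0

  ! The target goes to the device first.
  !$acc enter data copyin(buf)

  ! Host-side pointer assignment, then a shallow copy of the parent.
  V%p => buf
  !$acc enter data copyin(V)

  ! attach allocates nothing; it points the device copy of V%p
  ! at the device copy of buf.
  !$acc enter data attach(V%p)

  call add_one(V, n)

  ! detach removes the device-side association without freeing buf.
  !$acc exit data detach(V%p)
  !$acc exit data delete(V)
  !$acc exit data copyout(buf)

  print *, 'buf = ', buf
end program attach_demo
```

If the target were not yet on the device, the `attach` line would be replaced by a `copyin(V%p)` after the pointer assignment, as described above.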

MatColgrove · August 2, 2024, 9:16pm

I should mention that if this gets to be too much of a challenge, try adding the flag “-gpu=managed”. All allocations will be put into Unified Memory which is visible to both the host and device. “G” would still need to be put in a data directive since it’s not allocated.

If you happen to be using a Grace-Hopper system, then you can use “-gpu=unified” instead, in which case all memory is visible, including stack and static host memory, so no data directives are needed. “unified” is available on x86 as well, but only with newer Linux kernels and HMM support enabled.
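For comparison, a sketch of the managed-memory variant: the allocatable members are never named in a data directive, and only the parent G is copied. The names are hypothetical. It assumes a build line along the lines of `nvfortran -acc -gpu=managed` on a system that supports managed memory. Without that flag, the same source would likely fail at run time, much like the errors quoted earlier in the thread.

```fortran
! Hypothetical example; assumed build line: nvfortran -acc -gpu=managed managed_demo.f90
module mgrid_mod
  implicit none
  type :: grid_type
    integer :: n
    real, allocatable :: Idx(:)
  end type grid_type
contains
  subroutine scale_by_idx(G, u)
    type(grid_type), intent(in) :: G
    real, intent(inout) :: u(:)
    integer :: i
    !$acc kernels present(G)
    do i = 1, G%n
      u(i) = u(i) * G%Idx(i)
    enddo
    !$acc end kernels
  end subroutine scale_by_idx
end module mgrid_mod

program managed_demo
  use mgrid_mod, only: grid_type, scale_by_idx
  implicit none
  type(grid_type) :: G
  real, allocatable :: u(:)

  G%n = 4
  ! With managed memory, these allocations are visible to host and device.
  allocate(G%Idx(G%n), u(G%n))
  G%Idx(:) = 2.0
  u(:) = 1.0

  ! G itself is not allocated, so it still needs a data directive.
  !$acc enter data copyin(G)

  call scale_by_idx(G, u)

  !$acc exit data delete(G)
  print *, 'u = ', u
end program managed_demo
```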

marshall.ward · August 5, 2024, 4:46pm

Thank you Mat, -gpu=managed explains some of the confusion that I was having (including a deleted comment which you may have noticed). When compiled with -gpu=managed, I only need to provide !$acc enter data copyin(G, CS). This works both inside and outside of the function. I can also migrate the example loop in my production code to our GPU.

I can also turn off managed memory and explicitly copy the fields which appear in the loop. I did not need to copy every field.

This also works in the production code, so I believe that I have finally overcome this particular hurdle. Thanks so much for your help!
