OpenACC: Declare data region in another file

I am running a Fortran model with a large number of modules and functions. It has a derived type (the grid type, G) that contains several arrays and is used widely across the model. Although we don’t have any global variables, I would like G to be copied onto our GPU early and used across many different functions.

I am trying to slowly migrate our loops to the GPU with OpenACC. If I select a loop containing G, such as this:

    do j=js,je ; do I=Isq,Ieq
      diffu(I,j,k) = ((G%IdyCu(I,j)*(CS%dy2h(i,j)*str_xx(i,j) - CS%dy2h(i+1,j)*str_xx(i+1,j)) + &
                       G%IdxCu(I,j)*(CS%dx2q(I,J-1)*str_xy(I,J-1) - CS%dx2q(I,J)*str_xy(I,J))) * &
                     G%IareaCu(I,j)) / (h_u(I,j) + h_neglect)
    enddo ; enddo

then I get an error like this (running through ncu):

FATAL ERROR: variable in data clause is partially present on the device: name=g

I can get around this by locally copying G at the beginning of the function:

!$acc enter data copyin(G, CS)

then use

!$acc kernels present(G, CS)
...

(Performance is poor, but that’s not my concern at the moment.) Although G is an argument to the function, it and its contents are essentially static across the run.

I would like to widen this operation so that multiple functions can see this copy of G. I have tried something like this, which occurs in a different file:

! NOTE: A function in another file calls this function.
! horizontal_viscosity() contains the loop from above.

!$acc enter data copyin(G)
call horizontal_viscosity(..., G, ...)
!$acc exit data delete(G)

but this does not seem to have any actual association. Instead, I get the following error:

FATAL ERROR: data in PRESENT clause was not found on device 1: name=g host:0x1709ea0

At this point, I’m unsure how to proceed, or if it is even possible.

Can I create data regions outside of functions, so that the loops inside of functions can see this data?

Yes. When you add a variable to a data clause, it gets added to the “present table”. Then when looked-up in the function, the compiler checks the host address in the present table to find the associated device address.

A “partially present” error means that the host address was found, but the size is different.

Since there’s no reproducing example, I can’t be sure why it’s happening here. But since “G” and “CS” are user-defined types, you do need to do a deep copy of the type. Only shallow copies are done by default, so only the fixed-size data members are actually copied. Allocatable array members need to be copied separately.

One possibility is that the compiler is having to implicitly copy “G%IdyCu”. This would overlap the address from the earlier copyin of G. Within the same scoping unit, the compiler may be able to make the association, but across functions it may not.

For deep copies, you have a few options. You can have the compiler do this for you by adding the flag “-gpu=deepcopy”, so that when using “acc enter data copyin(G, CS)” the entire type is copied to the device. The caveat is that you have less fine-grained control over which arrays are copied and when, and it can cause a bit more overhead. This is typically only an issue if you have large types or only want part of the type copied.

For a manual deep copy, you’d do something like:

!$acc enter data copyin(G, CS)
!$acc enter data copyin(G%IdyCu, G%IareaCu, ..rest of the arrays)
!$acc enter data copyin(CS%dy2h, CS%dx2q, etc.)

So long as “G” is first, the member arrays will get copied to the device and then “attached” to “G”. “attach” creates the association between the parent, G, and the arrays, and fills in the correct device pointers.
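To make the cross-file case concrete, here is a minimal sketch of the manual deep copy described above. It is an illustration rather than the production code: the type layout, the member names Idx and Idy, and the routine scale_field (standing in for horizontal_viscosity) are all assumptions. The module and the program could live in separate files; the routine does no data movement itself and relies on the caller's enter data.

```fortran
module grid_mod
  implicit none
  ! Hypothetical stand-in for the model's grid type.
  type :: grid_type
    real, allocatable :: Idx(:,:), Idy(:,:)
  end type grid_type
contains
  ! Stand-in for horizontal_viscosity(): no enter data here.
  ! It assumes the caller already placed G and its members on the device.
  subroutine scale_field(G, f, nx, ny)
    type(grid_type), intent(in) :: G
    integer, intent(in) :: nx, ny
    real, intent(inout) :: f(nx, ny)
    integer :: i, j
    !$acc parallel loop collapse(2) present(G) copy(f)
    do j = 1, ny
      do i = 1, nx
        f(i,j) = f(i,j) * G%Idx(i,j) * G%Idy(i,j)
      enddo
    enddo
  end subroutine scale_field
end module grid_mod

program demo
  use grid_mod
  implicit none
  integer, parameter :: nx = 8, ny = 4
  type(grid_type) :: G
  real :: f(nx, ny)

  allocate(G%Idx(nx, ny), G%Idy(nx, ny))
  G%Idx = 2.0 ; G%Idy = 0.5 ; f = 3.0

  ! Parent first (shallow copy), then the members, which get attached to it.
  !$acc enter data copyin(G)
  !$acc enter data copyin(G%Idx, G%Idy)

  call scale_field(G, f, nx, ny)

  ! Release in the reverse order: members first, then the parent.
  !$acc exit data delete(G%Idx, G%Idy)
  !$acc exit data delete(G)

  ! 3.0 * 2.0 * 0.5 is exact, so the deviation should be zero.
  print *, 'max deviation from 3.0:', maxval(abs(f - 3.0))
end program demo
```

With nvfortran this would be built with -acc (adding -Minfo=accel shows what the compiler generated). Leaving out the second enter data line is one way to reproduce a missing-member situation for comparison.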

You may find this article about manual deep copies useful, starting towards the bottom of page 4: https://developer.download.nvidia.com/assets/pgi-legacy-support/Deep-Copy-Support-in-OpenACC_PGI.pdf

This article describes “true” deep-copy, i.e. the “-gpu=deepcopy” flag. https://developer.download.nvidia.com/assets/pgi-legacy-support/True-OpenACC-Deep-Copy-Beta_PGI.pdf

Both articles are a bit old, so some of the flags have changed (like -ta=tesla:deepcopy is now -gpu=deepcopy) and this was before PGI was rebranded to NVHPC, but the concepts are the same.

Let me know if you have questions. If adding deep copy doesn’t fix the issue, a minimal reproducing example would help me understand what’s going on.

-Mat

Thank you Mat, there is a lot of useful information here, particularly the deepcopy explanation.

My first observation is that the pointer “flattening” trick described in the first PDF document appears to work for me. I can replace all of the arrays inside derived types with pointers to arrays, and everything seems to work.

Second is that I do not appear to need copyin statements for the arrays within the function. For example, this works fine:

subroutine horizontal_viscosity(..., G, CS, ...)
...
!$acc enter data copyin(G, CS)
!$acc kernels
do j = js,je ; do i = is,ie
  ...
enddo ; enddo
!$acc end kernels

I’m not sure why a deepcopy of the contents of G and CS was not required, but perhaps the state of OpenACC has improved. (I am on nvfortran 22.5.)

However, I wasn’t able to make any progress on getting the loop to find G if the copy happens in another file. I tried copying every allocatable in G (all 40!) but it did not seem to matter.

I apologize for not putting together a reproducible example. I will not be available for the next couple of weeks, but I will append one when I return. (In other words, I would be most appreciative if you kept this open for a bit 😅).

Thanks for your patience. I have put together a much smaller example which demonstrates the problem.

In this case, the loop uses a derived type G which contains two allocatable arrays. Both appear in the loop.

I was able to confirm the following:

  • !$acc kernels triggered the partial presence error
  • Local !$acc enter data copyin(G) resolved the problem, although this is not what I want. (I don’t want to copy on every call.)
  • !$acc kernels present(G) only worked if I also did a manual deepcopy in the main() function:
!$acc enter data copyin(G)
!$acc enter data copyin(G, G%Idx, G%Idy)

At least in my idealized case, your deepcopy suggestion seems to resolve my problem.

I have not yet been able to emulate this success in the production code. But this is a much more complex type, and I could have easily overlooked one of the fields.

I’ll link it here in case it is helpful to the discussion.

I think that what I need to do next is doublecheck this deepcopy and ensure that G is being correctly transferred to the device memory.
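One way to double-check which members actually reached the device is the OpenACC runtime routine acc_is_present. The sketch below is only an illustration under assumptions: the type and the member names Idx and Idy are hypothetical, and Idy is deliberately left out of the copy so that the two queries can differ on a non-managed GPU build.

```fortran
program check_present
  use openacc
  implicit none
  ! Hypothetical stand-in for the grid type.
  type :: grid_type
    real, allocatable :: Idx(:,:), Idy(:,:)
  end type grid_type
  type(grid_type) :: G

  allocate(G%Idx(8,4), G%Idy(8,4))
  G%Idx = 1.0 ; G%Idy = 1.0

  ! Shallow copy of the parent, then only one of the two members.
  !$acc enter data copyin(G)
  !$acc enter data copyin(G%Idx)

  ! Query the present table for each member's array data.
  print *, 'G%Idx present: ', acc_is_present(G%Idx)
  print *, 'G%Idy present: ', acc_is_present(G%Idy)

  !$acc exit data delete(G%Idx)
  !$acc exit data delete(G)
end program check_present
```

In a large type, a loop of such checks placed just before the first kernel can narrow down an overlooked field. Results may differ when building with -gpu=managed, since allocatable data then lives in unified memory.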

Thank you again for your help.

For the production code, I see that you have pointers in there. Are these allocated or assigned?

If they’re allocated then this method works.

If they’re assigned, then add the “copyin” after the assignment if the target isn’t on the device. If the target is already on the device, then you want to use “attach” instead.

“attach” basically fills in the device pointer in the type to the device address of the target but doesn’t allocate new memory.

There’s also “exit data detach”, which removes the pointer association on the device but doesn’t deallocate the target.
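As a rough sketch of the attach/detach case, assume a pointer member that is assigned to a target which is placed on the device separately. The names grid_type, mask, and buf are hypothetical and only serve the illustration.

```fortran
program attach_demo
  implicit none
  ! Hypothetical type with a pointer member that is assigned, not allocated.
  type :: grid_type
    real, pointer :: mask(:) => null()
  end type grid_type
  integer, parameter :: n = 16
  real, allocatable, target :: buf(:)
  type(grid_type) :: G
  real :: total
  integer :: i

  allocate(buf(n))
  buf = 1.0
  G%mask => buf          ! pointer assignment on the host

  ! The target goes to the device on its own.
  !$acc enter data copyin(buf)
  ! Shallow copy of the parent; its pointer still refers to host memory.
  !$acc enter data copyin(G)
  ! attach: point the device copy of G%mask at the device copy of buf,
  ! without allocating any new device memory.
  !$acc enter data attach(G%mask)

  total = 0.0
  !$acc parallel loop present(G) reduction(+:total)
  do i = 1, n
    total = total + G%mask(i)
  enddo

  ! detach undoes the device pointer association; buf stays on the device
  ! until it is deleted explicitly.
  !$acc exit data detach(G%mask)
  !$acc exit data delete(G)
  !$acc exit data delete(buf)

  print *, 'sum =', total
end program attach_demo
```

If buf were not already on the device, the other branch described above applies: a copyin(G%mask) issued after the pointer assignment, with G already present, instead of the attach.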

I should mention that if this gets to be too much of a challenge, try adding the flag “-gpu=managed”. All allocations will be put into Unified Memory, which is visible to both the host and device. “G” would still need to be put in a data directive since it isn’t dynamically allocated.

If you happen to be using a Grace-Hopper system, then you can use “-gpu=unified” instead, in which case all memory is visible, including stack and static host memory, so no data directives are needed. “unified” is available on x86 as well, but only with newer Linux kernels and HMM support enabled.

Thank you Mat, -gpu=managed explains some of the confusion that I was having (including a deleted comment which you may have noticed). When compiled with -gpu=managed, I only need to provide !$acc enter data copyin(G, CS). This works both inside and outside of the function. I can also migrate the example loop in my production code to our GPU.

I can also turn off managed memory and explicitly copy the fields which appear in the loop. I did not need to copy every field.

This also works in the production code, so I believe that I have finally overcome this particular hurdle. Thanks so much for your help!
