Implicit OpenACC copies and full deep copies

The following program, compiled with nvfortran -acc=gpu -gpu=deepcopy boring.f90, doesn’t complete successfully unless the copy to device memory is made explicitly. Tested with nvfortran 22.5. Is this the intended behaviour? At first sight I’d have thought the aggregate variable would be treated as if it appeared in a copy clause anyway, according to the OpenACC standard, but I could be wrong.

program boring

implicit none

integer, parameter :: n = 10
integer :: i

type :: struct
  integer, dimension(:), pointer :: array
end type struct
type(struct) :: a, b

allocate(a % array(n), b % array(n))

a % array = (/ (i, i=1,n) /)

!$acc kernels !copy(a, b) ! uncommenting produces intended behaviour
b % array(:) = a % array(:)
!$acc end kernels

write(*,*) b % array(n / 2)

end program boring

Thank you,
-Nuno

I believe so. “deepcopy” works within data clauses. Without the “copy” clause, the compiler needs to perform an implicit copy which is a different mechanism.

-Mat

Ok, I’m happy with that, but I’m a bit confused regarding the implicit copy mechanism. Take a simpler variant of the previous code:

program boring

implicit none

integer, parameter :: n = 10
integer :: i

type :: struct
  integer, dimension(:), allocatable :: array
end type struct
type(struct) :: a

allocate(a % array(n))

!$acc kernels !copyin(a) copy(a%array(:))
a % array = 1
!$acc end kernels

write(*,*) a % array

end program boring
  • with nvfortran -acc=gpu -Minfo=all -o boring boring.f90, we get:
    boring:
        15, Generating implicit copyin(a) [if not already present]
            Generating implicit copy(a%array(:)) [if not already present]
        16, Loop is parallelizable
            Generating NVIDIA GPU code
            16, !$acc loop gang, vector(128) ! blockidx%x threadidx%x
    
    and the code crashes, which I would expect if a weren’t being copied in, but the compiler claims it is?
  • with nvfortran -acc=gpu -gpu=deepcopy -Minfo=all -o boring boring.f90, we get:
    boring:
         15, Generating implicit copyin(a) [if not already present]
             Generating implicit copy(a%array(:)) [if not already present]
         16, Loop is parallelizable
             Generating NVIDIA GPU code
             16, !$acc loop gang, vector(128) ! blockidx%x threadidx%x
    
    which is the same as above, but the program now completes albeit with unexpected results.
  • if I uncomment the manual copies above which the compiler claims it’s doing anyway and compile with nvfortran -acc=gpu -Minfo=all -o boring boring.f90, the program executes as expected.

Thank you for your help,
-Nuno

The implicit mechanism doesn’t support deep copy since it can’t do the attach operation, i.e. “a” and “array” are implicitly copied separately but aren’t associated. Since shallow copies are done, “a” would still have a host address for “array”. Hence when accessing “a%array”, the program gets the illegal address error.

I suspect in the second case, the deepcopy of “a” is occurring, but the second implicit copy of “array” is causing issues.

Deep copies must be done via data clauses, either as a manual deep copy or as a true deep copy (i.e. with the -gpu=deepcopy flag).
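As a minimal sketch of the manual deep copy mentioned above (not taken from the thread; the program name manual_deepcopy is made up for illustration), the parent can be named in a data clause before its allocatable member, so that “a” is already present when “a%array” is copied and the attach can take place:

```fortran
program manual_deepcopy

implicit none

integer, parameter :: n = 10

type :: struct
  integer, dimension(:), allocatable :: array
end type struct
type(struct) :: a

allocate(a % array(n))

! Parent first: a shallow copy of "a" is placed on the device.
! Member second: "a%array" is copied and, since "a" is already
! present, the device copy of "a" can be attached to it.
!$acc data copyin(a) copy(a%array(:))

! Both objects are present here, so no implicit copies are needed.
!$acc kernels present(a)
a % array = 1
!$acc end kernels

!$acc end data

write(*,*) a % array

deallocate(a % array)

end program manual_deepcopy
```

This assumes compilation without -gpu=deepcopy, e.g. nvfortran -acc=gpu -Minfo=accel. When OpenACC is not enabled, the directives are treated as comments and the program simply runs on the host.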

The implicit mechanism doesn’t support deep copy since it can’t do the attach operation, i.e. “a” and “array” are implicitly copied separately but aren’t associated. Since shallow copies are done, “a” would still have a host address for “array”. Hence when accessing “a%array”, the program gets the illegal address error.

Ok, that makes sense because I didn’t ask for deep copies. But it is my opinion the compiler shouldn’t report it’s placing an implicit copy since, according to the standard, that would imply an attach action is being performed. I know this automatic support for deep copies via a compiler flag isn’t covered by the standard, but I still think the compiler shouldn’t be reporting what it is. Does that sound fair?

I suspect in the second case, the deepcopy of “a” is occurring, but the second implicit copy of “array” is causing issues.

This is a bit odd, especially because the compiler report doesn’t give me any hints about the different behaviour. In addition, the version with the manual deep copy can be compiled with -gpu=deepcopy without any problems, so the fact that I explicitly copy “array”, effectively repeating the copy in that case, is harmless. The only documentation we have is: “-gpu=deepcopy: Enable full deep copy of aggregate data structions in OpenACC; Fortran only”. Can all this be clarified? Perhaps with a note on undefined behaviour, which would also excuse the misleading compiler reports? Also, is “structions” a typo?

But it is my opinion the compiler shouldn’t report it’s placing an implicit copy since, according to the standard, that would imply an attach action is being performed.

I’d need you to point me to what section you’re looking at, but the implicit attach would occur when explicitly copying an allocatable member when the parent has already been created.

For the copy of ‘a’, section 2.7.1 lines 1540 - 1542 would apply:

1540 In Fortran, if a variable or array of composite type appears, all the members of that derived
1541 type are allocated and copied, as appropriate. If any member has the allocatable or
1542 pointer attribute, the data accessed through that member are not copied.

For composite types, the compiler can’t implicitly deep copy them. The standard way of doing this would be via a manual deep copy using data clauses. We do offer the “deepcopy” flag as an extension, but that’s not designed to work with implicit copies.
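A minimal sketch of how the “deepcopy” extension appears to be intended for use, assuming compilation with nvfortran -acc=gpu -gpu=deepcopy: only the parent is named in an explicit data clause, and the flag is relied on to handle the allocatable member. The program name true_deepcopy is made up for illustration, and the behaviour described in the comments is an assumption based on this discussion rather than on the OpenACC standard.

```fortran
program true_deepcopy

implicit none

integer, parameter :: n = 10

type :: struct
  integer, dimension(:), allocatable :: array
end type struct
type(struct) :: a

allocate(a % array(n))
a % array = 0

! Assumption: with -gpu=deepcopy, this explicit copy(a) also
! copies and attaches the allocatable member "a%array".
! Without the flag, only a shallow copy of "a" would be made.
!$acc data copy(a)

! present(a) avoids falling back to the implicit copy mechanism.
!$acc kernels present(a)
a % array = 1
!$acc end kernels

!$acc end data

write(*,*) a % array

deallocate(a % array)

end program true_deepcopy
```

The design point is that the derived type always appears in an explicit data clause, so nothing is left to the implicit copy mechanism.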

For documentation, Michael Wolfe wrote a few detailed blog posts about manual and true deep copy:

These were written before the rebranding of PGI to NV HPC, so some of the flags have changed (specifically -ta=tesla:deepcopy is now -gpu=deepcopy), but the information is still relevant. He does discuss the proposed “shape” and “policy” directives, but these didn’t get adopted into the OpenACC spec. We still support them, but as an extension.

Also, is “structions” a typo?

Yes, most likely it should be “structure”. I’ll let our folks know.

Sections 2.6.2 on Implicitly Determined Data Attributes and 2.7.6 on the copy clause. You will tell me, and I would agree, that this does not strictly apply because the behaviour we are talking about here is not covered by the standard; it is rather an extension provided by Nvidia, i.e. it relies on -gpu=deepcopy. My argument is simply that, being an extension, it needs appropriate documentation because I cannot resort to the standard. I’m just asking for the following to become part of the documentation:

and perhaps for the compiler not to tell me it’s adding a copy clause, because as far as the standard is concerned that has a very specific meaning.

I’m indeed aware of the articles you linked to, I do appreciate it. Indeed, one reads (emphasis mine):

Right now, the PGI compilers have a command line flag to enable implicit full deep copy, where all allocatable or pointer members of a derived type will be processed whenever a derived type is processed.

When the compiler is telling me it’s generating an implicit copyin for a, couldn’t that be understood as a derived type being processed? I’m perfectly happy to accept my example is definitely not covered by the standard and that it is also not covered by your compiler extension, but then I think it should be documented as undefined behaviour or similar.

Thanks a lot Mat, these discussions are really helpful,
-Nuno

Hi Nuno,

Not trying to be argumentative, but I prefer our team focus on documenting best practices, which hopefully Michael’s articles illustrate. True deep copy is used the same as manual deep copy without the need to copy data members.

It’s best not to rely on implicit copying of data. There’s only so much the compiler can do (as Michael likes to say “there’s no magic feather”) and the programmer needs to help. Ultimately we’d like to remove the need for managing data altogether. Unified Memory helps here, but until UM can support static and host stack data in addition to dynamic data, we’re not able to fully get there.

For data management, it’s best practice to take a top-down approach using data regions. Waiting until the compute region to copy data, either implicitly or explicitly, causes excessive data movement and poor performance. Although this doesn’t matter in your toy example, it does in larger applications.

Here’s the template I’d recommend you use:

  1. Use unstructured data regions to create the data at the same spot at which the variable is declared or allocated
  2. Use update directives to explicitly copy data
  3. Use the “present” clause, or “default(present)”, so implicit copy isn’t performed
  4. Add exit data directives before the data is deallocated or the variable goes out of scope

This strategy makes the lifetime and scope of the host and device copies match. Using update directives gives greater control over when data movement occurs.

% cat test.F90
program boring

implicit none

integer, parameter :: n = 10
integer :: i

type :: struct
  integer, dimension(:), allocatable :: array
end type struct
type(struct) :: a

allocate(a % array(n))
!$acc enter data create(a)
!! comment out the next line if using -gpu=deepcopy
!$acc enter data create(a%array(:n))

!$acc kernels present(a)
a % array = 1
!$acc end kernels

!$acc update self(a%array(:n))

write(*,*) a % array
!$acc exit data delete(a%array)
!$acc exit data delete(a)

end program boring
% nvfortran -acc -Minfo=accel  test.F90
boring:
     14, Generating enter data create(a)
     16, Generating enter data create(a%array(:10))
     19, Generating present(a)
     20, Loop is parallelizable
         Generating NVIDIA GPU code
         20, !$acc loop gang, vector(128) ! blockidx%x threadidx%x
     23, Generating update self(a%array(:10))
     26, Generating exit data delete(a%array(:))
     27, Generating exit data delete(a)
% a.out
            1            1            1            1            1            1
            1            1            1            1

Hope this helps,
Mat

Hi Mat,

At the risk of adding fuel to the fire, here’s something I found, by chance really…

If you take the simpler variant of the boring program I provided above, declare array as pointer instead of allocatable and compile with -gpu=deepcopy or -O0 (the latter still puzzles me), then the program does indeed execute correctly, even without the explicit copies.
Defining NV_ACC_NOTIFY=3 at runtime is really useful in this case as it reveals the pointers are being attached correctly.

-Nuno

Hi Nuno,

At the risk of adding fuel to the fire, here’s something I found, by chance really…

Not a problem.

I see this too but it’s just luck. At -O0 “a” happens to be copied before “array” so that the attach can occur. At higher opt levels, “array” gets copied first so no attach.

Though this did get me to question whether we could force the order so that the parent is always copied before the children when doing the implicit copy, so that the attach happens. I asked Michael Wolfe, but the short answer is no, at least not during compilation. The problem is aliasing, which can’t always be resolved at compile time. He did say that one of our engineers is looking at making changes to the runtime, where aliasing can be resolved, so perhaps it can be solved there.

Though for the time being, you will still want to add the explicit copy clause. Plus even if we can get in this support, other compilers may not, so relying on it would cause portability issues.
