On the usage of the OpenACC `attach` clause

Hello,

I am wondering when to use the `attach` clause. I have two different use cases in a Fortran code:

  1. A pointer to a 3D array that is allocated on the device, to be used inside a kernel.
  2. An array of pointers to 3D arrays that are allocated on the device, to be used inside a kernel.

See two minimal working examples below.

program p
  implicit none
  integer, parameter :: n = 50
  call bla(n)
  call bla(n)
  call bla(n)
contains
  subroutine bla(n)
    integer, intent(in) :: n
    real, allocatable, target , save :: a_t(:,:,:)
    real,              pointer, save :: a_p(:,:,:)
    logical, save :: is_first = .true.
    integer :: i,j,k
    if(is_first) then
      is_first = .false.
      allocate(a_t(n,n,n))
      a_t(:,:,:) = 0.
      a_p => a_t
      !$acc enter data create(a_t)
      !!$acc enter data attach(a_p) ! **NOT NEEDED**
    end if
    !
    !$acc parallel loop collapse(3) default(present)
    do k=1,n
      do j=1,n
        do i=1,n
          a_p(i,j,k) = a_p(i,j,k) + 1.
        end do
      end do
    end do
    !$acc update self(a_p)
    print*,a_p(5,5,5)
  end subroutine bla
end program p

This program doesn’t need the `!$acc enter data attach(a_p)` directive to run correctly:

$ nvfortran -acc test.f90 && ./a.out
    1.000000
    2.000000
    3.000000

Now, in this program:

program p
  implicit none
  integer, parameter :: n = 50
  call bla(n)
  call bla(n)
  call bla(n)
contains
  subroutine bla(n)
    integer, intent(in) :: n
    type :: arr
      real,          allocatable :: s(:,:,:)
    end type arr
    type :: arr_ptr
      real, pointer,  contiguous :: s(:,:,:)
    end type arr_ptr
    type(arr)    , allocatable, target, save :: a_t(:)
    type(arr_ptr), allocatable        , save :: a_p(:)
    logical, save :: is_first = .true.
    integer :: i,j,k
    if(is_first) then
      is_first = .false.
      allocate(a_t(2))
      allocate(a_t(1)%s(n,n,n))
      a_t(1)%s(:,:,:) = 0.
      allocate(a_p(1))
      a_p(1)%s => a_t(1)%s
      !$acc enter data create(a_t,a_p)
      !$acc enter data copyin(a_t(1)%s)
      !!$acc enter data attach(a_p(1)%s) ! **NEEDED**
    end if
    !
    !$acc parallel loop collapse(3) default(present)
    do k=1,n
      do j=1,n
        do i=1,n
          a_p(1)%s(i,j,k) = a_p(1)%s(i,j,k) + 1.
        end do
      end do
    end do
    !$acc update self(a_p(1)%s)
    print*,a_p(1)%s(5,5,5)
  end subroutine bla
end program p

This program does need the `!$acc enter data attach(a_p(1)%s)` directive (commented out above); without it, the run fails:

$ nvfortran -acc test_attach.f90 && ./a.out
Failing in Thread:1
Accelerator Fatal Error: call to cuStreamSynchronize returned error 700 (CUDA_ERROR_ILLEGAL_ADDRESS): Illegal address during kernel execution

For the first program, is this valid OpenACC, or is the compiler ensuring the “attachment”? In the latter case, would it make sense for the compiler to print a message such as `Generating implicit attach`, similar to its other feedback messages (although perhaps the compiler is not generating OpenACC code to ensure the “attachment”)?

If the code is not (OpenACC) standard-conforming, I guess it would be good to always use an attach directive in cases like this.

Thanks in advance!
Pedro

In your first example, `a_p` is a top-level entity. The data clauses really only refer to the data, your three-dimensional array. When the runtime reaches the kernel, the device data associated with the host address that `a_p` points to is present, so all is fine. Storage for the pointer `a_p` itself doesn’t even need to be on the device (as an optimization). Any bounds info needed is probably passed in as kernel arguments.

In your second example, the pointer `s` is inside another entity, the derived type `a_p`. There is storage associated with the pointer, holding the bounds etc. along with a pointer to the data, all within `a_p`. So, when you do a data create on `a_p`, we duplicate the contents of `a_p` on the device. But the compiler doesn’t necessarily know much, or anything, about the pointer at that point. It might not even be “associated”, in the Fortran sense. We just do a memcpy on the contents.

Once the three-dimensional array is copied to the GPU, as you do, all that is left is to “attach” the pointer (and maybe the dimensions and similar info) correctly, so that the pointer within `a_p` actually points to the device array. You need to do that because, within the kernel in your second example, `s` is inside a derived type, so the compiler uses known offsets into the top-level type to locate the bounds and the data pointer.
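
As a rough sketch of the sequence described above for the second example (condensed from the second program in the question, not a definitive recipe), the three steps might look like this. The program name `attach_sketch` is made up for illustration, `a_t` is sized 1 instead of 2 for brevity, and the `attach` line is the one that was commented out in the question:

```fortran
program attach_sketch
  implicit none
  integer, parameter :: n = 50
  type :: arr
    real, allocatable :: s(:,:,:)
  end type arr
  type :: arr_ptr
    real, pointer, contiguous :: s(:,:,:)
  end type arr_ptr
  type(arr),     allocatable, target :: a_t(:)
  type(arr_ptr), allocatable         :: a_p(:)
  integer :: i, j, k

  allocate(a_t(1), a_p(1))
  allocate(a_t(1)%s(n,n,n))
  a_t(1)%s(:,:,:) = 0.
  a_p(1)%s => a_t(1)%s   ! host-side pointer association only

  ! Step 1: put the containers on the device. The device copy of
  ! a_p(1)%s does not yet refer to any device array.
  !$acc enter data create(a_t, a_p)
  ! Step 2: put the actual 3D array data on the device.
  !$acc enter data copyin(a_t(1)%s)
  ! Step 3: make the device copy of a_p(1)%s refer to the device array.
  !$acc enter data attach(a_p(1)%s)

  !$acc parallel loop collapse(3) default(present)
  do k = 1, n
    do j = 1, n
      do i = 1, n
        a_p(1)%s(i,j,k) = a_p(1)%s(i,j,k) + 1.
      end do
    end do
  end do

  ! Bring the result back to the host before printing.
  !$acc update self(a_p(1)%s)
  print *, a_p(1)%s(5,5,5)
end program attach_sketch
```

The order matters here: the array has to be present on the device before step 3, since attaching only updates the pointer inside the device copy of `a_p` and does not move any array data itself.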