Question regarding copyin and copyout

Hi,

I have two questions regarding the code below.

The first question is regarding the copyin clause. In each solver routine call, new data is generated by the gen_data(…) method on the host side. This data needs to be copied to the device in each solver call. Is my understanding correct that in the first call to the solver routine memory is allocated on the device and the data is copied from the host to the device, but in the second and later calls no new memory is allocated and only the new data generated by gen_data(…) is copied to the device?

The second question is regarding the copyout clause. In the code below the solver routine is called several times from the main loop. At the end of the solver routine I want to update the data on the host. Until now I used the update host(…) directive but I want to make it more visible at the beginning of the data region which data is updated so I inserted a copyout clause. But with the copyout clause the printed data is incorrect. Is the use of copyout correct?

Thank you for your help!

main.f90

 program main
        use openacc
        implicit none
        integer, parameter :: n = 5
        integer :: numdevices
        integer :: i
        real*8 vec(n)


        numdevices = acc_get_num_devices(acc_device_nvidia)
        if(numdevices.ne.0)then
         call acc_set_device_num(mod(0,numdevices),acc_device_nvidia)
        endif

        vec = 0.0d0

        ! main loop
        !$acc data copyin(vec)
        do i=1,10
           call solver(i, n, vec)
           write(*,*)vec
        enddo
        !$acc end data

      end program main

solver.f90

subroutine solver(it, n, vec)
        use MyParams
        implicit none
        integer it, i, j, n
        real*8 vec(n)

        ! allocate the memory
        call alloc_data(n,VecData)
        ! generate new data
        call gen_data(it * 1.0d0,VecData)
        write(*,*)"Solver call: ",it

        !$acc data copyin(VecData, VecData%vec, VecData%vec%vec1),&
        !$acc copyout(vec)
        do i=1,1
          !$acc kernels present(vec, VecData%vec%vec1)
          do j=1,n
             vec(j) = VecData%vec%vec1(j)
          enddo
          !$acc end kernels
        enddo
        !!$acc update host(vec)
        !$acc end data

        ! free all the data
        call free_data(VecData)

      end subroutine solver

my_type.f90

module MyData
       implicit none
       type vecs
          real*8, allocatable :: vec1(:)
       end type vecs

       type Dtype
          type(vecs)        :: vec
       end type Dtype

      contains
        subroutine alloc_data(x, vdata)
          integer x
          type(Dtype) :: vdata
          allocate( vdata%vec%vec1(x) )
        end subroutine alloc_data

        subroutine gen_data(d, vdata)
          real*8 d
          type(Dtype) :: vdata
          vdata%vec%vec1 = d
        end subroutine gen_data

        subroutine free_data(vdata)
          type(Dtype) :: vdata
          deallocate(vdata%vec%vec1)
        end subroutine free_data

      end module MyData

      module MyParams
        use MyData
        implicit none
        type(Dtype) :: VecData
      end module MyParams

makefile

all:
        mpif90 -acc -Mcuda -Minfo -ta=tesla my_type.f90 main.f90 solver.f90

clean:
        rm *.o *.mod

Hi Peter85,

Is my understanding correct that in the first call to the solver routine memory is allocated on the device and the data is copied from the host to the device, but in the second and later calls no new memory is allocated and only the new data generated by gen_data(…) is copied to the device?

In this case “VecData” would be allocated and copied to the device each time it enters the data region and then freed upon exit of the region. As an optimization, the compiler may not actually free the device memory and may instead re-use it, but it would still be re-associated with the newly allocated variable and copied to the device each time Solver is called.

Until now I used the update host(…) directive but I want to make it more visible at the beginning of the data region which data is updated so I inserted a copyout clause. But with the copyout clause the printed data is incorrect. Is the use of copyout correct?

No, you’ll want to use the update directive with vec.

The difference between the two cases is that “vec” is already in another data region (i.e. it’s in a nested data region). OpenACC data clauses have a concept of “present or copy”, meaning that if the variable is already present on the device, no action is taken. So here since “vec” is already present on the device, the “copyout(vec)” is basically ignored and you need to explicitly update it.
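To make the “present or copy” behavior concrete, here is a minimal sketch that reduces the question's code to a single array. The program name present_or_demo and the constant value written by the kernel are made up for illustration. When built with OpenACC enabled and run on a device with separate memory, the first write is expected to still show the old host values, and the second write the values computed on the device.

```fortran
program present_or_demo
  implicit none
  integer, parameter :: n = 5
  integer :: j
  real(8) :: vec(n)

  vec = 0.0d0

  ! Outer region: vec is created on the device and becomes "present".
  !$acc data copyin(vec)

  ! Inner region: vec is already present, so copyout neither allocates
  ! on entry nor copies back to the host on exit.
  !$acc data copyout(vec)
  !$acc kernels present(vec)
  do j = 1, n
     vec(j) = 1.0d0
  end do
  !$acc end kernels
  !$acc end data

  ! The host copy has not been refreshed by the inner copyout.
  write(*,*) 'after inner region: ', vec

  ! Explicit device-to-host transfer.
  !$acc update host(vec)
  write(*,*) 'after update host:  ', vec

  !$acc end data
end program present_or_demo
```

If the program is compiled without OpenACC, the directives are treated as comments and both writes show the same values, so the difference only shows up in an accelerated build.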

If you have a scenario where you have multiple calls to solver and the “vec” being passed may or may not be present, you have two options.

  1. Keep the copyin(vec) as you have it now but move the update just after the call to solver. Something like:

in main.f90:

        !$acc data copyin(vec)
        do i=1,10
           call solver(i, n, vec)
          !$acc update host(vec)
           write(*,*)vec
        enddo
        !$acc end data

If Solver is called with a vec that is not present, the copyout(vec) in solver creates it on the device and copies it back to the host when the data region exits. Otherwise, vec is only managed from the outer data region.

  2. Use both copyout and update in Solver, but add the “if_present” clause to update:
        !$acc data copyin(VecData, VecData%vec, VecData%vec%vec1),&
        !$acc copyout(vec)
        do i=1,1
          !$acc kernels present(vec, VecData%vec%vec1)
          do j=1,n
             vec(j) = VecData%vec%vec1(j)
          enddo
          !$acc end kernels
        enddo
        !$acc end data
        !$acc update host(vec) if_present

In this scenario, if “vec” is already present, the copyout will be ignored but “vec” is updated. If it’s not present, the update is ignored but the “copyout” is used.
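As a self-contained sketch of this second option, the program below calls the same routine once with an array that is already present and once with an array that is not. The names solver_mod, fill, a and b are hypothetical and only stand in for the solver and vec from the question.

```fortran
module solver_mod
  implicit none
contains
  subroutine fill(n, val, vec)
    integer, intent(in)    :: n
    real(8), intent(in)    :: val
    real(8), intent(inout) :: vec(n)
    integer :: j

    ! If vec is not present, copyout creates it here and copies it back
    ! at "end data". If vec is already present, copyout does nothing.
    !$acc data copyout(vec)
    !$acc kernels present(vec)
    do j = 1, n
       vec(j) = val
    end do
    !$acc end kernels
    !$acc end data

    ! If vec is still present (outer region), copy it to the host.
    ! If it is no longer present, if_present turns this into a no-op.
    !$acc update host(vec) if_present
  end subroutine fill
end module solver_mod

program if_present_demo
  use solver_mod
  implicit none
  integer, parameter :: n = 5
  real(8) :: a(n), b(n)

  a = 0.0d0
  b = 0.0d0

  ! Case 1: a is managed by an outer data region.
  !$acc data copyin(a)
  call fill(n, 1.0d0, a)
  write(*,*) 'a: ', a
  !$acc end data

  ! Case 2: b is not in any outer data region.
  call fill(n, 2.0d0, b)
  write(*,*) 'b: ', b
end program if_present_demo
```

In both cases the host array should hold the values written on the device by the time fill returns, which is the point of combining the two mechanisms.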

Personally, I prefer the first option since I like to manage the data movement at the same level as the outer data region.

Hope this helps,
Mat

Thank you for your detailed answer!

Regarding my first question: because I allocate and free the host memory in each solver call, the data is moved to the GPU every time? If I want to avoid this movement, I would need to move the memory allocation and de-allocation outside of the solver, before and after the main loop, and use the OpenACC directives for unstructured data regions to move the data to the GPU? (enter data and exit data)

Regarding my second question. It is still not clear to me why the data is not updated on the host by using copyout.
In a simple case, in a non-nested data region update host (if used at the end of the region) and copyout have the same effect, is this correct?
But in the case of the nested data region, since vec is already present on the device, copyout will not allocate new memory on the device; instead it will use the existing buffer. At the same time it will increment the reference count. At the end of the data region it will decrement the reference count, but the count does not reach zero because the outermost data region still references the buffer. So the reference count is not zero and the data is not updated on the host, is this correct?

Thank you for your help.

Because I allocate and free the host memory in each solver call the data is moved every time to the GPU?

Correct. The mirrored copy of the device array is associated with the address of the host array. So if you reallocate on the host, the host address and possibly the size, will be different. So each time you allocate/deallocate the array on the host, you must do this for the device copy as well.

If I want to avoid this movement I would need to move the memory allocation and de-allocation outside of the solver before and after the main loop and use the OpenACC directives for unstructured data regions to move the data to the GPU? (enter data and exit data)

Correct.
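A rough sketch of that restructuring is below. To keep it short it assumes a plain allocatable array, here called src, in place of the VecData derived type, so the nested-member handling from the original code is left out. The program name and the loop count are also made up for illustration.

```fortran
program unstructured_demo
  implicit none
  integer, parameter :: n = 5
  integer :: it, j
  real(8) :: vec(n)
  real(8), allocatable :: src(:)

  ! Allocate once on the host, before the main loop.
  allocate(src(n))
  vec = 0.0d0

  ! Create the device copies once. No data is transferred here.
  !$acc enter data create(src, vec)

  do it = 1, 3
     ! Generate new host data, then push only the values to the device.
     src = real(it, 8)
     !$acc update device(src)

     !$acc kernels present(src, vec)
     do j = 1, n
        vec(j) = src(j)
     end do
     !$acc end kernels

     ! Pull the result back before printing it.
     !$acc update host(vec)
     write(*,*) 'iteration ', it, ': ', vec
  end do

  ! Release the device copies, then the host memory.
  !$acc exit data delete(src, vec)
  deallocate(src)
end program unstructured_demo
```

With this layout the device allocation happens once, and each iteration only pays for the two update transfers.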

In a simple case, in a non-nested data region update host (if used at the end of the region) and copyout have the same effect, is this correct?

Assuming the update is within the scope of the data region, this would have the effect of copying the data twice: once at the update and once when the data region is exited.

So the reference count is not zero and the data is not updated on the host, is this correct?

Yes, you can think of it this way. I typically describe it as “present_or” semantics, meaning if the data is already present (i.e. the reference count is already nonzero when the region is entered), then the copy clause is ignored.

A little history here. The original OpenACC standard had two different groups of copy clauses. The first being “copy”, “copyin”, “copyout”, and “create” where the create and optionally copy would occur but would error if the data was already present. The second was the “present_or” versions, “present_or_copy” (aka “pcopy”), “present_or_copyin” (pcopyin), “present_or_copyout” (pcopyout), “present_or_create” (pcreate), where the action would only occur if the data was not present. The main reason for this was a concern about the overhead of the present check and they didn’t want to force extra overhead if at all possible. Though it turned out this fear was unwarranted as the overhead was minimal.

The first version became problematic since it meant that users needed to know if the data was present or not. So if in your example solver was called in one spot where “vec” was already present, but another where it’s not, using “copyout” would cause the program to error when it’s present. Most users opted to use the “present_or” versions. Plus it made it confusing to have two separate sets of copy clauses. So in the OpenACC 2.5 standard the two versions were merged where “copy” became an alias for “pcopy”.

At one point I advocated for “present_and” semantics where if the data was present it would not be reallocated on the device but the copy would still occur. (I’m assuming that this is your confusion since this is how you most likely think a copyout clause behaves) But this was not adopted since A) adding more ways the copy clause behaves would cause even more confusion, and B) the “update” directive addresses this issue.

One recommendation, and how I tend to program today, is that I typically don’t use structured data regions anymore (i.e. !$acc data / !$acc end data) and instead use unstructured ones with only a create/delete clause (i.e. !$acc enter data create(foo) / !$acc exit data delete(foo)) and then use “update” directives to perform the data movement. Yes, it’s a bit more typing, but by separating them it’s clearer when the data is actually being created and when it’s being copied.

Hope this helps,
Mat

Thanks for your detailed answer! Yes, I think unstructured data regions will make my source code easier to read and understand.