Question about data movement as seen from compiler feedback

Hello,

I have a subroutine that contains a data region and two parallel regions inside the data region, as below:

subroutine sub1

!$acc data copyin(a,b,c)
!$acc parallel

!$acc end parallel

!$acc parallel

!$acc end parallel
!$acc end data

end subroutine sub1

When I compile the code, the compiler outputs
generating copyin(a)
generating copyin(b)
generating copyin(c)
at three places: where the data region starts and where each of the two parallel regions starts. Does that mean the code will copy in a, b, and c at all three places? I know this shouldn’t be the case, but how should I interpret the message?

Thanks,

Ping

Hi Ping,

Are you sure they are all copyins, or are two of them present_or_copyins? There will be a present check before each parallel region in order to allow for pointer swapping.

  • Mat

Hi Mat,

Thanks for the quick response.

I saw exactly three copyins for every variable in the list: one at the beginning of the data region and one at the beginning of each of the two parallel regions. I didn’t specify present(a,b,c) on the parallel regions. Would that cause the problem?

Ping

Hi Mat,

Adding present(a, b, c) solved the problem. Thanks.


Ping

Hi Mat,

I still have some doubts.

First, the OpenACC standard says that an array referenced in a kernels or parallel construct that doesn’t appear in a data clause for the construct or any enclosing data construct will be treated as if it appeared in a present_or_copy clause for the construct. This means my original code should generate two present_or_copy messages, right?

Second, after adding the present clause, the compiler now shows present_or_copyin. Shouldn’t it be present only?

Third, in terms of overhead, does present_or_copyin incur much more overhead than present?

Thanks,

Ping

Hi Ping,

Could you post an example of the Minfo output as well as a reproducing example? This will help me answer your first two questions.

For the third, yes, there is some overhead in performing the present look-up, but it is fairly small.
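To make the comparison concrete, here is a minimal sketch. It is not taken from the code discussed in this thread; the program name present_demo, the array size, and the initial values are made up for illustration. It assumes a compiler that follows the OpenACC rule quoted above. Inside an enclosing data region, both an explicit present clause and the implicit present_or_copy find the existing device copies, so neither should trigger a transfer. The difference is that present fails at run time if the data is missing, while present_or_copy falls back to allocating and copying.

```fortran
! Illustrative sketch only: names and sizes are assumptions, not from the thread.
program present_demo
  implicit none
  integer, parameter :: n = 100
  real(8), allocatable :: a(:,:), b(:,:), c(:,:)
  integer :: i, j, k

  allocate(a(n,n), b(n,n), c(n,n))
  a = 1.0d0
  b = 2.0d0
  c = 0.0d0

  ! The data region owns the transfers: a and b are copied in at entry,
  ! c is copied in at entry and copied back at exit.
  !$acc data copyin(a, b) copy(c)

  ! Explicit present: asserts that a, b, c are already on the device.
  !$acc kernels loop present(a, b, c)
  do j = 1, n
    do i = 1, n
      do k = 1, n
        c(i,j) = c(i,j) + a(i,k) * b(k,j)
      end do
    end do
  end do
  !$acc end kernels loop

  ! No data clause: c is treated as present_or_copy. The present look-up
  ! finds the copy created by the data region, so nothing is transferred.
  !$acc kernels loop
  do j = 1, n
    do i = 1, n
      c(i,j) = c(i,j) + 1.0d0
    end do
  end do
  !$acc end kernels loop

  !$acc end data

  ! Each element should be n*1*2 + 1 = 201 after both regions.
  if (all(abs(c - 201.0d0) < 1.0d-9)) then
    print *, 'PASS'
  else
    print *, 'FAIL'
  end if
end program present_demo
```

Compiling this with the same flags used in this thread (-acc -Minfo) should show the data movement messages for the data region and for each kernels region, which can be compared against the run-time behaviour. Without -acc the directives are treated as comments and the program runs on the host.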

  • Mat

Hi Mat,

Here is an example.

========= Begin program ==========
module mod1
real*8, allocatable :: a(:,:), b(:,:), c(:,:)
end module mod1

program prog1
use mod1

allocate(a(100,100),b(100,100),c(100,100))
c=0.0d0
a=1.13240d0
b=2.33413d0

call sub1

end program prog1

subroutine sub1
use mod1
integer i,j,k
!$acc data copyin(a,b) copy(c)

!$acc kernels loop present(a, b, c)
do j=1,100
  do i=1,100
    do k=1,100
      c(i,j) = c(i,j)+a(i,k)*b(k,j)
    enddo
  enddo
enddo

!$acc end kernels

!$acc end data

end subroutine sub1
=========End of program===============
======== Begin compiler output ===========
pgfortran -acc -Minfo main.f90
prog1:
      9, Memory zero idiom, array assignment replaced by call to pgf90_mzero8
     10, Memory set idiom, array assignment replaced by call to pgf90_mset8
     11, Memory set idiom, array assignment replaced by call to pgf90_mset8
sub1:
     20, Generating copyin(b(:,:))
         Generating copyin(a(:,:))
         Generating copy(c(:,:))
     22, Generating present_or_copy(c(:,:))
         Generating present_or_copyin(b(:,:))
         Generating present_or_copyin(a(:,:))
         Generating compute capability 1.3 binary
         Generating compute capability 2.0 binary
     23, Loop is parallelizable
     24, Loop is parallelizable
     25, Complex loop carried dependence of 'c' prevents parallelization
         Loop carried dependence of 'c' prevents parallelization
         Loop carried backward dependence of 'c' prevents vectorization
         Inner sequential loop scheduled on accelerator
         Accelerator kernel generated
         23, !$acc loop gang ! blockidx%y
         24, !$acc loop gang, vector(128) ! blockidx%x threadidx%x
         25, CC 1.3 : 17 registers; 136 shared, 4 constant, 0 local memory bytes
             CC 2.0 : 33 registers; 0 shared, 152 constant, 0 local memory bytes
==========End compiler output==================

If I delete present(a, b, c) from the kernels construct, the output from the compiler is as follows:

sub1:
     20, Generating copyin(b(:,:))
         Generating copyin(a(:,:))
         Generating copy(c(:,:))
     22, Generating copy(c(:,:))
         Generating copyin(a(:,:))
         Generating copyin(b(:,:))
         Generating compute capability 1.3 binary
         Generating compute capability 2.0 binary


Thanks,

Ping

Hi Ping,

You must be using an older version of the compiler. The Minfo messages originally hadn’t been updated to reflect the “present_or_copy…” change that occurred in the 12.6 release. This was corrected in the 12.9 release.

Here’s the output from 12.8 and 12.9:

% pgf90 -acc -Minfo test2.f90 -V12.8
prog1:
      9, Memory zero idiom, array assignment replaced by call to pgf90_mzero8
     10, Memory set idiom, array assignment replaced by call to pgf90_mset8
     11, Memory set idiom, array assignment replaced by call to pgf90_mset8
sub1:
     20, Generating copyin(b(:,:))
         Generating copyin(a(:,:))
         Generating copy(c(:,:))
     22, Generating copy(c(:,:))
         Generating copyin(a(:,:))
         Generating copyin(b(:,:))
         Generating compute capability 1.3 binary
         Generating compute capability 2.0 binary
     24, Loop is parallelizable
     25, Loop is parallelizable
     26, Complex loop carried dependence of 'c' prevents parallelization
         Loop carried dependence of 'c' prevents parallelization
         Loop carried backward dependence of 'c' prevents vectorization
         Inner sequential loop scheduled on accelerator
         Accelerator kernel generated
         24, !$acc loop gang ! blockidx%y
         25, !$acc loop gang, vector(128) ! blockidx%x threadidx%x
         26, CC 1.3 : 17 registers; 128 shared, 4 constant, 0 local memory bytes
             CC 2.0 : 33 registers; 0 shared, 144 constant, 0 local memory bytes

% pgf90 -acc -Minfo test2.f90 -V12.9
prog1:
      9, Memory zero idiom, array assignment replaced by call to pgf90_mzero8
     10, Memory set idiom, array assignment replaced by call to pgf90_mset8
     11, Memory set idiom, array assignment replaced by call to pgf90_mset8
sub1:
     20, Generating copyin(b(:,:))
         Generating copyin(a(:,:))
         Generating copy(c(:,:))
     22, Generating present_or_copy(c(:,:))
         Generating present_or_copyin(a(:,:))
         Generating present_or_copyin(b(:,:))
         Generating compute capability 1.3 binary
         Generating compute capability 2.0 binary
     24, Loop is parallelizable
     25, Loop is parallelizable
     26, Complex loop carried dependence of 'c' prevents parallelization
         Loop carried dependence of 'c' prevents parallelization
         Loop carried backward dependence of 'c' prevents vectorization
         Inner sequential loop scheduled on accelerator
         Accelerator kernel generated
         24, !$acc loop gang ! blockidx%y
         25, !$acc loop gang, vector(128) ! blockidx%x threadidx%x
         26, CC 1.3 : 17 registers; 112 shared, 4 constant, 0 local memory bytes
             CC 2.0 : 42 registers; 0 shared, 128 constant, 0 local memory bytes

Sorry for the confusion,
Mat