Question about data movement as seen from compiler feedback

appleluo · January 23, 2013, 6:32pm

Hello,

I have a subroutine that contains a data region and two parallel regions inside the data region, as below:

subroutine sub1
…
!$acc data copyin(a,b,c)
!$acc parallel
…
!$acc end parallel
…
!$acc parallel
…
!$acc end parallel
!$acc end data

end subroutine sub1

When I compile the code, the compiler outputs
generating copyin(a)
generating copyin(b)
generating copyin(c)
at three places: where the data region starts and where each of the two parallel regions starts. Does that mean the code will copy in a, b, and c at all three places? I know that shouldn't be the case, but then how should I interpret the messages?

Thanks,

Ping

MatColgrove · January 23, 2013, 8:06pm

Hi Ping,

Are you sure they are all copyins or are two present_or_copyins? There will be a present before each parallel region in order to allow for pointer swapping.

Mat

appleluo · January 23, 2013, 8:34pm

Hi Mat,

Thanks for the quick response.

I saw exactly three copyins for every variable in the list: one at the beginning of the data region and one at the beginning of each of the two parallel regions. I didn't specify present(a,b,c) on the parallel regions. Would that cause the problem?

Ping

appleluo · January 23, 2013, 8:43pm

Hi Mat,

Adding present(a, b, c) solved the problem. Thanks.

Ping

appleluo · January 24, 2013, 4:42pm

Hi Mat,

I still have some doubts.

First, the OpenACC standard says an array referenced in the kernels or parallel construct that doesn’t appear in a data clause for the construct or any enclosing data construct will be treated as if it appeared in a present_or_copy for the construct. This means my original code should generate two present_or_copy, right?

Second, after adding the present clause, the compiler now shows present_or_copyin. Shouldn't it be present only?

Third, in terms of overhead, does present_or_copyin add much more overhead than present?

Thanks,

Ping

MatColgrove · January 24, 2013, 9:24pm

Hi Ping,

Could you post an example of the Minfo output as well as a reproducing example? This will help me answer your first two questions.

For the third, yes, there is some overhead in performing the present look-up, but it is fairly small.

Mat

appleluo · January 25, 2013, 4:24pm

Hi Mat,

Here is an example.

========= Begin program ==========
module mod1
  real*8, allocatable :: a(:,:), b(:,:), c(:,:)
end module mod1

program prog1
  use mod1

  allocate(a(100,100),b(100,100),c(100,100))
  c=0.0d0
  a=1.13240d0
  b=2.33413d0

  call sub1

end program prog1

subroutine sub1
  use mod1
  integer i,j,k
  !$acc data copyin(a,b) copy(c)

  !$acc kernels loop present(a, b, c)
  do j=1,100
    do i=1,100
      do k=1,100
        c(i,j) = c(i,j)+a(i,k)*b(k,j)
      enddo
    enddo
  enddo

  !$acc end kernels

  !$acc end data

end subroutine sub1
=========End of program===============
======== Begin compiler output ===========
pgfortran -acc -Minfo main.f90
prog1:
      9, Memory zero idiom, array assignment replaced by call to pgf90_mzero8
     10, Memory set idiom, array assignment replaced by call to pgf90_mset8
     11, Memory set idiom, array assignment replaced by call to pgf90_mset8
sub1:
     20, Generating copyin(b(:,:))
         Generating copyin(a(:,:))
         Generating copy(c(:,:))
     22, Generating present_or_copy(c(:,:))
         Generating present_or_copyin(b(:,:))
         Generating present_or_copyin(a(:,:))
         Generating compute capability 1.3 binary
         Generating compute capability 2.0 binary
     23, Loop is parallelizable
     24, Loop is parallelizable
     25, Complex loop carried dependence of 'c' prevents parallelization
         Loop carried dependence of 'c' prevents parallelization
         Loop carried backward dependence of 'c' prevents vectorization
         Inner sequential loop scheduled on accelerator
         Accelerator kernel generated
         23, !$acc loop gang ! blockidx%y
         24, !$acc loop gang, vector(128) ! blockidx%x threadidx%x
         25, CC 1.3 : 17 registers; 136 shared, 4 constant, 0 local memory bytes
             CC 2.0 : 33 registers; 0 shared, 152 constant, 0 local memory bytes
==========End compiler output==================

If I delete present(a, b, c) from the kernels loop directive, the output from the compiler is as follows:

sub1:
     20, Generating copyin(b(:,:))
         Generating copyin(a(:,:))
         Generating copy(c(:,:))
     22, Generating copy(c(:,:))
         Generating copyin(a(:,:))
         Generating copyin(b(:,:))
         Generating compute capability 1.3 binary
         Generating compute capability 2.0 binary

Thanks,

Ping

MatColgrove · January 28, 2013, 6:42pm

Hi Ping,

You must be using an older version of the compiler. The Minfo messages originally hadn’t been updated to reflect the “present_or_copy…” change that occurred in the 12.6 release. This was corrected in the 12.9 release.

Here’s the output from 12.8 and 12.9:

% pgf90 -acc -Minfo test2.f90 -V12.8
prog1:
      9, Memory zero idiom, array assignment replaced by call to pgf90_mzero8
     10, Memory set idiom, array assignment replaced by call to pgf90_mset8
     11, Memory set idiom, array assignment replaced by call to pgf90_mset8
sub1:
     20, Generating copyin(b(:,:))
         Generating copyin(a(:,:))
         Generating copy(c(:,:))
     22, Generating copy(c(:,:))
         Generating copyin(a(:,:))
         Generating copyin(b(:,:))
         Generating compute capability 1.3 binary
         Generating compute capability 2.0 binary
     24, Loop is parallelizable
     25, Loop is parallelizable
     26, Complex loop carried dependence of 'c' prevents parallelization
         Loop carried dependence of 'c' prevents parallelization
         Loop carried backward dependence of 'c' prevents vectorization
         Inner sequential loop scheduled on accelerator
         Accelerator kernel generated
         24, !$acc loop gang ! blockidx%y
         25, !$acc loop gang, vector(128) ! blockidx%x threadidx%x
         26, CC 1.3 : 17 registers; 128 shared, 4 constant, 0 local memory bytes
             CC 2.0 : 33 registers; 0 shared, 144 constant, 0 local memory bytes

% pgf90 -acc -Minfo test2.f90 -V12.9
prog1:
      9, Memory zero idiom, array assignment replaced by call to pgf90_mzero8
     10, Memory set idiom, array assignment replaced by call to pgf90_mset8
     11, Memory set idiom, array assignment replaced by call to pgf90_mset8
sub1:
     20, Generating copyin(b(:,:))
         Generating copyin(a(:,:))
         Generating copy(c(:,:))
     22, Generating present_or_copy(c(:,:))
         Generating present_or_copyin(a(:,:))
         Generating present_or_copyin(b(:,:))
         Generating compute capability 1.3 binary
         Generating compute capability 2.0 binary
     24, Loop is parallelizable
     25, Loop is parallelizable
     26, Complex loop carried dependence of 'c' prevents parallelization
         Loop carried dependence of 'c' prevents parallelization
         Loop carried backward dependence of 'c' prevents vectorization
         Inner sequential loop scheduled on accelerator
         Accelerator kernel generated
         24, !$acc loop gang ! blockidx%y
         25, !$acc loop gang, vector(128) ! blockidx%x threadidx%x
         26, CC 1.3 : 17 registers; 112 shared, 4 constant, 0 local memory bytes
             CC 2.0 : 42 registers; 0 shared, 128 constant, 0 local memory bytes

Sorry for the confusion,
Mat
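
For readers who want to try the two cases side by side, here is a minimal sketch, not taken from the posts above. It assumes the behavior discussed in this thread: transfers happen at the data region, and compute regions nested inside it only look the arrays up in the present table. The names demo, n, and expected are illustrative. The directives are plain comments to a compiler without OpenACC enabled, so the program also runs on the host and checks its own result.

```fortran
! Sketch only: demo, n, and expected are illustrative names, not from the thread.
program demo
  implicit none
  integer, parameter :: n = 100
  real(8), allocatable :: a(:,:), b(:,:), c(:,:)
  real(8) :: expected
  integer :: i, j, k

  allocate(a(n,n), b(n,n), c(n,n))
  a = 1.0d0
  b = 2.0d0
  c = 0.0d0

  ! The transfers are requested here: a and b are copied in,
  ! c is copied in and copied back at "end data".
  !$acc data copyin(a, b) copy(c)

  ! Case 1: explicit present clause. The arrays are expected to be
  ! found on the device already, so no new transfer should occur.
  !$acc kernels loop present(a, b, c)
  do j = 1, n
    do i = 1, n
      do k = 1, n
        c(i,j) = c(i,j) + a(i,k)*b(k,j)
      end do
    end do
  end do

  ! Case 2: no data clauses. Per the rule quoted earlier in the thread,
  ! this is handled as present_or_copy, and because c is already present
  ! from the enclosing data region, no new transfer is expected either.
  !$acc kernels loop
  do j = 1, n
    do i = 1, n
      c(i,j) = c(i,j) + 1.0d0
    end do
  end do

  !$acc end data

  ! Each element is n products of 1.0*2.0, plus 1.0 from the second loop.
  expected = 2.0d0*n + 1.0d0
  if (any(abs(c - expected) > 1.0d-9)) then
    print *, 'mismatch'
    error stop
  end if
  print *, 'ok'
end program demo
```

Compiling this with -acc -Minfo, as in the commands above, should show which data messages your compiler version prints at each of the two kernels regions.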
