Some troubles with kernel generation in OpenACC

Hi,

I'm having some trouble with my program. The structure of my code is the following (in Fortran):

main function
.
call function1(input arrays)
.
end main

function1(input arrays)
.
!$acc data copy(input and output arrays) , present_or_create(internal arrays)
!$acc kernels

!Then follow about 5 or 6 1D loops
!$acc loop
do i=1.etc
.
.
.

!$acc loop
do i=1.etc
...

!And here is the problem. The first 2D loop
!$acc loop independent gang
  do i=1,N
!$acc loop independent gang vector
    do j=1,M
      independent calculations..
    enddo
  enddo


!and then again follow 1D loops
!$acc loop
do i=1.etc
.
.
.

!$acc end kernels
!$acc end data
end function1

My problem is that when I compile my code, I get the correct parallelization for all the 1D loops,
for example:
108, Loop is parallelizable
Accelerator kernel generated
108, !$acc loop gang, vector(128) ! blockidx%x threadidx%x

but for the 2D loop the compiler gives me the same parallelization.

For example, I expect something like:
57, Loop is parallelizable <-- refers to i
59, Loop is parallelizable <-- refers to j
Accelerator kernel generated
57, !$acc loop gang, vector(128) ! blockidx%x threadidx%x
59, !$acc loop gang ! blockidx%y
CC 1.3 : 35 registers; 100 shared, 8 constant, 0 local memory bytes
CC 2.0 : 38 registers; 0 shared, 156 constant, 0 local memory bytes

But I get:
189, Loop is parallelizable <-- refers to i
Accelerator kernel generated
189, !$acc loop gang, vector(128) ! blockidx%x threadidx%x
CC 1.3 : 30 registers; 112 shared, 56 constant, 0 local memory bytes
CC 2.0 : 41 registers; 0 shared, 276 constant, 0 local memory bytes
191, Loop is parallelizable <-- refers to j

And when I finally profile my program, I see that this 2D loop runs sequentially rather than in parallel.

Now the strange part. If I apply OpenACC only to the part with the 2D loop,
for example:

function1(input arrays)
.
do i=1.etc
.
.
.


!$acc data copy(input and output arrays) , present_or_create(internal arrays)
!$acc kernels
!$acc loop independent
  do i=1,N
!$acc loop independent
    do j=1,M
      independent calculations..
    enddo
  enddo
!$acc end kernels
!$acc end data

do i=1.etc
.
.
.
end function1

I get the correct parallelization. I can’t figure out what is going on.

Thanks, Sotiris

Hi Sotiris,

Sorry, I can’t tell what the issue is from what you posted. Can you please post, or send to PGI Customer Service (trs@pgroup.com), a reproducing example which illustrates the issue?

Thanks,
Mat

Hi Mat, and thank you for the help you provide us,

I will post an example of that issue, but let me ask you something else first (about a different version of my code). Suppose I use the first case (all parts in parallel):

main function 
. 
call function1(input arrays) 
. 
end main 

function1(input arrays) 
. 
!$acc data copy(input and output arrays) , present_or_create(internal arrays) 
!$acc kernels 

!Then follow about 5 or 6 1D loops 
!$acc loop 
do i=1.etc 
. 
. 
. 

!$acc loop 
do i=1.etc 
... 

!And here is the problem. The first 2D loop
!$acc loop independent gang
  do i=1,N
!$acc loop independent gang vector
    do j=1,M
      independent calculations..
    enddo
  enddo


!and then again follow 1D loops 
!$acc loop 
do i=1.etc 
. 
. 
. 

!$acc end kernels 
!$acc end data 
end function1

but at the point where the double loop is placed, I call another function (function2(input arrays)) of the form


function2

!$acc kernels
!$acc loop independent gang 
  do i=1,N 
!$acc loop independent gang vector 
    do j=1,M 
 independent calculations.. 
   enddo 
  enddo 
!$acc end kernels

and inside my data region I call function2:

call function2(input_arrays)

In that case the compiler gives me the CORRECT parallelization (2D grid), BUT now I have another problem. Whenever function2 is called, data is moved from host to device and back, and of course I don’t need that because I’ve already placed my data on the device at the beginning of function1. In this case my program is correct (correct results), but all the time is spent in data movement. Is there a possible solution for that?

Thank you,
Sotiris

Mat, forget my previous post. I solved the problem by adding

present_or_create(my arrays)

inside function2(). This way I avoid the data movement, I get correct results, and my program works fine with the desired parallelization and execution time. As for my first post and the situation there, I will post the exact problem in the near future to give you a general idea.

Thanks,
Sotiris

Hi Sotiris,

If I understand this correctly, you have a function (function1) which contains an OpenACC data region and a compute region. Near the 2D loop nest, you have a function call (to function2) where you want to use the same data from the local arrays (the ones in the outer data region’s create clause).

Since data regions can span across multiple compute regions as well as host code including calls, you can have access to function1’s device data from within function2 provided that function2 is called from within a data region that created the array and you use one of the “present” clauses to tell the compiler that the data is already on the device.
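
A minimal sketch of this pattern, not Sotiris's actual code: the array names (a, work, w), the sizes, and the arithmetic are all assumptions made up for illustration. function1 opens the data region and creates work on the device; function2 receives work as a dummy argument and declares it present, so no allocation or transfer should be needed at the call.

```fortran
module present_sketch
  implicit none
contains

  ! Called from inside function1's data region. The dummy argument w is
  ! associated with the caller's array, so present(w) can find its device copy.
  subroutine function2(w, n, m)
    integer, intent(in)    :: n, m
    real,    intent(inout) :: w(n, m)
    integer :: i, j
    !$acc kernels present(w)
    !$acc loop independent
    do j = 1, m
      !$acc loop independent
      do i = 1, n
        w(i, j) = 2.0 * w(i, j)
      end do
    end do
    !$acc end kernels
  end subroutine function2

  subroutine function1(a, n, m)
    integer, intent(in)    :: n, m
    real,    intent(inout) :: a(n, m)
    real, allocatable :: work(:, :)   ! hypothetical internal array
    integer :: i, j
    allocate(work(n, m))
    !$acc data copy(a) create(work)
    !$acc kernels
    do j = 1, m
      do i = 1, n
        work(i, j) = a(i, j) + 1.0
      end do
    end do
    !$acc end kernels

    ! Host-side call made while the data region is still open.
    call function2(work, n, m)

    !$acc kernels
    do j = 1, m
      do i = 1, n
        a(i, j) = work(i, j)
      end do
    end do
    !$acc end kernels
    !$acc end data
    deallocate(work)
  end subroutine function1

end module present_sketch

program demo
  use present_sketch
  implicit none
  real :: a(4, 3)
  a = 1.0
  call function1(a, 4, 3)
  ! Every element was computed as (1.0 + 1.0) * 2.0.
  print *, a(1, 1)
end program demo
```

Built without OpenACC support, the directives are treated as comments and the program runs on the host, which makes it easy to check the results before looking at the accelerator feedback messages.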

- Mat

Hi Mat

> If I understand this correctly, you have a function (function1) which contains an OpenACC data region and a compute region.

That’s correct.

> Near the 2D loop nest, you have a function call (to function2) where you want to use the same data from the local arrays (the ones in the outer data region’s create clause).

The 2D loop nest is actually inside function2. And yes, I want to use the local arrays, which I “created” in the data region’s create clause.

> Since data regions can span across multiple compute regions as well as host code including calls, you can have access to function1’s device data from within function2 provided that function2 is called from within a data region that created the array and you use one of the “present” clauses to tell the compiler that the data is already on the device.

I profiled my program with the NVIDIA Visual Profiler and came to some conclusions.

- First, I’ve noticed that with the present_or_create clause, some time is spent allocating device memory for the local arrays whenever my program gets into function2. There is no data movement between host and device, only memory allocation on the device, which causes a time penalty. Of course this penalty is due to the “create” part of the clause, and it is smaller than the penalty from the data movement between host and device. So my program runs faster than before, but every time I reach function2 I pay a small time penalty.
- Second, if I use only the “present” clause, I get the following error:

FATAL ERROR: data in PRESENT clause was not found: name=myarray

So the local arrays can’t be found when I get inside function2. (Notice that I’m not passing these arrays as arguments of function2.) Why is that happening?

> So the local arrays can’t be found when I get inside function2. (Notice that I’m not passing these arrays as arguments of function2.) Why is that happening?

Without a real example I can’t tell. This just indicates that myarray can’t be found on the device, and therefore that it is not included in a data region that is active at that point.

Note that the host and device copies of an array are not associated by name. Hence, if you have a local array in one routine and another with the same name in a second routine, they are not the same array. The array used in a present clause must lie within the same host address range as the array in the data region clause. In fact, the names can be completely different, or one can be a subset of the other.
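
To illustrate the address-range point, here is a sketch under assumed names (big, part, and scale_part are hypothetical). The dummy argument part has a different name and covers only two columns of the caller's array, yet present(part) should still succeed, because its host addresses lie inside the range that the data region registered for big. By contrast, a separate local array declared inside the callee, even one also named big, has its own host address and would produce the "not found" error.

```fortran
module subset_sketch
  implicit none
contains

  ! part is a different name for a subset of the caller's array.
  subroutine scale_part(part, n, k)
    integer, intent(in)    :: n, k
    real,    intent(inout) :: part(n, k)
    integer :: i, j
    !$acc kernels present(part)
    do j = 1, k
      do i = 1, n
        part(i, j) = 10.0 * part(i, j)
      end do
    end do
    !$acc end kernels
  end subroutine scale_part

end module subset_sketch

program demo
  use subset_sketch
  implicit none
  integer, parameter :: n = 4, m = 6
  real :: big(n, m)
  big = 1.0
  !$acc data copy(big)
  ! Columns 3 and 4 are contiguous in memory. Assumption: the compiler
  ! passes this section by address, without making a temporary copy.
  call scale_part(big(:, 3:4), n, 2)
  !$acc end data
  ! Column 2 is untouched; column 3 was multiplied by 10.
  print *, big(1, 2), big(1, 3)
end program demo
```

Passing the array (or the relevant section) as an argument is what ties the callee's name to the host addresses already known to the data region.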

- Mat