Unstructured copyin vs create + update

Dear all,
While converting a subroutine in a large Fortran code to use OpenACC, I encountered behaviour that I do not understand. As the subroutine was large, I boiled it down to the following routine:

subroutine inner_sub(flux)
use f90_kind
implicit none

real(dp), dimension(:), intent(inout) :: flux

integer :: dir
integer :: elem_no

print *,'flux on host',flux(1:4)

! This does not appear to actually copy the content of flux to target
!!$acc enter data copyin(flux)

! These two lines instead work well
!$acc enter data create(flux)
!$acc update device(flux)

!$acc parallel default(none) present(flux)
!$acc loop independent private(elem_no,dir)
do elem_no=1,2
do dir=1,8
if ((elem_no==1).and.(dir==1)) then
print *,'FLUX on device'
print *,'flux',flux(1),flux(2),flux(3),flux(4)
endif
enddo
enddo
!$acc end loop
!$acc end parallel

!$acc exit data delete(flux)

end subroutine inner_sub

The routine is embedded in a module.

I cannot give the calling code, as it is a beast of a code and too difficult to trim down.

The code itself is a bit nonsensical, but it will do as an example.
I am using unstructured data regions, as that is where I need to go with the code anyway, and at this stage I just want to test whether the important subroutines work well under OpenACC.

So the problem is that the copying of flux does not appear to work. The flux values inside the loop are always 0.

If I replace the copyin by create + update commands, then it works as expected.

Is this behaviour expected? If so, can someone please explain?

PS. If I use this subroutine in a very simple code then it actually does work with copyin.

Thanks,
Dan.

Hi Dan,

Is “flux” in a higher level data region?

The “copy” clauses use “present_or” semantics, meaning that if the array is already on the device (i.e. “present”), no copy or create is done. So if it is in a higher level data region, the behavior makes sense.

“If I replace the copyin by create + update commands, then it works as expected.”

The create is likely being ignored as well, but the explicit update is synchronizing the data.

Typically I recommend a “top-down” approach to data management, meaning you use “enter data create” just after you allocate the arrays and “exit data delete” just before deallocation. This way the lifetime and scope of the device copy of the array match those of the host copy.

Then use the “update” directive to synchronize the two copies, which gives you control over when the data movement occurs.
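A minimal sketch of this top-down pattern is given below. The module name work, the routine scale_flux and the array size are made up for illustration and are not taken from Dan's code:

```fortran
module work
contains
  subroutine scale_flux(flux)
    implicit none
    real(8), dimension(:), intent(inout) :: flux
    integer :: i

    ! flux is expected to be on the device already; present() makes that explicit
    !$acc parallel loop present(flux)
    do i = 1, size(flux)
      flux(i) = 2.0d0*flux(i)
    enddo
  end subroutine scale_flux
end module work

program top_down
  use work
  implicit none
  real(8), dimension(:), allocatable :: flux

  allocate(flux(4))
  !$acc enter data create(flux)   ! device copy is created right after allocation

  flux = 1.0d0
  !$acc update device(flux)       ! host -> device, exactly when we ask for it

  call scale_flux(flux)

  !$acc update self(flux)         ! device -> host before using the result on the host
  print *, 'flux on host', flux

  !$acc exit data delete(flux)    ! device copy is removed just before deallocation
  deallocate(flux)
end program top_down
```

The routines lower down only state present(flux), so all data movement stays visible in one place at the top level.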

-Mat

Hi Mat,
There is no higher level data region, but the code is called multiple times. The first time it is called flux=0, so no difference would be evident anyway. Not sure if this explains anything. I put the delete statement at the end for that. Do things then still make sense?

I did things locally just to get started with adding OpenACC to the routines without touching the others.

I will follow your rule-of-thumb from now on.

Hmm, the only other thing would be if you had another subroutine with an “enter data” directive without a corresponding “exit data”; then that stack address would still be in the present table. That typically results in a “partially present” error if the sizes are different, but if they happen to be the same size then it’s a possibility.

What you can do is add a “use openacc”, then call “acc_present_dump()” before and after the “enter data copyin” directive.

This dumps the present table so we can see if the host address associated with flux is in the present table or not.
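As a rough sketch, the check could be placed in the routine like this. It assumes the NVIDIA HPC compilers, since as far as I know acc_present_dump is an extension there rather than part of the OpenACC standard:

```fortran
subroutine inner_sub(flux)
  use openacc
  implicit none
  real(8), dimension(:), intent(inout) :: flux

  ! First dump: is the host address of flux already in the present table?
  call acc_present_dump()

  !$acc enter data copyin(flux)

  ! Second dump: if flux was already listed in the first dump, it was
  ! present and the copyin above did not transfer any data
  call acc_present_dump()

  !$acc exit data delete(flux)
end subroutine inner_sub
```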

Another helpful debugging technique is to set the environment variable “NV_ACC_NOTIFY=2”, which will show the data transfers.

NV_ACC_NOTIFY has the following settings:

  • 1: kernel launches
  • 2: data transfers
  • 4: wait operations or synchronizations with the device
  • 8: region entry/exit
  • 16: data allocate/free

It’s a bit-mask, so the values can be combined, e.g. “3” would be both kernel launches and data transfers.
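To illustrate how the values combine, here is a small stand-alone sketch. The parameter names are made up for illustration; only the numeric values come from the list above:

```fortran
program notify_mask
  implicit none
  ! Values from the list above; the names are illustrative only
  integer, parameter :: launches = 1, transfers = 2, waits = 4
  integer, parameter :: regions = 8, allocs = 16
  integer :: mask

  ! Combine settings by OR-ing (equivalently, adding) their values
  mask = ior(launches, transfers)
  print *, 'kernel launches + data transfers:', mask

  mask = ior(transfers, allocs)
  print *, 'data transfers + allocate/free  :', mask

  ! Test whether a given value enables a particular setting
  print *, 'does 3 include data transfers?  ', iand(3, transfers) /= 0
  print *, 'does 3 include region entry/exit?', iand(3, regions) /= 0
end program notify_mask
```

So to see both data transfers and allocations, for example, the variable would be set to 2 + 16 = 18.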

Hi Mat,

Thanks so far.
I need some time to dive deeper given your hints. Will get back to this in a few days.

Dan.

Before diving further into the issue I did see something during debugging that I do not understand.

I have NV_ACC_NOTIFY set to 2

Before ANY acc statements are executed I call acc_present_dump() in main. Note that there are also no acc statements at all in the modules to make variables global.

I get:

Present table dump
…uninitialized…
upload CUDA data file=/home/dlathouwers/BitBucket_codes/bug_no_copyin/src/main.f90 function=phantom_sn line=130 device=0 threadid=1 variable=_mat_comp_21 bytes=432
upload CUDA data file=/home/dlathouwers/BitBucket_codes/bug_no_copyin/src/main.f90 function=phantom_sn line=130 device=0 threadid=1 variable=_sn_data_21 bytes=184

So it somehow uploads these to the target, but I did not ask for it!
mat_comp and SN_data (without the underscore) are modules in the code.

Also, the line number mentioned is the line in main that contains “!$acc enter data create(flux)”, placed just after the allocation of flux (as recommended).

AFTER the create line I make the call again and get the same + flux has been uploaded (as expected).

Why is it doing this? Any ideas?

The “_21” suffix indicates a module variable. Are you using a “declare” directive?

If not, then these module variables are likely getting implicitly copied when entering a compute region. If you add “-Minfo=accel” you can see the compiler feedback messages, which will tell you if it’s implicitly copying data.

Would you be able to share the full program? It might be easier for me to see the full code.

-Mat

There is no declare.

The only routines that have acc directives are main (for create(flux)) and a subroutine containing the loop. The subroutine uses neither mat_comp nor SN_data.

I need quite a bit of time to reduce the code down to the essentials only. Will keep you posted for an example worthy of sending in.

Danny.

“flux” is in a higher level data region? If so, then that’s the most likely reason.

Here’s your small example with a main that has an outer data region. Can you modify it to replicate what the full program is doing? That might be easier than paring down your full code.

module foo
contains
subroutine inner_sub(flux)
implicit none

real(8), dimension(:), intent(inout) :: flux

integer :: dir
integer :: elem_no

print *,"flux on host",flux(1:4)

!$acc enter data copyin(flux)
!$acc parallel default(none) present(flux)
!$acc loop independent private(elem_no,dir)
do elem_no=1,2
do dir=1,8
if ((elem_no==1).and.(dir==1)) then
print *,"FLUX on device"
print *,"flux",flux(1),flux(2),flux(3),flux(4)
endif
enddo
enddo
!$acc end loop
!$acc end parallel
!$acc exit data delete(flux)

end subroutine inner_sub
end module foo

program test
use foo
real(8), dimension(:), allocatable :: flux
allocate(flux(4))
!$acc enter data create(flux)
flux=1
call inner_sub(flux)
!$acc exit data delete(flux)
deallocate(flux)

end program test

Just to keep this thread alive.
Will post an example later this weekend.

Dan.

So I had the time to investigate further. As usual, it all turned out to be a giant ghost hunt.

The situation is as below. We have a main with an array that is created and updated on the GPU.
Then a subroutine, group, is called, which in turn calls another subroutine multiple times in a loop.
That routine is a reverse communication solver, where various actions are requested from the calling routine (group). Inside the reverse communication routine, pointers are attached to arrays to make life ‘easier’.
The requested actions are performed on the pointers (v and w).

The problem I faced and misunderstood was that in some steps the pointers were pointing to arrays that were not on the device.

What also did not help in finding this problem is that the name flux is also used in the argument list of some routines (e.g. do_something as below).

Anyway. The situation is clear and solved. Just a lot of confusion going on on my side.

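program main
real, allocatable, dimension(:) :: flux

! ... allocate and initialise flux ...

!$acc enter data create(flux)
!$acc update device(flux)

call group(flux)

end program main

subroutine group(flux)
real, dimension(:) :: flux

integer :: task
real, pointer :: v(:),w(:)

do
! revcom attaches v and w to the arrays the requested task must act on;
! in some steps these were arrays that were not on the device
call revcom(v,w,flux,task)

if (task==1) then
call do_something(v,w)
endif

! other tasks or exit of loop

enddo

end subroutine group

! The first dummy argument is also called flux, although the actual argument is v
subroutine do_something(flux,w)
real, dimension(:) :: flux,w
! …
end subroutine do_something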
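For anyone hitting something similar, a guard along the following lines, called before the compute region that works on the pointers, might flag the problem early. It is only a sketch: the module name device_checks and the routine name check_on_device are made up for illustration, and it assumes that acc_is_present from the openacc module accepts pointer arrays with the compiler in use.

```fortran
module device_checks
contains
  subroutine check_on_device(v, w)
    use openacc
    implicit none
    real, pointer, intent(in) :: v(:), w(:)

    ! acc_is_present returns .true. if the data is in the present table
    if (.not. acc_is_present(v)) print *, 'v points to data that is not on the device'
    if (.not. acc_is_present(w)) print *, 'w points to data that is not on the device'
  end subroutine check_on_device
end module device_checks
```

In the sketch above it would be called as “call check_on_device(v,w)” in group, right after the call to revcom.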