OpenACC Fortran pointer in the structured data construct

I was experimenting with Fortran pointers in OpenACC and got a confusing result. The issue can be demonstrated by the following piece of code:

 ! pointer_test.f90
 program main
   implicit none
   integer it, na
   real, allocatable, target :: w(:, :)
   real, pointer :: wp(:, :)
   real aa
   na = 256
   allocate(w(na, na))
   !$acc enter data create(w)

   do it = 1, 10
     wp => w
     !$acc data present(wp)
     !$acc kernels
     wp = 0.
     !$acc end kernels

     aa = 2. * it
     !$acc kernels
     wp = aa 
     !$acc end kernels
     !$acc end data
   enddo
  

   do it = 1, 10
     wp => w
     !$acc kernels present(wp)
     wp = 0.
     !$acc end kernels

     aa = 2. * it
     !$acc kernels present(wp)
     wp = 1.
     !$acc end kernels
   enddo
   !$acc exit data delete(w)
 end program

The loop is actually a time-stepping loop. I need to launch several kernels in each iteration, so I put these kernels into a structured data region (the first loop). According to this post https://forums.developer.nvidia.com/t/openacc-fortran-pointer-how-to-question/165949, I can access the device copy of “w” through a pointer by adding the pointer “wp” to the present clause.
I compile this program with:
nvfortran -acc -Minfo -r8 -O2 pointer_test.f90
The results are correct, but the profile shows that the data construct incurs some HtoD data movement at the beginning of every iteration, which degrades performance. It looks like the pointer itself is being copied from host to device each time.

One workaround is to move the present clause to the implicit data region of each kernels construct instead (the second loop). With this change, the profile no longer shows any data movement.

What is the difference between the structured data construct and the implicit data construct in this situation? Am I missing something about the structured data construct?

My HPC SDK version is 21.3. Attached is the profiling result obtained with Nsight Systems 2021.1.1.66.


Best regards

Hi JieyunPan,

What’s happening is that the runtime updates the Fortran descriptor for the “wp” pointer each time it enters a data region. I believe we can optimize this update away when the present clause is on the kernels region, but I would need to ask our engineers if you need a more specific answer.

-Mat
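
If the descriptor update is tied to entering the data region, as described above, another variant worth trying is to enter the structured data region once, outside the time loop. The following is a minimal sketch and not a confirmed fix: it assumes that “wp” stays associated with “w” for the whole loop (which may not hold if the real code re-points “wp” every iteration), the file name is hypothetical, and it has not been profiled here, so whether it removes the per-iteration HtoD copies would need to be checked with Nsight Systems.

```fortran
! pointer_hoist.f90 (hypothetical file name)
program main
  implicit none
  integer :: it, na
  real, allocatable, target :: w(:, :)
  real, pointer :: wp(:, :)
  real :: aa

  na = 256
  allocate(w(na, na))
  !$acc enter data create(w)

  ! Assumption: the pointer target does not change inside the time loop,
  ! so the association and the data region can both be hoisted.
  wp => w

  ! The data region is entered once here instead of once per iteration.
  !$acc data present(wp)
  do it = 1, 10
    !$acc kernels
    wp = 0.
    !$acc end kernels

    aa = 2. * it
    !$acc kernels
    wp = aa
    !$acc end kernels
  enddo
  !$acc end data

  ! Copy the result back so it can be inspected on the host.
  !$acc update self(w)
  print *, 'max(w) = ', maxval(w)

  !$acc exit data delete(w)
end program
```

It can be built with the same command as the original test (nvfortran -acc -Minfo -r8 -O2). If “wp” has to be re-associated inside the loop, keeping the present clause on each kernels construct, as in the second loop of the original code, remains the variant already observed to avoid the transfers.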

Thanks, Mat. I will stick with the second method now that I know it’s a safe optimization. The lower-level cause may be hard for me to follow, so I don’t need a more specific answer.

Best regards
Jieyun