I was experimenting with Fortran pointers in OpenACC and got a confusing result. The following code demonstrates the issue:
! pointer_test.f90
program main
  implicit none
  integer it, na
  real, allocatable, target :: w(:, :)
  real, pointer :: wp(:, :)
  real aa

  na = 256
  allocate(w(na, na))
  ! Create the device copy of w once, before the time-stepping loops
  !$acc enter data create(w)

  ! First loop: kernels enclosed in a structured data region
  do it = 1, 10
    wp => w
    !$acc data present(wp)
    !$acc kernels
    wp = 0.
    !$acc end kernels
    aa = 2. * it
    !$acc kernels
    wp = aa
    !$acc end kernels
    !$acc end data
  enddo

  ! Second loop: present clause placed on each kernels construct
  do it = 1, 10
    wp => w
    !$acc kernels present(wp)
    wp = 0.
    !$acc end kernels
    aa = 2. * it
    !$acc kernels present(wp)
    wp = 1.
    !$acc end kernels
  enddo

  !$acc exit data delete(w)
end program
The loop is actually a time-stepping loop. I need to launch several kernels in each iteration, so I put these kernels into a structured data region (the first loop). According to this post (https://forums.developer.nvidia.com/t/openacc-fortran-pointer-how-to-question/165949), I can access the device copy of "w" through a pointer by adding the pointer "wp" to the present clause.
I compile this program with:
nvfortran -acc -Minfo -r8 -O2 pointer_test.f90
The result is correct, but the profile shows that the data construct incurs some HtoD data movement at the beginning of every iteration, which degrades performance. It looks like the pointer itself is being copied from host to device each time.
One workaround is to move the present clause onto each kernels construct instead, so that it applies to that kernel's implicit data region (the second loop). With this change, the profile no longer shows any data movement.
What is the difference between the structured data construct and the implicit data region in this situation? Am I missing something about the structured data construct?
My HPC SDK version is 21.3. Attached is the profiling result obtained with Nsight Systems 2021.1.1.66.
Best regards