I was experimenting with Fortran pointers in OpenACC and got a confusing result. The issue can be demonstrated with the following piece of code:
! pointer_test.f90
program main
  implicit none
  integer it, na
  real, allocatable, target :: w(:, :)
  real, pointer :: wp(:, :)
  real aa

  na = 256
  allocate(w(na, na))
  !$acc enter data create(w)

  do it = 1, 10
    wp => w
    !$acc data present(wp)
    !$acc kernels
    wp = 0.
    !$acc end kernels
    aa = 2. * it
    !$acc kernels
    wp = aa
    !$acc end kernels
    !$acc end data
  enddo

  do it = 1, 10
    wp => w
    !$acc kernels present(wp)
    wp = 0.
    !$acc end kernels
    aa = 2. * it
    !$acc kernels present(wp)
    wp = 1.
    !$acc end kernels
  enddo

  !$acc exit data delete(w)
end program
The loop is actually a time-stepping loop. I need to launch several kernels within each iteration, so I put these kernels into a structured data region (the first loop). According to this post https://forums.developer.nvidia.com/t/openacc-fortran-pointer-how-to-question/165949, I can access the device copy of "w" through a pointer by adding the pointer "wp" to the present clause.
I compile this program with:
nvfortran -acc -Minfo -r8 -O2 pointer_test.f90
The result is correct, but the profile shows that the data construct incurs some HtoD data movement at the beginning of every iteration, which degrades performance. It looks like the pointer itself is being copied from host to device each time.
One workaround is to move the present clause to the implicit data region of each kernels construct instead (the second loop). With that change, the profile shows no data movement at all.
What is the difference between the structured data construct and the implicit data region in this situation? Am I missing something about the structured data construct?
My HPC SDK version is 21.3. Attached is the profiling result obtained with Nsight Systems 2021.1.1.66.