OpenACC keeps transferring data between host and device even after declaring !$acc data default(present)

Dear developers,

I am encountering a very weird situation. I have a large scientific code that I want to parallelise using OpenACC. My strategy is straightforward: copy all arrays to the GPU with !$ACC DATA COPYIN at the very beginning and do every computation on the GPU. Here is a lightweight version of the code:


PROGRAM OFFLOAD
USE OMP_LIB
USE DEFINITION
IMPLICIT NONE

! integer !
integer :: j, k, l, i, n

! Check timing with or without openmp
INTEGER :: time_start, time_end
INTEGER :: cr, cm
REAL*8 :: rate

!!!

CALL OMP_SET_NUM_THREADS(64)

!!!

CALL system_clock(count_rate=cr)
CALL system_clock(count_max=cm)
rate = REAL(cr)

!!!

CALL INITIAL

!!!

!$acc data copyin(cons, prim, flux)

!!!

CALL system_clock(time_start)

DO n = 1, 100
WRITE (*,*) n
CALL UtoF
END DO

CALL system_clock(time_end)
WRITE (*,*) 'Preparation = ', REAL(time_end - time_start) / rate

!!!

!$acc end data

!!!

! check answer !
WRITE (*,*) prim(1,2,3,4)

!!!

END PROGRAM


And


SUBROUTINE INITIAL
USE OMP_LIB
USE DEFINITION
IMPLICIT NONE

! integer !
integer :: j, k, l

DO j = -2, nx_2 + 3
DO k = -2, ny_2 + 3
DO l = -2, nz_2 + 3
cons(imin2:imax2,j,k,l) = 3.0d0
prim(imin2:imax2,j,k,l) = 4.0d0
END DO
END DO
END DO

END SUBROUTINE


And


SUBROUTINE UtoF
USE OMP_LIB
USE DEFINITION
IMPLICIT NONE

! integer !
integer :: j, k, l

!$acc data present(cons, flux, prim)

!$OMP PARALLEL DO COLLAPSE(3) SCHEDULE(STATIC)
!$acc parallel loop gang
DO j = -2, nx_2 + 3
!$acc loop worker
DO k = -2, ny_2 + 3
!$acc loop vector
DO l = -2, nz_2 + 3
flux(imin2:imax2,j,k,l) = cons(imin2:imax2,j,k,l)**2 + prim(imin2:imax2,j,k,l)
END DO
END DO
END DO
!$acc end parallel loop
!$OMP END PARALLEL DO

!$acc end data

END SUBROUTINE


Here is the thing. I expect no data transfer between the host and device once I declare !$ACC DATA COPYIN in the main program. But when I profile my program using nvprof, I see data transfers between host and device exactly at the beginning and the end of subroutine UtoF, where I explicitly declared default(present).

Is there any way to avoid this unwanted data transfer?

Thanks!

Hi HydroHLLCFV,

Since the code is incomplete, I can’t be sure, but my best guess is that it might be the array descriptors getting updated.

To determine exactly what’s getting updated, set the environment variable “NV_ACC_NOTIFY=2”. With this set, the runtime prints a line each time data is transferred and, unlike nvprof/nsys, includes the variable name.

Note that NV_ACC_NOTIFY has the following settings:

  • 1: kernel launches
  • 2: data transfers
  • 4: wait operations or synchronizations with the device
  • 8: region entry/exit
  • 16: data allocate/free

It’s a bit-mask, so the values can be combined, e.g. “3” would enable both kernel launches and data transfers.
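
As a rough sketch of how the mask might be built and passed to a single run: the flag values are the ones listed above, while the variable names and the executable name ./offload are placeholders chosen for illustration, not something from the original code.

```shell
#!/bin/sh
# Flag values as listed above (the variable names are illustrative only).
NOTIFY_LAUNCH=1      # kernel launches
NOTIFY_TRANSFER=2    # data transfers

# Combine the flags with a bitwise OR to report both kinds of event.
mask=$(( $NOTIFY_LAUNCH | $NOTIFY_TRANSFER ))
echo "NV_ACC_NOTIFY=$mask"

# Hypothetical executable name: replace with the actual binary.
exe=./offload
if [ -x "$exe" ]; then
    # Set the variable for this one run only, without exporting it.
    NV_ACC_NOTIFY=$mask "$exe"
fi
```

Setting the variable on the command line of a single run, rather than exporting it, keeps the extra output from showing up in later runs by accident.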

-Mat

Hi Mat,

Thank you for your informative reply. After using NV_ACC_NOTIFY=2, I found that my data transfer management works completely fine, so the code must be slow because of something else.

Thanks a lot!