Data management with OpenMP 5 and devices

Hi,
I’m working on a small test case to reproduce an error in a very large Fortran code (> 1000 files), and I ran into some trouble understanding basic data management with OpenMP 5 directives. The code is attached.
array_m.F90 (18.3 KB)
Makefile (498 Bytes)

The first module, array_m, manages a user-defined type (r1_t) whose attribute val is a 1D array of double precision. The goal is to manage data at this level, creating a copy of the array on the device when it is allocated (line 161).

The second module, velocity_m, manages a user-defined type vel2_t with two attributes of type r1_t, so the device storage of the arrays is handled by the r1_t type in module array_m.

The last module, calcule_m, contains a single subroutine that computes an average in an offloaded loop. The interesting lines are:

    521              A=>AX%vel_new%val
    522              B=>BX%vel_new%val
    523              C=>CX%vel_new%val
    524 !$omp target update to(B,C)
    525 !$omp target
    526 !$omp parallel do private(i,j)
    527              do i=1, fin
    528                do j=1, 10
    529                 A(i)=0.5*(B(i)+C(i))
    530                enddo
    531              enddo
    532 !$omp end target
    533 !$omp target update from(A)

I do not understand why, at run time, B and C are updated (line 524) and then A, B and C are also pushed to the device at line 525:

Accelerator Kernel Timing data
/HA/sources/begou/SUB_ARRAY_OFFLOAD/TESTCASE_1/array_m.F90
  add_new_r1  NVIDIA  devicenum=0
    time(us): 74
    161: data region reached 6 times
        161: data copyin transfers: 12
             device time(us): total=74 max=12 min=5 avg=6
/HA/sources/begou/SUB_ARRAY_OFFLOAD/TESTCASE_1/array_m.F90
  moy_r1  NVIDIA  devicenum=0
    time(us): 1,223
    524: update directive reached 1 time
        524: data copyin transfers: 2
             device time(us): total=888 max=449 min=439 avg=444
    525: data region reached 2 times
        525: data copyin transfers: 3
             device time(us): total=18 max=7 min=5 avg=6
    533: update directive reached 1 time
        533: data copyout transfers: 1
             device time(us): total=317 max=317 min=317 avg=317

Moreover, at compile time nvfortran says:

nvfortran  -c -o array_m.o -O1 -g  -DY2_GPU -mp=gpu -gpu=cc80 -target=gpu -Minfo=accel array_m.F90
moy_r1:
    524, Generating update to(b(:),c(:))
    525, Generating implicit map(tofrom:a(:),c(:),b(:)) 
    533, Generating update from(a(:))

The “tofrom” reported by -Minfo=accel at line 525 does not bring back the A(:) array: at runtime only a copy in of the 3 arrays A(:), B(:) and C(:) is shown, and the update directive at line 533 is required.

This is of course a beginner question, but could someone explain this behavior or suggest some documentation? I already have the OpenMP 5.1 specification from November 2020 and the OpenMP Application Programming Interface Examples from June 2020, but they did not help me understand this.

Those are copies of the array descriptors, not the arrays themselves. This is clearer in the output produced by setting the environment variable “NV_ACC_NOTIFY=3”:

upload CUDA data  file=array_m.F90 function=moy_r1 line=524 device=0 threadid=1 variable=b bytes=8000000
upload CUDA data  file=array_m.F90 function=moy_r1 line=524 device=0 threadid=1 variable=c bytes=8000000
upload CUDA data  file=array_m.F90 function=moy_r1 line=525 device=0 threadid=1 variable=descriptor bytes=128
upload CUDA data  file=array_m.F90 function=moy_r1 line=525 device=0 threadid=1 variable=descriptor bytes=128
upload CUDA data  file=array_m.F90 function=moy_r1 line=525 device=0 threadid=1 variable=descriptor bytes=128
download CUDA data  file=array_m.F90 function=moy_r1 line=533 device=0 threadid=1 variable=a bytes=8000000

525, Generating implicit map(tofrom:a(:),c(:),b(:))

While the implicit mapping is added, since these arrays are already present on the device they are not actually copied.

Hope this clarifies things,
Mat

Thanks Mat,
this now clearly explains why the update of the A array after the kernel execution is required. So everything works as expected, without unneeded transfers.
And the very small “device time” for the copy at runtime (line 525) is not a cache effect but just the descriptor transfer time.
Patrick