Data management with OpenMP 5 and devices

Hi,
I’m working on a small test case to reproduce an error in a very large Fortran code (> 1000 files), and I ran into some trouble understanding basic data management with OpenMP 5 directives. The code is attached.
array_m.F90 (18.3 KB)
Makefile (498 Bytes)

The first module, array_m, manages a user-defined type (r1_t) whose attribute val is a 1D array of double precision. The goal is to manage data at this level, creating a copy of the array on the device when it is allocated (line 161).

The second module, velocity_m, manages a user-defined type vel2_t with two attributes of type r1_t, so the device storage of the arrays is handled by the r1_t type in module array_m.

The last module, calcule_m, contains a single subroutine that computes an average in an offloaded loop. The interesting lines are:

    521              A=>AX%vel_new%val
    522              B=>BX%vel_new%val
    523              C=>CX%vel_new%val
    524 !$omp target update to(B,C)
    525 !$omp target
    526 !$omp parallel do private(i,j)
    527              do i=1, fin
    528                do j=1, 10
    529                 A(i)=0.5*(B(i)+C(i))
    530                enddo
    531              enddo
    532 !$omp end target
    533 !$omp target update from(A)

I do not understand why, at run time, B and C are updated (line 524) and then A, B and C are also pushed to the device at line 525:

Accelerator Kernel Timing data
/HA/sources/begou/SUB_ARRAY_OFFLOAD/TESTCASE_1/array_m.F90
  add_new_r1  NVIDIA  devicenum=0
    time(us): 74
    161: data region reached 6 times
        161: data copyin transfers: 12
             device time(us): total=74 max=12 min=5 avg=6
/HA/sources/begou/SUB_ARRAY_OFFLOAD/TESTCASE_1/array_m.F90
  moy_r1  NVIDIA  devicenum=0
    time(us): 1,223
    524: update directive reached 1 time
        524: data copyin transfers: 2
             device time(us): total=888 max=449 min=439 avg=444
    525: data region reached 2 times
        525: data copyin transfers: 3
             device time(us): total=18 max=7 min=5 avg=6
    533: update directive reached 1 time
        533: data copyout transfers: 1
             device time(us): total=317 max=317 min=317 avg=317

Moreover, at compile time nvfortran says:

nvfortran  -c -o array_m.o -O1 -g  -DY2_GPU -mp=gpu -gpu=cc80 -target=gpu -Minfo=accel array_m.F90
moy_r1:
    524, Generating update to(b(:),c(:))
    525, Generating implicit map(tofrom:a(:),c(:),b(:)) 
    533, Generating update from(a(:))

The “tofrom” reported by -Minfo=accel at line 525 does not bring back the A(:) array: at runtime only a copy in of the 3 arrays A(:), B(:) and C(:) is shown, and the update directive at line 533 is required.

This is of course a beginner question, but could someone explain this behavior or suggest some documentation? I already have the OpenMP 5.1 specification from November 2020 and the OpenMP Application Programming Interface Examples from June 2020, but they did not help me understand this.

Those are copies of the array descriptors, not the arrays themselves. This is clearer in the output produced by setting the environment variable “NV_ACC_NOTIFY=3”:

upload CUDA data  file=array_m.F90 function=moy_r1 line=524 device=0 threadid=1 variable=b bytes=8000000
upload CUDA data  file=array_m.F90 function=moy_r1 line=524 device=0 threadid=1 variable=c bytes=8000000
upload CUDA data  file=array_m.F90 function=moy_r1 line=525 device=0 threadid=1 variable=descriptor bytes=128
upload CUDA data  file=array_m.F90 function=moy_r1 line=525 device=0 threadid=1 variable=descriptor bytes=128
upload CUDA data  file=array_m.F90 function=moy_r1 line=525 device=0 threadid=1 variable=descriptor bytes=128
download CUDA data  file=array_m.F90 function=moy_r1 line=533 device=0 threadid=1 variable=a bytes=8000000

525, Generating implicit map(tofrom:a(:),c(:),b(:))

While the implicit mapping is added, since these arrays are already present on the device they are not actually copied.

Hope this clarifies things,
Mat

Thanks Mat,
this now clearly explains why the update of the A array after the kernel execution is required. So everything works as expected, without unneeded transfers.
And the very small “device time” for the copy at runtime (line 525) is not a cache effect but just the descriptor transfer time.
Patrick