OpenMP map(present ...) support

I am trying to use map(present, ...) in our OpenMP directives, but it appears to cause parsing errors. The following example is from the GFortran test suite (map-11.f90):

program main
  implicit none
  integer, parameter :: N = 1000
  integer :: a(N), b(N), c(N), i

  ! Should be able to parse 'present' map modifier.
  !$omp target enter data map (present, to: a, b)

  !$omp target data map (present, to: a, b) map (always, present, from: c)
    !$omp target map (present, to: a, b) map (present, from: c)
      do i = 1, N
        c(i) = a(i) + b(i)
      end do
    !$omp end target
  !$omp end target data

  !$omp target exit data map (always, present, from: c)

  ! Map clauses with 'present' modifier should go ahead of those without.
  !$omp target map (to: a) map (present, to: b) map (from: c)
    do i = 1, N
      c(i) = a(i) + b(i)
    end do
  !$omp end target
end program

When I compile it with Nvfortran 24.5 using the following flags:

$ nvfortran -mp -Minfo=all map-11.f90

then I get the following errors:

NVFORTRAN-S-0034-Syntax error at or near : (map-11.f90: 7)
NVFORTRAN-S-0034-Syntax error at or near : (map-11.f90: 9)
NVFORTRAN-S-0034-Syntax error at or near : (map-11.f90: 10)
NVFORTRAN-S-0034-Syntax error at or near , (map-11.f90: 17)
NVFORTRAN-S-0034-Syntax error at or near : (map-11.f90: 20)

Is map(present ...) currently supported? If not, are there any plans to add support for this clause modifier? (AFAIK it was introduced in OpenMP 5.2).

Hi Marshall,

Is map(present ...) currently supported? If not, are there any plans to add support for this clause modifier? (AFAIK it was introduced in OpenMP 5.2).

I think it was 5.1, but we only support a subset of 5.0, as documented here: HPC Compilers User's Guide Version 24.11 for ARM, OpenPower, x86

We’re in a transition phase right now where no new features are being added to nvfortran. Instead, we have teamed with the LLVM community to create a new flang compiler, built from the ground up, which will eventually replace the current nvfortran. Unfortunately, I don’t know their plans for OpenMP in the initial release, but I suspect they’ll work on later OpenMP standards sometime after that.

The ‘present’ modifier isn’t really needed for most programs, as it only triggers an error if the data isn’t present on the device. The semantics of the map clause don’t change when the variable is present; it’s only when the data isn’t present that you’d get an extra copy.
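To illustrate, here is a minimal sketch (assuming an OpenMP 5.1-capable compiler, which per the above nvfortran 24.5 is not): the two target regions behave identically because `a` is already mapped, and the `present` version would only differ by aborting at runtime if the enter-data directive were removed.

```fortran
program present_demo
  implicit none
  integer :: a(100)
  a = 1

  !$omp target enter data map(to: a)

  ! 'a' is already present, so this map performs no transfer
  ! ("present_or" behavior):
  !$omp target map(to: a)
  a(1) = 2
  !$omp end target

  ! Identical behavior here; but if 'a' were NOT present, this
  ! would raise a runtime error instead of silently copying:
  !$omp target map(present, to: a)
  a(2) = 3
  !$omp end target

  !$omp target exit data map(from: a)
end program present_demo
```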

-Mat

Thank you Mat, maybe you can help me with a related question.

If we define a subroutine in a file containing this loop:

  !$omp target
  !$omp parallel loop
  do j = js, je ; do i = is, ie
    ! loop calculation...
  enddo ; enddo
  !$omp end target

using these flags

nvfortran -mp=gpu -Minfo=all ...

Then I see the following output

   1040, !$omp target parallel loop
       1040, Generating "nvkernel_mom_eos_wright_calculate_density_derivs_2d_buggy_wright__F1L1040_2" GPU kernel
             Generating NVIDIA GPU code
         1044, Loop parallelized across teams ! blockidx%x
               Loop run sequentially
       1040, Generating Multicore code
         1044, Loop parallelized across threads
   1040, Generating implicit map(tofrom:this,t(:,:),s(:,:),pressure(:,:),drho_dt(:,:),drho_ds(:,:))
   1044, Loop is parallelizable

My question is about the implicit map(tofrom:...) in the output. Will this function always do a copy from host to device?

When we were using OpenACC, we assumed that the data was always on the GPU and used !$acc kernels present(...) to suppress these implicit-transfer messages. But perhaps that was not necessary?

I can post the complete example (or an MRE) if you would find it more useful.

OpenMP uses the same “present_or” semantics as OpenACC, i.e. if the data is already present, then no action is taken. So in this case the implicit map clauses would only get copied if the variables are not present on the device.

Like you, I prefer using OpenACC’s “present” clause or “default(present)” for clarity, but you could just as easily use “copy” instead with no difference in behavior, assuming the data is present.

Under the hood, our OpenMP and OpenACC runtime share the same data management code so behave the same.
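As a rough cheat sheet (my own pairing, not from the compiler docs), the data directives in the two models line up like this:

```fortran
! Approximate OpenACC <-> OpenMP data-directive equivalences.
! Both models follow "present_or" semantics: if the variable is
! already present on the device, no allocation or copy occurs.

!$acc enter data copyin(a)   ! ~  !$omp target enter data map(to: a)
!$acc enter data create(a)   ! ~  !$omp target enter data map(alloc: a)
!$acc exit data copyout(a)   ! ~  !$omp target exit data map(from: a)
!$acc exit data delete(a)    ! ~  !$omp target exit data map(delete: a)
!$acc kernels copy(a)        ! ~  !$omp target map(tofrom: a)
```

OpenACC’s present(a) has no direct nvfortran OpenMP counterpart today; OpenMP 5.1’s map(present, ...) modifier is what fills that role.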

I’ve come up with a test code which shows what I am trying to accomplish. The repository is available here:

It is a function containing a loop, which is offloaded to the GPU by either OpenACC or OpenMP:

With OpenACC, the loop is

    !$acc data present(x, y, z)
    !$acc kernels
    do i = 1, n
      z(i) = x(i) + y(i)
    enddo
    !$acc end kernels
    !$acc end data

And is called as

!$acc enter data copyin(x, y) create(z)

do k = 1, nk
  call test_subrt(x, y, z)
enddo

!$acc exit data copyout(z)

When profiled with nvprof, there are two copies to the GPU and one off the GPU, regardless of nk.

 Total (MB)  Count  Avg (MB)  Med (MB)  Min (MB)  Max (MB)  StdDev (MB)           Operation          
 ----------  -----  --------  --------  --------  --------  -----------  ----------------------------
      0.800      2     0.400     0.400     0.400     0.400        0.000  [CUDA memcpy Host-to-Device]
      0.400      1     0.400     0.400     0.400     0.400        0.000  [CUDA memcpy Device-to-Host]

I would like to achieve the same thing with OpenMP, but cannot figure out how to define the directives. The function is

    !$omp target
    !$omp parallel loop
    do i = 1, n
      z(i) = x(i) + y(i)
    enddo
    !$omp end target

The function is called as

!$omp target enter data map(to: x, y)

do k = 1, nk
  call test_subrt(x, y, z)
enddo

!$omp target exit data map(from: z)

The data transfer for nk=10 is

 Total (MB)  Count  Avg (MB)  Med (MB)  Min (MB)  Max (MB)  StdDev (MB)           Operation          
 ----------  -----  --------  --------  --------  --------  -----------  ----------------------------
      4.800     12     0.400     0.400     0.400     0.400        0.000  [CUDA memcpy Host-to-Device]
      4.000     10     0.400     0.400     0.400     0.400        0.000  [CUDA memcpy Device-to-Host]

That is, there is a transfer on every function call (2 + nk to the GPU, nk from the GPU).

Is there a clause that I can use to prevent these per-call transfers in test_subrt() with OpenMP?

Somewhat unrelated question, but why are the OpenMP transfer totals 4.8 and 4.0 MB? If there were one extra transfer of each array per call, I would have expected 8.8 and 4.4 MB. (I could imagine the final z copy being optimized out, but not the per-call x and y copies.)

Hi Marshall,

The problem is that “z” is missing from the data region, so it needs to be copied each time you enter the compute region. Each target region implicitly maps z as tofrom, which costs one host-to-device and one device-to-host transfer per call; x and y are already present, so they aren’t copied again. That accounts for the 2 + nk = 12 host-to-device and nk = 10 device-to-host transfers you measured. You also missed deleting x and y, so they’d remain on the device.

Here’s the fix:

!$omp target enter data map(to: x, y) map(alloc:z)
!$acc enter data copyin(x, y) create(z)

do k = 1, nk
  call test_subrt(x, y, z)
enddo

!$omp target exit data map(from: z) map(delete: x, y)
!$acc exit data copyout(z) delete(x,y)

and the output from my nsys profile:

 Total (MB)  Count  Avg (MB)  Med (MB)  Min (MB)  Max (MB)  StdDev (MB)           Operation
 ----------  -----  --------  --------  --------  --------  -----------  ----------------------------
      0.800      2     0.400     0.400     0.400     0.400        0.000  [CUDA memcpy Host-to-Device]
      0.400      1     0.400     0.400     0.400     0.400        0.000  [CUDA memcpy Device-to-Host]

Hope this helps,
Mat

Thank you! The map(alloc: z) was the missing piece. I think I had some misunderstandings about present(), but the main point is that it is not required to limit data transfers.

Using present() in the function did help produce a more meaningful runtime error when I forgot to add create(z); I think that just means we need to rely more on our testing when using OpenMP.
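For the record, once a compiler supports the OpenMP 5.1 modifier, the same early-failure check you get from OpenACC's present() could presumably be written as follows (untested, given the parse errors above; names n, x, y, z are from the example):

```fortran
! Hypothetical OpenMP 5.1 analogue of !$acc data present(x, y, z):
! map(present, alloc: ...) takes no action when the data is already
! mapped, and aborts with a runtime error when it is not.
!$omp target data map(present, alloc: x, y, z)
!$omp target
!$omp parallel loop
do i = 1, n
  z(i) = x(i) + y(i)
enddo
!$omp end target
!$omp end target data
```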

I think we can safely move forward with either option in Nvidia, thanks again!
