GPU offload with OpenMP using NVFORTRAN fails in some cases

Hi,

I'd like to report that a triple loop containing the Fortran intrinsic function sum runs fine when compiled with NVFORTRAN with OpenACC enabled, but not when OpenMP GPU offload is enabled.

The following sample program (sum_test.f90) contains a triple loop that does not run when compiled with only OpenMP enabled, but runs fine when only OpenACC is enabled.

program sum_test
  implicit none
  integer,parameter :: max_array=100, &
                       max_loop=100
  integer :: i,j,k
  real(8) :: a(max_array),b
  integer :: jsta,jend,ksta,kend

  jsta=1
  jend=100
  ksta=1
  kend=100

  a=1.d0
  b=0.d0

!$acc kernels loop collapse(3) reduction(+:b)
!$omp target teams loop reduction(+:b) collapse(3)&
!$omp defaultmap(tofrom:scalar)
  do i=1,max_loop
    do j=jsta,jend
      do k=ksta,kend
        b = b + sum( a )
      enddo
    enddo
  enddo
!$acc end kernels loop
!$omp end target teams loop

  write(6,*) b

end program sum_test
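
For reference, the expected result is b = max_loop * (jend-jsta+1) * (kend-ksta+1) * sum(a) = 100**3 * 100 = 1.0E+08, since each of the 100**3 iterations adds sum(a) = 100.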

The compilation was done with

nvfortran -mp=gpu sum_test.f90

The NVIDIA HPC SDK versions I tried are 22.11 and 23.11. In both cases the compilation succeeds, but the computation never completes, and no error message is printed.
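
For comparison, the OpenACC-only build that runs correctly was done along these lines (assuming the default GPU target):

nvfortran -acc=gpu sum_test.f90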

Please let me know if you notice anything.
Thank you in advance.

Hi Sotas and welcome!

It looks like the problem is with the “defaultmap(tofrom:scalar)” clause.

When I add the compiler feedback flag, “-Minfo=mp”, it appears that all scalars, including the loop index variables, become shared:

% nvfortran test.f90 -mp=gpu -Minfo=mp
sum_test:
     18, !$omp target teams loop
         18, Generating "nvkernel_MAIN__F1L18_2" GPU kernel
             Generating NVIDIA GPU code
           20, Loop parallelized across teams collapse(3) ! blockidx%x
           21,   ! blockidx%x collapsed
           22,   ! blockidx%x collapsed
           23, Loop parallelized across threads(96) ! threadidx%x
               Generating implicit reduction(+:a$r)
           20, Generating reduction(+:b)
         18, Generating Multicore code
           20, Loop parallelized across threads
               Generating reduction(+:b)
     18, Generating map(tofrom:b,i,jend,j,jsta,ksta,kend,k)
         Generating implicit map(tofrom:a(:))
     23, Loop is parallelizable

Loop indices must be private; otherwise there is a race condition and the behavior is undefined.

I’m talking with our OpenMP team to confirm that my assessment of what’s going on is correct, and if so, whether i, j, and k should remain private. If the standard really is defined this way, then this clause is basically unusable, so I’m leaning towards a compiler issue. I’ll let you know what they say and submit a report if the problem is on our end.

The workaround is to use “firstprivate” instead of “tofrom” for the default map, or to leave the clause off entirely, since firstprivate is already the default for scalars.
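
For example, here is a sketch of the directive with the defaultmap category switched to firstprivate (assuming the OpenMP 5.0 “firstprivate” category is accepted here); the loop nest itself is unchanged from your reproducer:

!$omp target teams loop reduction(+:b) collapse(3)&
!$omp defaultmap(firstprivate:scalar)
  do i=1,max_loop
    do j=jsta,jend
      do k=ksta,kend
        b = b + sum( a )
      enddo
    enddo
  enddo
!$omp end target teams loop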

-Mat

Our OpenMP team confirmed that it’s a compiler issue, so I’ve filed a report, TPR #37105.

Hi Mat,

Thank you very much.
I appreciate your confirmation and the report.

Hi sotas,

FYI, TPR #37105 was fixed in our 25.3 release just out today.

-Mat

Hi Mat,

I’m grateful for your help with this.