Fortran OpenACC array reduction

mikostul · September 12, 2022, 8:45pm

I am trying to use an array reduction instead of atomics on this loop. The code compiles fine but hits a segmentation fault when I run it. Is this kind of loop not allowed?

!$acc parallel loop collapse(2) default(present) reduction(+:fn,fs)
do k=2,npm-1
  do i=1,nblk
    fn(i) = fn(i) + flux_t(i, 2,k)*dp(k)
    fs(i) = fs(i) + flux_t(i,ntm1,k)*dp(k)
  enddo
enddo

Any help would be appreciated.

Miko

MatColgrove · September 12, 2022, 9:25pm

Hi Miko,

Is it a seg fault (host) or illegal address error (device)?

How big are the fn and fs arrays?

In order to do a reduction, each thread gets a complete private copy of each array, allocated as one large block of memory. Depending on how the compiler schedules the loop, that can mean as many as (npm-2)*nblk threads, each holding private copies the size of the arrays.
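To put rough numbers on that worst case, here is a small sketch. The sizes are hypothetical (they are not Miko's actual values), and it assumes real64 arrays and one private copy of both fn and fs per thread, which is an upper bound rather than what the compiler necessarily allocates.

```fortran
program private_copy_estimate
  use iso_fortran_env, only: int64, real64
  implicit none
  ! Hypothetical sizes, chosen only to illustrate the arithmetic.
  integer(int64), parameter :: npm = 1025_int64, nblk = 1000_int64
  integer(int64) :: nthreads, bytes_per_array, total_bytes

  ! Worst case: one thread per iteration of the collapsed k/i loop nest.
  nthreads = (npm - 2_int64) * nblk

  ! Each private copy of fn (or fs) holds nblk real64 values.
  ! storage_size returns bits, so divide by 8 to get bytes.
  bytes_per_array = nblk * (storage_size(1.0_real64, kind=int64) / 8_int64)

  ! Two reduction arrays (fn and fs), one private copy of each per thread.
  total_bytes = 2_int64 * nthreads * bytes_per_array

  print *, 'threads (upper bound):        ', nthreads
  print *, 'bytes per private array copy: ', bytes_per_array
  print *, 'total private bytes (bound):  ', total_bytes
end program private_copy_estimate
```

With these assumed sizes the bound works out to roughly 16 GB, which shows why the footprint grows quickly: it scales with nblk squared times npm.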

You can try adding “-mcmodel=medium” so 64-bit offsets are used for large arrays, but you may be better off sticking with atomics, as the overhead of the reduction is high. Alternatively, you can interchange the loops and parallelize only the i loop, so that neither atomics nor reductions are needed.
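As a rough sketch of those two alternatives (not tested against the original code), the program below computes the same sum once with an atomic update and once with the loops interchanged. The array sizes, the fill values, and the names fn_atomic and fn_swap are made up for illustration, and explicit copyin/copy clauses stand in for the default(present) data setup in the original. Only fn is shown; fs would follow the same pattern.

```fortran
program atomic_vs_interchange
  use iso_fortran_env, only: real64
  implicit none
  ! Hypothetical sizes and data, chosen only for illustration.
  integer, parameter :: nblk = 4, nt = 8, npm = 16
  integer :: i, k
  real(real64) :: flux_t(nblk,nt,npm), dp(npm)
  real(real64) :: fn_atomic(nblk), fn_swap(nblk)

  flux_t    = 1.0_real64
  dp        = 0.5_real64
  fn_atomic = 0.0_real64
  fn_swap   = 0.0_real64

  ! Variant 1: keep both loops parallel and protect the shared
  ! update of fn_atomic(i) with an atomic.
  !$acc parallel loop collapse(2) copyin(flux_t,dp) copy(fn_atomic)
  do k = 2, npm-1
    do i = 1, nblk
      !$acc atomic update
      fn_atomic(i) = fn_atomic(i) + flux_t(i,2,k)*dp(k)
    enddo
  enddo

  ! Variant 2: interchange the loops and parallelize only i.
  ! Each i is handled by a single thread, so the k accumulation
  ! needs neither an atomic nor a reduction.
  !$acc parallel loop copyin(flux_t,dp) copy(fn_swap)
  do i = 1, nblk
    !$acc loop seq
    do k = 2, npm-1
      fn_swap(i) = fn_swap(i) + flux_t(i,2,k)*dp(k)
    enddo
  enddo

  ! Both variants should agree.
  print *, fn_atomic(1), fn_swap(1)
end program atomic_vs_interchange
```

Note that the second variant exposes only nblk-way parallelism, so it is mainly attractive when nblk is large. It also walks flux_t with a large stride in the inner loop, which may cost some performance.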

-Mat

mikostul · September 12, 2022, 9:32pm

Hi Mat,

It is “./Run: line 1: 2404805 Segmentation fault (core dumped)”

Miko

mikostul · September 13, 2022, 12:35am

It seems the problem only occurs on the CPU. The code runs fine on the GPU with the reduction.

Miko

MatColgrove · September 13, 2022, 3:04pm

Ok, does adding “-mcmodel=medium” help?

Could it be a stack overflow? i.e. does setting “OMP_STACKSIZE=” help?

It could be a compiler issue as well, in which case, can you provide a minimal reproducing example?

mikostul · September 13, 2022, 6:00pm

Hello Mat,

Neither of those helped.

Here is an example code.

program Example
!
      use iso_fortran_env
      implicit none
!
      integer, parameter :: r_typ = REAL64
!      
      integer :: j,k,i
      integer :: nblk,nt,npm
      real(r_typ), dimension(:), allocatable :: fn
      real(r_typ), dimension(:,:,:), allocatable :: flux_t
!
      nblk = 1
      nt = 513
      npm = 1025
!
      allocate (flux_t(nblk,nt,npm))
!$acc enter data create(flux_t)
!
      do concurrent (k=1:npm,j=1:nt,i=1:nblk)
        flux_t(i,j,k) = 1.0
      enddo
!
      allocate (fn(nblk))
!$acc enter data create(fn)
!
      do concurrent (i=1:nblk)
          fn(i) = 0.0
      enddo
!
!$acc parallel loop collapse(2) default(present) reduction(+:fn)
      do k=2,npm-1  
        do i=1,nblk
          fn(i) = fn(i) + flux_t(i,  2,k)
        enddo
      enddo
!
!$acc exit data delete(flux_t,fn)
      deallocate (flux_t)
      deallocate (fn)
!
end program Example

MatColgrove · September 13, 2022, 6:52pm

Thanks Miko, I filed TPR #32397 and sent it to engineering for review.

My best guess is that the array reduction is somehow corrupting the base copy of “fn”, since the segv occurs when it’s accessed after the parallel loop. Also, if I don’t initialize “fn” in a compute region (or do concurrent), then the program runs correctly. I’m not sure whether this would be a viable workaround for you:

% cat test.F90
program Example
      use iso_fortran_env
      implicit none
      integer, parameter :: r_typ = REAL64
      integer :: j,k,i
      integer :: nblk,nt,npm
      real(r_typ), dimension(:), allocatable :: fn
      real(r_typ), dimension(:,:,:), allocatable :: flux_t
      nblk = 1
      nt = 513
      npm = 1025
      allocate (flux_t(nblk,nt,npm))
      allocate (fn(nblk))
      flux_t = 1.0
#if defined(CASE1)
!$acc kernels
      fn = 0.0
!$acc end kernels
#else
! works

      fn = 0.0
#endif
!acc parallel loop collapse(2) default(present) reduction(+:fn(:nblk))
!$acc parallel loop reduction(+:fn(1:nblk))
      do k=2,npm-1
        do i=1,nblk
          fn(i) = fn(i) + flux_t(i,  2,k)
        enddo
      enddo

      print *, "HERE"
! segv here when fn is accessed
      print *, fn(1)
      deallocate (flux_t)
      deallocate (fn)
!
end program Example
% nvfortran -acc=multicore test.F90 -g -DCASE1 ; a.out
 HERE
Segmentation fault
% nvfortran -acc=multicore test.F90 -g  ; a.out
 HERE
    1023.000000000000
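Applied to the structure of the original reproducer, the same workaround might look like the sketch below. This is untested and rests on an assumption: that initializing fn on the host and then copying it in with "enter data copyin" behaves like the working case above. The "update self" before the print is added here so the host copy is current when targeting a GPU.

```fortran
program host_init_sketch
  use iso_fortran_env, only: real64
  implicit none
  integer, parameter :: nblk = 1, nt = 513, npm = 1025
  integer :: i, k
  real(real64), dimension(:), allocatable :: fn
  real(real64), dimension(:,:,:), allocatable :: flux_t

  allocate (flux_t(nblk,nt,npm))
  allocate (fn(nblk))

  ! Initialize on the host (not in a compute region or do concurrent),
  ! then copy the initialized values to the device.
  flux_t = 1.0_real64
  fn     = 0.0_real64
  !$acc enter data copyin(flux_t, fn)

  !$acc parallel loop default(present) reduction(+:fn(1:nblk))
  do k = 2, npm-1
    do i = 1, nblk
      fn(i) = fn(i) + flux_t(i,2,k)
    enddo
  enddo

  ! Bring the result back to the host before reading it.
  !$acc update self(fn)
  print *, fn(1)

  !$acc exit data delete(flux_t, fn)
  deallocate (flux_t)
  deallocate (fn)
end program host_init_sketch
```

If the workaround holds, this should print 1023, matching the working run above.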

-Mat

mikostul · September 13, 2022, 8:56pm

Thank you for the help.

Miko