OpenACC + MPI / Loop carried dependence prevents parallelization

Hi everyone, this is my first post.

I’m compiling a Fortran code with mpif90 (nvhpc 22.3). The code is very complex, so for the sake of clarity I’m showing only the parts relevant to my issue. The code uses MPI and I’m trying to accelerate it with OpenACC. This is my first attempt with a serious code; so far I have only used OpenACC directives to accelerate simple codes.

The part of the code that I’m trying to accelerate is the following:

SUBROUTINE kin
USE global_mod, ONLY: NsMAX, num_zones, zones, MINi, MAXi, MINj, MAXj, MINk, MAXk
USE common_alloc

INTEGER, VALUE :: B, i, j, k, s

DOUBLE PRECISION, VALUE :: Yi_ijk(NsMAX)
DOUBLE PRECISION, VALUE :: T_ijk, p_ijk,S_y

!--------------------------------------------------------------------------------------------------------

 !$acc data copy(i,j,k,p,p_ijk,T,T_ijk,Yi,Yi_ijk,s,S_y,NsMAX)
 !$acc parallel loop private(i,j,k,s)
 do k= MINk(BBB)-(Ghost-1), MAXk(BBB)+(Ghost-1)
  do j= MINj(BBB)-(Ghost-1), MAXj(BBB)+(Ghost-1)
   do i= MINi(BBB)-(Ghost-1), MAXi(BBB)+(Ghost-1)

    p_ijk  = p(i,j,k)
    T_ijk  = T(i,j,k)
    Yi_ijk = Yi(:,i,j,k) + 1.0d-20

     S_y = 0.0D0
     do s=1,NsMAX
       S_y = S_y + Yi_ijk(s)                    
     enddo
     do s=1,NsMAX
      if (S_y/=0.D0) then
        Yi_ijk(s) = Yi_ijk(s) / S_y
      endif
     enddo

   end do 
  end do 
 end do 
 !$acc end data

    END SUBROUTINE kin

I compile the code with:

mpif90 -c -r8 -acc=gpu -target=gpu -gpu=ccall -Mpreprocess -Mfree -Mextend -Munixlogical -Mbyteswapio -traceback -Mchkptr -Mipa=ptr -Mipa=alias -Mipa=f90ptr -Mchkstk -Mnostack_arrays -Mnofprelaxed -Mnofpapprox -Minfo=accel kin.f90

This is part of a bigger code with hundreds of files and modules.

The issue I’m dealing with is the following:

957, Generating copy(i,j,nsmax,p(:,:,:),s_y,t(:,:,:),t_ijk,p_ijk,s,k,yi_ijk(:)) [if not already present]
958, Generating NVIDIA GPU code
959, !$acc loop gang, vector(128) ! blockidx%x threadidx%x
961, !$acc loop seq
963, !$acc loop seq
968, !$acc loop seq
971, !$acc loop seq
974, !$acc loop seq
958, Generating implicit copyin(maxk(bbb),maxi(bbb),mini(bbb),maxj(bbb),mink(bbb),minj(bbb)) [if not already present]
961, Complex loop carried dependence of yi prevents parallelization
Loop carried dependence of yi_ijk prevents parallelization
Loop carried dependence of yi_ijk prevents vectorization
Loop carried backward dependence of yi_ijk prevents vectorization
Complex loop carried dependence of yi_ijk prevents parallelization
963, Complex loop carried dependence of yi prevents parallelization
Loop carried dependence of yi_ijk prevents parallelization
Loop carried backward dependence of yi_ijk prevents vectorization
968, Reference argument passing prevents parallelization:
Complex loop carried dependence of yi prevents parallelization
972, Reference argument passing prevents parallelization:
976, Reference argument passing prevents parallelization:

“Yi” is defined in a module called “common_alloc” as follows:

double precision, dimension(:,:,:,:), pointer :: Yi

I’ve tried different approaches to solve this issue and also looked for solutions in the forum. Maybe I haven’t fully understood the messages. Does anyone have any idea how I can solve it?

Thanks in advance to all!!

Welcome matteo.cimini,

The compiler is parallelizing the outer “k” loop. The messages are telling you why the compiler is unable to auto-parallelize the inner loops.

Because “Yi” is a pointer, it could be pointing to one of the other arrays, which would cause a dependency. Not that it is, but it could. In order to auto-parallelize, the compiler must prove independence, which it can’t do here.

The dependency on “Yi_ijk” is correct, and if you ran this in parallel you’d get incorrect results. All threads are using the same “Yi_ijk”, so they would be writing over each other’s values. Instead, you’ll need to put “Yi_ijk” in a private clause so each thread gets its own copy.
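To illustrate the private clause in isolation, here is a minimal sketch rather than the poster's actual routine. The module name normalize_sketch, the subroutine normalize_columns, and the variables work and total are hypothetical names made up for this example. Unlike the code in the question, it writes the normalized values back into the array so the effect is observable.

```fortran
module normalize_sketch
  implicit none
contains
  ! Hypothetical helper: scale each column of y so that it sums to 1.
  subroutine normalize_columns(y, ns, ncell)
    integer, intent(in) :: ns, ncell
    double precision, intent(inout) :: y(ns, ncell)
    double precision :: work(ns)   ! scratch array, plays the role of Yi_ijk
    double precision :: total      ! scalar, plays the role of S_y
    integer :: c, s

    ! work is an array, so it must be listed in private() for each
    ! iteration to get its own copy. The scalars total, c and s are
    ! left out of the clause, as suggested in the reply above.
    !$acc parallel loop private(work) copy(y)
    do c = 1, ncell
      work = y(:, c) + 1.0d-20
      total = 0.0d0
      do s = 1, ns
        total = total + work(s)
      end do
      do s = 1, ns
        y(s, c) = work(s) / total
      end do
    end do
  end subroutine normalize_columns
end module normalize_sketch
```

Without OpenACC enabled the directive is an ordinary comment, so the same source also runs serially, which is handy for checking results against the GPU build.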

Other items:

Scalars are private by default, so no need to have them in a private clause. Doesn’t hurt, but not needed.

However, a variable can’t be in both a copy clause and private clause. You’ll want to remove the scalars from the copy clause.

I’d recommend you collapse the outer loops to expose more parallelism.

Finally, the compiler will want to auto-parallelize the inner “s” loops. This may or may not be beneficial depending on the value of NsMAX. If it is small, you may want to add the flag “-acc=noautopar” so the array syntax and inner loops aren’t parallelized.
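As a possible per-loop alternative to the compiler flag (this is not something suggested in the reply itself), an inner loop can be marked with the standard OpenACC "loop seq" directive so that it stays sequential inside each parallel iteration. The sketch below assumes a small inner trip count; the names seq_inner_sketch, column_sums, and sums are hypothetical.

```fortran
module seq_inner_sketch
  implicit none
contains
  ! Hypothetical helper: sum each column of y into sums.
  subroutine column_sums(y, ns, ncell, sums)
    integer, intent(in) :: ns, ncell
    double precision, intent(in) :: y(ns, ncell)
    double precision, intent(out) :: sums(ncell)
    double precision :: total
    integer :: c, s

    ! Only the outer loop over cells is parallelized.
    !$acc parallel loop copyin(y) copyout(sums)
    do c = 1, ncell
      total = 0.0d0
      ! Keep the short inner loop sequential within each iteration.
      !$acc loop seq
      do s = 1, ns
        total = total + y(s, c)
      end do
      sums(c) = total
    end do
  end subroutine column_sums
end module seq_inner_sketch
```

The flag applies to the whole compilation unit, while the directive affects only the loop it annotates, so the directive may be the better fit when only some inner loops are short.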

Here are my suggested changes:

% cat kin.F90
module global_mod

  integer :: NsMAX, num_zones, zones
  integer, dimension(3) :: MINi, MAXi, MINj, MAXj, MINk, MAXk

end module global_mod

module common_alloc

double precision, dimension(:,:,:,:), pointer :: Yi
double precision, dimension(:,:,:), allocatable :: p,T
integer :: BBB, Ghost

end module common_alloc


SUBROUTINE kin
USE global_mod, ONLY: NsMAX, num_zones, zones, MINi, MAXi, MINj, MAXj, MINk, MAXk
USE common_alloc
implicit none
INTEGER, VALUE :: B, i, j, k, s

DOUBLE PRECISION, VALUE :: Yi_ijk(NsMAX)
DOUBLE PRECISION, VALUE :: T_ijk, p_ijk, S_y

!--------------------------------------------------------------------------------------------------------

 !acc data copy(i,j,k,p,p_ijk,T,T_ijk,Yi,Yi_ijk,s,S_y,NsMAX)
 !acc parallel loop private(i,j,k,s)

 !$acc parallel loop collapse(3) private(Yi_ijk) copy(p,T,Yi)
 do k= MINk(BBB)-(Ghost-1), MAXk(BBB)+(Ghost-1)
  do j= MINj(BBB)-(Ghost-1), MAXj(BBB)+(Ghost-1)
   do i= MINi(BBB)-(Ghost-1), MAXi(BBB)+(Ghost-1)

    p_ijk  = p(i,j,k)
    T_ijk  = T(i,j,k)
    Yi_ijk = Yi(:,i,j,k) + 1.0d-20

     S_y = 0.0D0
     do s=1,NsMAX
       S_y = S_y + Yi_ijk(s)
     enddo

     do s=1,NsMAX
      if (S_y/=0.D0) then
        Yi_ijk(s) = Yi_ijk(s) / S_y
      endif
     enddo

   end do
  end do
 end do

    END SUBROUTINE kin
% nvfortran -c kin.F90 -acc -Minfo=accel
kin:
     31, Generating implicit copyin(mink(bbb)) [if not already present]
         Generating copy(p(:,:,:),t(:,:,:),yi(:,:,:,:)) [if not already present]
         Generating implicit firstprivate(ghost,bbb,nsmax)
         Generating NVIDIA GPU code
         32, !$acc loop gang collapse(3) ! blockidx%x
         33,   ! blockidx%x collapsed
         34,   ! blockidx%x collapsed
         38, !$acc loop vector(128) ! threadidx%x
         41, !$acc loop vector(128) ! threadidx%x
             Generating implicit reduction(+:s_y)
         45, !$acc loop vector(128) ! threadidx%x
     31, Generating implicit copyin(maxk(bbb),maxi(bbb),mini(bbb),maxj(bbb),minj(bbb)) [if not already present]
     34, Generating implicit firstprivate(s_y,s)
     38, Loop is parallelizable
     41, Loop is parallelizable
     45, Loop is parallelizable
% nvfortran -c kin.F90 -acc=noautopar -Minfo=accel
kin:
     31, Generating implicit copyin(mink(bbb)) [if not already present]
         Generating copy(p(:,:,:),t(:,:,:),yi(:,:,:,:)) [if not already present]
         Generating implicit firstprivate(ghost,bbb,nsmax)
         Generating NVIDIA GPU code
         32, !$acc loop gang, vector(128) collapse(3) ! blockidx%x threadidx%x
         33,   ! blockidx%x threadidx%x collapsed
         34,   ! blockidx%x threadidx%x collapsed
         38, !$acc loop seq
         41, !$acc loop seq
         45, !$acc loop seq
     31, Generating implicit copyin(maxk(bbb),maxi(bbb),mini(bbb),maxj(bbb),minj(bbb)) [if not already present]
     34, Generating implicit firstprivate(s_y,s)
     38, Loop is parallelizable
     41, Loop is parallelizable
     45, Loop is parallelizable

Hi MatColgrove,

First, thank you for all your explanations and suggestions. It is all much clearer to me now and I really appreciate it. I implemented your suggested changes and the situation is much better. In particular, I replaced my previous directives with the following:

!$acc parallel loop collapse(3) private(Yi_ijk) copy(p,T,Yi)

Now when I compile the code, I get the following:

960, Generating implicit copyin(mink(bbb)) [if not already present]
         Generating copy(p(:,:,:),t(:,:,:)) [if not already present]
         Generating NVIDIA GPU code
        961, !$acc loop gang, vector(128) collapse(3) ! blockidx%x threadidx%x
        963,   ! blockidx%x threadidx%x collapsed
        965,   ! blockidx%x threadidx%x collapsed
        970, !$acc loop seq
        973, !$acc loop seq
        976, !$acc loop seq
    960, Generating implicit copyin(maxk(bbb),maxi(bbb),mini(bbb),maxj(bbb),minj(bbb)) [if not already present]
         Generating implicit copy(yi_ijk(:)) [if not already present]
         Generating copy(yi(:,:,:,:)) [if not already present]
    970, Reference argument passing prevents parallelization: 
         Complex loop carried dependence of yi prevents parallelization
    974, Reference argument passing prevents parallelization: 
    978, Reference argument passing prevents parallelization:

It still has the “Yi” dependence. Could it be because the “common_alloc” module is located in another .F90 file?

Thank you again!!

Hi MatColgrove,

I found the problem in my code. The “-Mchkptr” flag was causing the last issue.

Thank you again!!
