Compile error and wrong diagnosis of loop-carried dependence

My reproducer is similar to Michael Wolfe’s famous Jacobi example, but it has a total of four kernel loops to implement a Red-Black SOR solver with separate red and black arrays.

Compilation fails with

     PGF90-W-0155 … Missing branch target block

which seems to be a known problem.

Besides that, the diagnosis of a loop-carried dependence is definitely incorrect: within each iteration, xred is read and written only at the same (ix,iy), and the stencil on the right-hand side references only xblk. There is no dependence in this type of loop:

!$acc do parallel
        do ix = sps,spe
          xx = xred(ix,iy)
          c = bred( ix,iy) - xx              &
            - ared2(ix,iy) * xblk(ix  ,iy-1) &
            - ared3(ix,iy) * xblk(ix-1,iy  ) &
            - ared4(ix,iy) * xblk(ix  ,iy  ) &
            - ared5(ix,iy) * xblk(ix  ,iy+1)
          c = omega_f * c
          errff = max(errff, abs(c/xx))
          xred(ix,iy) = xx + c
        end do

Compile flags are:
pgf90 -ta=nvidia:cuda4.0 -tp=nehalem-64 -r8 -Minfo -fast -c rbsor2d.f90
Partial -Minfo compiler feedback is:

     36, Accelerator scalar kernel generated
     37, Complex loop carried dependence of 'xred' prevents parallelization
         Loop carried dependence due to exposed use of 'xred(:,i1+1)' prevents parallelization
         Inner sequential loop scheduled on host
         Accelerator scalar kernel generated
         Generated 3 alternate versions of the loop
         Generated vector sse code for the loop
         Generated 8 prefetch instructions for the loop

Here is the complete reproducer. Any hints are welcome.

subroutine red_black_sor(ared2,ared3,ared4,ared5,ablk2,ablk3,ablk4,ablk5, &
                         bred,bblk,rs_red,re_red,lt_red,rs_blk,re_blk,lt_blk, &
                         xred,xblk,omega_f,nrow,itsor,epsor)
  implicit none
! ---------------------------------------------------------------------
  real, dimension(:,:), intent(in)    :: ared2,ared3,ared4,ared5,ablk2,ablk3,ablk4,ablk5
  real, dimension(:,:), intent(in)    :: bred, bblk
  real, dimension(:,:), intent(inout) :: xred, xblk
  integer, dimension(:), intent(in) :: rs_red, re_red, lt_red, rs_blk, re_blk, lt_blk

  real, intent(in)       :: omega_f
  real, intent(inout)    :: epsor
  integer, intent(in)    :: nrow
  integer, intent(inout) :: itsor
! ---------------------------------------------------------------------
  integer :: iter,ix,iy,sps,spe
  real    :: c,errff,xx
! ---------------------------------------------------------------------

!$acc data region local (iter,ix,iy,xx,c,errff,sps,spe)                       &
!$acc             copyin(ared2,ared3,ared4,ared5,ablk2,ablk3,ablk4,ablk5,     &
!$acc                    bred,bblk,rs_red,re_red,lt_red,rs_blk,re_blk,lt_blk) &
!$acc             copy(itsor,xred,xblk) 

      do iter = 1, 10000
        errff = 0.

!$acc region
!$acc do seq
        do iy = 1,nrow ! RED   Split X arrays require 2 different loop versions

            sps = rs_red(iy)
            spe = re_red(iy)
            if (lt_red(iy) == 0) then 
!$acc do parallel
              do ix = sps,spe
                xx = xred(ix,iy)
                c = bred( ix,iy) - xx              &
                  - ared2(ix,iy) * xblk(ix  ,iy-1) &
                  - ared3(ix,iy) * xblk(ix-1,iy  ) &
                  - ared4(ix,iy) * xblk(ix  ,iy  ) &
                  - ared5(ix,iy) * xblk(ix  ,iy+1)
                c = omega_f * c
                errff = max(errff, abs(c/xx))
                xred(ix,iy) = xx + c
              end do  
           else
!$acc do parallel
              do ix = sps,spe
                xx = xred(ix,iy)
                c = bred( ix,iy) - xx              &
                  - ared2(ix,iy) * xblk(ix  ,iy-1) &
                  - ared3(ix,iy) * xblk(ix  ,iy  ) &
                  - ared4(ix,iy) * xblk(ix+1,iy  ) &
                  - ared5(ix,iy) * xblk(ix  ,iy+1)
                c = omega_f * c
                errff = max(errff, abs(c/xx))
                xred(ix,iy) = xx + c
              end do  
            end if
        end do
!$acc end region

!$acc region
!$acc do seq
        do iy = 1,nrow ! BLACK Split X arrays require 2 different loop versions

            sps = rs_blk(iy)
            spe = re_blk(iy)
            if (lt_blk(iy) == 0) then 
!$acc do parallel
              do ix = sps,spe
                xx = xblk(ix,iy)
                c = bblk( ix,iy) - xx              &
                  - ablk2(ix,iy) * xred(ix  ,iy-1) &
                  - ablk3(ix,iy) * xred(ix-1,iy  ) &
                  - ablk4(ix,iy) * xred(ix  ,iy  ) &
                  - ablk5(ix,iy) * xred(ix  ,iy+1)
                c = omega_f * c
                errff = max(errff, abs(c/xx))
                xblk(ix,iy) = xx + c
              end do  
           else
!$acc do parallel
              do ix = sps,spe
                xx = xblk(ix,iy)
                c = bblk( ix,iy) - xx              &
                  - ablk2(ix,iy) * xred(ix  ,iy-1) &
                  - ablk3(ix,iy) * xred(ix  ,iy  ) &
                  - ablk4(ix,iy) * xred(ix+1,iy  ) &
                  - ablk5(ix,iy) * xred(ix  ,iy+1)
                c = omega_f * c
                errff = max(errff, abs(c/xx))
                xblk(ix,iy) = xx + c
              end do  
            end if
        end do

!$acc end region

        if (errff <= epsor) exit ! Converged !

!     ------------------------
!     END OF JACOBI/SOR LOOP
!     ------------------------
      end do
      itsor = itsor + iter

!$acc end data region
 
end subroutine red_black_sor

Hi riedmann,

This one is tough for the compiler to accelerate given how it’s written. Having the inner ix loops get their bounds from look-up tables is problematic, as is the use of the if statements. Ideally, the ix and iy loops could be interchanged, but because of this dependency it’s simply not possible.

Basically, you’ll be able to accelerate only either the “ix” loops or the “iy” loop. I agree that the xred dependence message is misleading, but it is an artifact of how the compiler is (albeit unsuccessfully) transforming your code.

The good news is that it’s an easy change to parallelize the ix loops:

% cat red_black_sor1.f90
subroutine red_black_sor(ared2,ared3,ared4,ared5,ablk2,ablk3,ablk4,ablk5, &
                         bred,bblk,rs_red,re_red,lt_red,rs_blk,re_blk,lt_blk, &
                         xred,xblk,omega_f,nrow,itsor,epsor)
  implicit none
! ---------------------------------------------------------------------
  real, dimension(:,:), intent(in)    :: ared2,ared3,ared4,ared5,ablk2,ablk3,ablk4,ablk5
  real, dimension(:,:), intent(in)    :: bred, bblk
  real, dimension(:,:), intent(inout) :: xred, xblk
  integer, dimension(:), intent(in) :: rs_red, re_red, lt_red, rs_blk, re_blk, lt_blk

  real, intent(in)       :: omega_f
  real, intent(inout)    :: epsor
  integer, intent(in)    :: nrow
  integer, intent(inout) :: itsor
! ---------------------------------------------------------------------
  integer :: iter,ix,iy,sps,spe
  real    :: c,errff,xx
! ---------------------------------------------------------------------
! NOTE that you don't need to use local for scalars.  They are privatized by default.
! Also, there is no need to copy over the index look-up tables, since the iy loop
! (which reads them) still runs on the host.
!$acc data region                  &
!$acc             copyin(ared2,ared3,ared4,ared5,ablk2,ablk3,ablk4,ablk5,     &
!$acc                    bred,bblk) &
!$acc             copy(itsor,xred,xblk)

      do iter = 1, 10000
        errff = 0.

        do iy = 1,nrow ! RED   Split X arrays require 2 different loop versions

            sps = rs_red(iy)
            spe = re_red(iy)
            if (lt_red(iy) == 0) then
!$acc region do parallel
              do ix = sps,spe
                xx = xred(ix,iy)
                c = bred( ix,iy) - xx              &
                  - ared2(ix,iy) * xblk(ix  ,iy-1) &
                  - ared3(ix,iy) * xblk(ix-1,iy  ) &
                  - ared4(ix,iy) * xblk(ix  ,iy  ) &
                  - ared5(ix,iy) * xblk(ix  ,iy+1)
                c = omega_f * c
                errff = max(errff, abs(c/xx))
                xred(ix,iy) = xx + c
              end do 
           else
!$acc region do parallel
              do ix = sps,spe
                xx = xred(ix,iy)
                c = bred( ix,iy) - xx              &
                  - ared2(ix,iy) * xblk(ix  ,iy-1) &
                  - ared3(ix,iy) * xblk(ix  ,iy  ) &
                  - ared4(ix,iy) * xblk(ix+1,iy  ) &
                  - ared5(ix,iy) * xblk(ix  ,iy+1)
                c = omega_f * c
                errff = max(errff, abs(c/xx))
                xred(ix,iy) = xx + c
              end do 
            end if
        end do

        do iy = 1,nrow ! BLACK Split X arrays require 2 different loop versions

            sps = rs_blk(iy)
            spe = re_blk(iy)
            if (lt_blk(iy) == 0) then
!$acc region do parallel
              do ix = sps,spe
                xx = xblk(ix,iy)
                c = bblk( ix,iy) - xx              &
                  - ablk2(ix,iy) * xred(ix  ,iy-1) &
                  - ablk3(ix,iy) * xred(ix-1,iy  ) &
                  - ablk4(ix,iy) * xred(ix  ,iy  ) &
                  - ablk5(ix,iy) * xred(ix  ,iy+1)
                c = omega_f * c
                errff = max(errff, abs(c/xx))
                xblk(ix,iy) = xx + c
              end do 
           else
!$acc region do parallel
              do ix = sps,spe
                xx = xblk(ix,iy)
                c = bblk( ix,iy) - xx              &
                  - ablk2(ix,iy) * xred(ix  ,iy-1) &
                  - ablk3(ix,iy) * xred(ix  ,iy  ) &
                  - ablk4(ix,iy) * xred(ix+1,iy  ) &
                  - ablk5(ix,iy) * xred(ix  ,iy+1)
                c = omega_f * c
                errff = max(errff, abs(c/xx))
                xblk(ix,iy) = xx + c
              end do 
            end if
        end do


        if (errff <= epsor) exit ! Converged !

!     ------------------------
!     END OF JACOBI/SOR LOOP
!     ------------------------
      end do
      itsor = itsor + iter

!$acc end data region
 
end subroutine red_black_sor

To parallelize the “iy” loop, you’ll need to make a few code modifications since accelerated reductions are only available for tightly nested loops.


% cat red_black_sor.2.f90
subroutine red_black_sor(ared2,ared3,ared4,ared5,ablk2,ablk3,ablk4,ablk5, &
                         bred,bblk,rs_red,re_red,lt_red,rs_blk,re_blk,lt_blk, &
                         xred,xblk,omega_f,nrow,itsor,epsor)
  implicit none
! ---------------------------------------------------------------------
  real, dimension(:,:), intent(in)    :: ared2,ared3,ared4,ared5,ablk2,ablk3,ablk4,ablk5
  real, dimension(:,:), intent(in)    :: bred, bblk
  real, dimension(:,:), intent(inout) :: xred, xblk
  integer, dimension(:), intent(in) :: rs_red, re_red, lt_red, rs_blk, re_blk, lt_blk

  real, intent(in)       :: omega_f
  real, intent(inout)    :: epsor
  integer, intent(in)    :: nrow
  integer, intent(inout) :: itsor
! ---------------------------------------------------------------------
  integer :: iter,ix,iy,sps,spe,esize  ! esize must be an integer to serve as an allocation bound
  real    :: c,errff,xx
  real, allocatable, dimension(:) :: errffa

  esize = max(maxval(re_red),maxval(re_blk))
  allocate(errffa(esize))
! ---------------------------------------------------------------------

!$acc data region                                                             &
!$acc             copyin(ared2,ared3,ared4,ared5,ablk2,ablk3,ablk4,ablk5,     &
!$acc                    bred,bblk,rs_red,re_red,lt_red,rs_blk,re_blk,lt_blk) &
!$acc             copy(itsor,xred,xblk)

      do iter = 1, 10000
        errff = 0.

!$acc region
!$acc do kernel private(errffa)
        do iy = 1,nrow ! RED   Split X arrays require 2 different loop versions
            sps = rs_red(iy)
            spe = re_red(iy)
            errffa = 0.
            if (lt_red(iy) == 0) then
              do ix = sps,spe
                xx = xred(ix,iy)
                c = bred( ix,iy) - xx              &
                  - ared2(ix,iy) * xblk(ix  ,iy-1) &
                  - ared3(ix,iy) * xblk(ix-1,iy  ) &
                  - ared4(ix,iy) * xblk(ix  ,iy  ) &
                  - ared5(ix,iy) * xblk(ix  ,iy+1)
                c = omega_f * c
                errffa(ix) = abs(c/xx)
                xred(ix,iy) = xx + c
              end do 
           else
              do ix = sps,spe
                xx = xred(ix,iy)
                c = bred( ix,iy) - xx              &
                  - ared2(ix,iy) * xblk(ix  ,iy-1) &
                  - ared3(ix,iy) * xblk(ix  ,iy  ) &
                  - ared4(ix,iy) * xblk(ix+1,iy  ) &
                  - ared5(ix,iy) * xblk(ix  ,iy+1)
                c = omega_f * c
                errffa(ix) = abs(c/xx)
                xred(ix,iy) = xx + c
              end do 
            end if
            errff = max(errff,maxval(errffa))
        end do
!$acc end region

!$acc region
!$acc do kernel private(errffa)
        do iy = 1,nrow ! BLACK Split X arrays require 2 different loop versions

            sps = rs_blk(iy)
            spe = re_blk(iy)
            errffa = 0.
            if (lt_blk(iy) == 0) then
              do ix = sps,spe
                xx = xblk(ix,iy)
                c = bblk( ix,iy) - xx              &
                  - ablk2(ix,iy) * xred(ix  ,iy-1) &
                  - ablk3(ix,iy) * xred(ix-1,iy  ) &
                  - ablk4(ix,iy) * xred(ix  ,iy  ) &
                  - ablk5(ix,iy) * xred(ix  ,iy+1)
                c = omega_f * c
                errffa(ix) = abs(c/xx)
                xblk(ix,iy) = xx + c
              end do 
           else
              do ix = sps,spe
                xx = xblk(ix,iy)
                c = bblk( ix,iy) - xx              &
                  - ablk2(ix,iy) * xred(ix  ,iy-1) &
                  - ablk3(ix,iy) * xred(ix  ,iy  ) &
                  - ablk4(ix,iy) * xred(ix+1,iy  ) &
                  - ablk5(ix,iy) * xred(ix  ,iy+1)
                c = omega_f * c
                errffa(ix) = abs(c/xx)
                xblk(ix,iy) = xx + c
              end do 
            end if
          errff = max(errff,maxval(errffa))
        end do

!$acc end region

        if (errff <= epsor) exit ! Converged !

!     ------------------------
!     END OF JACOBI/SOR LOOP
!     ------------------------
      end do
      itsor = itsor + iter

!$acc end data region
 
end subroutine red_black_sor

Hope this helps,
Mat

Hello Mat,
thank you for this explanation; it helped me rewrite the code. I replaced the bounds tables with a mask array to allow loop interchange, and I joined the two regions into one. The compiler appears happy with that.

Now it runs OK, but slowly. System load indicates that a lot of data transfer is going on. Normally the entire outermost iteration loop should be able to execute on the GPU without data transfers in between.
Is there anything else I need to do?

Michael

Hi Michael,

Have you profiled the code to determine where the performance issue occurs? Add the “time” sub-option (-ta=nvidia,time), use PGI’s pgcollect utility, or use the CUDA Profiler (i.e., set the environment variable CUDA_PROFILE=1).

The use of the data region should have removed almost all of the data movement from the iteration loop. Though if you call this routine many times, I could see how the data-movement cost would add up. If that is the case, the next step would be to move the data region up to the next level in the code, and then pass the device data into the subroutine via the reflected directive.

- Mat