x_solve from SP benchmark

We are trying to accelerate the NPB-3.3 version of the SP benchmark as a precursor to doing the same to other similar (and longer) codes. I have had success with about half of the code. A good portion of the remainder looks like the subroutine below. When I compile it with -ta=nvidia,cc20,cuda4.0 I receive these messages in the compiler output:

171, Generating copy(lhs(1:5,0:grid_points-1,1:ny2))
Generating compute capability 2.0 binary
172, Loop is parallelizable
173, Loop carried dependence of ‘lhs’ prevents parallelization
Loop carried backward dependence of ‘lhs’ prevents vectorization
Loop carried dependence of ‘rhs’ prevents parallelization
Loop carried backward dependence of ‘rhs’ prevents vectorization
Accelerator kernel generated
172, !$acc do parallel, vector(32) ! blockidx%x threadidx%x
173, !$acc do seq
Non-stride-1 accesses for array ‘lhs’

and

236, Generating copy(rhs(0:grid_points-1,1:ny2,k,4:5))
Generating compute capability 2.0 binary
237, Loop is parallelizable
238, Loop carried dependence of ‘lhsp’ prevents parallelization
Loop carried backward dependence of ‘lhsp’ prevents vectorization
Loop carried dependence of ‘rhs’ prevents parallelization
Loop carried backward dependence of ‘rhs’ prevents vectorization
Loop carried dependence of ‘lhsm’ prevents parallelization
Loop carried backward dependence of ‘lhsm’ prevents vectorization
Inner sequential loop scheduled on accelerator
Accelerator kernel generated
237, !$acc do parallel, vector(32) ! blockidx%x threadidx%x
238, !$acc do seq
Non-stride-1 accesses for array ‘lhsp’
Non-stride-1 accesses for array ‘rhs’
Non-stride-1 accesses for array ‘lhsm’
CC 2.0 : 26 registers; 4 shared, 92 constant, 0 local memory bytes; 16% occupancy

The code runs very slowly; pgprof claims the max accelerator region time is 876 sec. (which cannot possibly be correct, since the entire program runs its 400 iterations in about 450 sec.). The corresponding max kernel time is 29 sec.

In addition, the results of this function are not correct, as the overall validation for the program output fails.

As you can see, this version was already parallelized with OpenMP. In other cases the OpenMP directives furnished useful clues for writing the PGI directives, but in this case I don't know how, for instance, to implement the threadprivate attribute for the 'cv' array.

I have tried several variations of directives, including designating the entire function as one region. I have also been able to designate the first two loops inside the initial 'k' loop for acceleration, with correct output but again very slow execution. I don't know how to interpret the compiler output, and apparently I don't understand the profiler data either. Any assistance appreciated.

       subroutine x_solve
#undef USE_OMP
!---------------------------------------------------------------------
!---------------------------------------------------------------------

!---------------------------------------------------------------------
! this function performs the solution of the approximate factorization
! step in the x-direction for all five matrix components
! simultaneously. The Thomas algorithm is employed to solve the
! systems for the x-lines. Boundary conditions are non-periodic
!---------------------------------------------------------------------

       implicit none

       integer problem_size
       parameter (problem_size=36)
       integer IMAX, JMAX, KMAX, IMAXP, JMAXP
       parameter (IMAX=problem_size,JMAX=problem_size,KMAX=problem_size)
       parameter (IMAXP=IMAX/2*2,JMAXP=JMAX/2*2)
! externally defined constants, initialized elsewhere
       double precision c3c4, c1c5, c2dttx1, con43, 
     >                  comz1, comz4, comz5, comz6,                  
     >                  dttx1, dttx2, dx1, dx2, dx5, dxmax


! included here for completeness; this _is_ allocated before entry
       double precision, dimension (:,:,:,:), allocatable::rhs
       double precision, dimension (  :,:,:), allocatable::rho_i
       double precision, dimension (  :,:,:), allocatable::us
       double precision, dimension (  :,:,:), allocatable::speed
       double precision, target :: rhon(0:problem_size-1) 

       double precision
     >   lhs (5,0:IMAXP,0:IMAXP), 
     >   lhsp(5,0:IMAXP,0:IMAXP), 
     >   lhsm(5,0:IMAXP,0:IMAXP)
       integer grid_points(3), nx2, ny2, nz2
	   
       integer i, j, k, i1, i2, m, ni
       double precision  ru1, fac1, fac2
       double precision,save :: cv(0:problem_size-1)
#if USE_OMP
!$omp threadprivate(cv)
#endif

!---------------------------------------------------------------------
!---------------------------------------------------------------------

#if USE_OMP
!$omp parallel do default(shared) private(i,j,k,i1,i2,m,
!$omp&    ru1,fac1,fac2)
#endif

!$acc data region copyin(rho_i, us, grid_points, speed)
!$acc&  copy(lhsm, lhsp)
       ni = nx2 + 1
       do  k = 1, nz2
             do j = 1, ny2
                do   m = 1, 5
                   lhs (m,0,j) = 0.0d0
                   lhsp(m,0,j) = 0.0d0
                   lhsm(m,0,j) = 0.0d0
                   lhs (m,ni,j) = 0.0d0
                   lhsp(m,ni,j) = 0.0d0
                   lhsm(m,ni,j) = 0.0d0
                end do
                lhs (3,0,j) = 1.0d0
                lhsp(3,0,j) = 1.0d0
                lhsm(3,0,j) = 1.0d0
                lhs (3,ni,j) = 1.0d0
                lhsp(3,ni,j) = 1.0d0
                lhsm(3,ni,j) = 1.0d0
             end do

!---------------------------------------------------------------------
! Computes the left hand side for the three x-factors  
!---------------------------------------------------------------------

!---------------------------------------------------------------------
!      first fill the lhs for the u-eigenvalue                   
!---------------------------------------------------------------------
!$acc region do parallel private(cv, rhon) copy(lhs)
          do  j = 1, ny2
             do  i = 0, grid_points(1)-1
                ru1 = c3c4*rho_i(i,j,k)
                cv(i) = us(i,j,k)
                rhon(i) = dmax1(dx2+con43*ru1, 
     >                          dx5+c1c5*ru1,
     >                          dxmax+ru1,
     >                          dx1)
             end do

             do  i = 1, nx2
                lhs(1,i,j) =   0.0d0
                lhs(2,i,j) = - dttx2 * cv(i-1) - dttx1 * rhon(i-1)
                lhs(3,i,j) =   1.0d0 + c2dttx1 * rhon(i)
                lhs(4,i,j) =   dttx2 * cv(i+1) - dttx1 * rhon(i+1)
                lhs(5,i,j) =   0.0d0
             end do
          end do
!---------------------------------------------------------------------
!      add fourth order dissipation                             
!---------------------------------------------------------------------
!$acc region do parallel
          do  j = 1, ny2
             i = 1
             lhs(3,i,j) = lhs(3,i,j) + comz5
             lhs(4,i,j) = lhs(4,i,j) - comz4
             lhs(5,i,j) = lhs(5,i,j) + comz1
  
             lhs(2,i+1,j) = lhs(2,i+1,j) - comz4
             lhs(3,i+1,j) = lhs(3,i+1,j) + comz6
             lhs(4,i+1,j) = lhs(4,i+1,j) - comz4
             lhs(5,i+1,j) = lhs(5,i+1,j) + comz1
          end do

          do  j = 1, ny2
             do   i=3, grid_points(1)-4
                lhs(1,i,j) = lhs(1,i,j) + comz1
                lhs(2,i,j) = lhs(2,i,j) - comz4
                lhs(3,i,j) = lhs(3,i,j) + comz6
                lhs(4,i,j) = lhs(4,i,j) - comz4
                lhs(5,i,j) = lhs(5,i,j) + comz1
             end do
          end do

!$acc region do parallel
          do  j = 1, ny2
             i = grid_points(1)-3
             lhs(1,i,j) = lhs(1,i,j) + comz1
             lhs(2,i,j) = lhs(2,i,j) - comz4
             lhs(3,i,j) = lhs(3,i,j) + comz6
             lhs(4,i,j) = lhs(4,i,j) - comz4

             lhs(1,i+1,j) = lhs(1,i+1,j) + comz1
             lhs(2,i+1,j) = lhs(2,i+1,j) - comz4
             lhs(3,i+1,j) = lhs(3,i+1,j) + comz5
          end do

!---------------------------------------------------------------------
!      subsequently, fill the other factors (u+c), (u-c) by adding to 
!      the first  
!---------------------------------------------------------------------
!$acc region do parallel
          do  j = 1, ny2
             do   i = 1, nx2
                lhsp(1,i,j) = lhs(1,i,j)
                lhsp(2,i,j) = lhs(2,i,j) - 
     >                            dttx2 * speed(i-1,j,k)
                lhsp(3,i,j) = lhs(3,i,j)
                lhsp(4,i,j) = lhs(4,i,j) + 
     >                            dttx2 * speed(i+1,j,k)
                lhsp(5,i,j) = lhs(5,i,j)
                lhsm(1,i,j) = lhs(1,i,j)
                lhsm(2,i,j) = lhs(2,i,j) + 
     >                            dttx2 * speed(i-1,j,k)
                lhsm(3,i,j) = lhs(3,i,j)
                lhsm(4,i,j) = lhs(4,i,j) - 
     >                            dttx2 * speed(i+1,j,k)
                lhsm(5,i,j) = lhs(5,i,j)
             end do
          end do

!---------------------------------------------------------------------
!                          FORWARD ELIMINATION  
!---------------------------------------------------------------------

!---------------------------------------------------------------------
!      perform the Thomas algorithm; first, FORWARD ELIMINATION     
!---------------------------------------------------------------------
!$acc region do parallel private(rhs)
          do  j = 1, ny2
             do    i = 0, grid_points(1)-3
                i1 = i  + 1
                i2 = i  + 2
                fac1      = 1.d0/lhs(3,i,j)
                lhs(4,i,j)  = fac1*lhs(4,i,j)
                lhs(5,i,j)  = fac1*lhs(5,i,j)
                do    m = 1, 3
                   rhs(i,j,k,m) = fac1*rhs(i,j,k,m)
                end do
                lhs(3,i1,j) = lhs(3,i1,j) -
     >                         lhs(2,i1,j)*lhs(4,i,j)
                lhs(4,i1,j) = lhs(4,i1,j) -
     >                         lhs(2,i1,j)*lhs(5,i,j)
                do    m = 1, 3
                   rhs(i1,j,k,m) = rhs(i1,j,k,m) -
     >                         lhs(2,i1,j)*rhs(i,j,k,m)
                end do
                lhs(2,i2,j) = lhs(2,i2,j) -
     >                         lhs(1,i2,j)*lhs(4,i,j)
                lhs(3,i2,j) = lhs(3,i2,j) -
     >                         lhs(1,i2,j)*lhs(5,i,j)
                do    m = 1, 3
                   rhs(i2,j,k,m) = rhs(i2,j,k,m) -
     >                         lhs(1,i2,j)*rhs(i,j,k,m)
                end do
             end do
          end do

!---------------------------------------------------------------------
!      The last two rows in this grid block are a bit different, 
!      since they do not have two more rows available for the
!      elimination of off-diagonal entries
!---------------------------------------------------------------------
!$acc region do parallel
          do  j = 1, ny2
             i  = grid_points(1)-2
             i1 = grid_points(1)-1
             fac1      = 1.d0/lhs(3,i,j)
             lhs(4,i,j)  = fac1*lhs(4,i,j)
             lhs(5,i,j)  = fac1*lhs(5,i,j)
             do    m = 1, 3
                rhs(i,j,k,m) = fac1*rhs(i,j,k,m)
             end do
             lhs(3,i1,j) = lhs(3,i1,j) -
     >                      lhs(2,i1,j)*lhs(4,i,j)
             lhs(4,i1,j) = lhs(4,i1,j) -
     >                      lhs(2,i1,j)*lhs(5,i,j)
             do    m = 1, 3
                rhs(i1,j,k,m) = rhs(i1,j,k,m) -
     >                      lhs(2,i1,j)*rhs(i,j,k,m)
             end do
!---------------------------------------------------------------------
!            scale the last row immediately 
!---------------------------------------------------------------------
             fac2             = 1.d0/lhs(3,i1,j)
             do    m = 1, 3
                rhs(i1,j,k,m) = fac2*rhs(i1,j,k,m)
             end do
          end do

!---------------------------------------------------------------------
!      do the u+c and the u-c factors                 
!---------------------------------------------------------------------
!$acc region do parallel
          do  j = 1, ny2
             do    i = 0, grid_points(1)-3
                i1 = i  + 1
                i2 = i  + 2
                m = 4
                fac1       = 1.d0/lhsp(3,i,j)
                lhsp(4,i,j)  = fac1*lhsp(4,i,j)
                lhsp(5,i,j)  = fac1*lhsp(5,i,j)
                rhs(i,j,k,m) = fac1*rhs(i,j,k,m)
                lhsp(3,i1,j) = lhsp(3,i1,j) -
     >                        lhsp(2,i1,j)*lhsp(4,i,j)
                lhsp(4,i1,j) = lhsp(4,i1,j) -
     >                        lhsp(2,i1,j)*lhsp(5,i,j)
                rhs(i1,j,k,m) = rhs(i1,j,k,m) -
     >                        lhsp(2,i1,j)*rhs(i,j,k,m)
                lhsp(2,i2,j) = lhsp(2,i2,j) -
     >                        lhsp(1,i2,j)*lhsp(4,i,j)
                lhsp(3,i2,j) = lhsp(3,i2,j) -
     >                        lhsp(1,i2,j)*lhsp(5,i,j)
                rhs(i2,j,k,m) = rhs(i2,j,k,m) -
     >                        lhsp(1,i2,j)*rhs(i,j,k,m)
                m = 5
                fac1       = 1.d0/lhsm(3,i,j)
                lhsm(4,i,j)  = fac1*lhsm(4,i,j)
                lhsm(5,i,j)  = fac1*lhsm(5,i,j)
                rhs(i,j,k,m) = fac1*rhs(i,j,k,m)
                lhsm(3,i1,j) = lhsm(3,i1,j) -
     >                        lhsm(2,i1,j)*lhsm(4,i,j)
                lhsm(4,i1,j) = lhsm(4,i1,j) -
     >                        lhsm(2,i1,j)*lhsm(5,i,j)
                rhs(i1,j,k,m) = rhs(i1,j,k,m) -
     >                        lhsm(2,i1,j)*rhs(i,j,k,m)
                lhsm(2,i2,j) = lhsm(2,i2,j) -
     >                        lhsm(1,i2,j)*lhsm(4,i,j)
                lhsm(3,i2,j) = lhsm(3,i2,j) -
     >                        lhsm(1,i2,j)*lhsm(5,i,j)
                rhs(i2,j,k,m) = rhs(i2,j,k,m) -
     >                        lhsm(1,i2,j)*rhs(i,j,k,m)
             end do
          end do

!---------------------------------------------------------------------
!         And again the last two rows separately
!---------------------------------------------------------------------
!$acc region do parallel
          do  j = 1, ny2
             i  = grid_points(1)-2
             i1 = grid_points(1)-1
             m = 4
             fac1       = 1.d0/lhsp(3,i,j)
             lhsp(4,i,j)  = fac1*lhsp(4,i,j)
             lhsp(5,i,j)  = fac1*lhsp(5,i,j)
             rhs(i,j,k,m) = fac1*rhs(i,j,k,m)
             lhsp(3,i1,j) = lhsp(3,i1,j) -
     >                      lhsp(2,i1,j)*lhsp(4,i,j)
             lhsp(4,i1,j) = lhsp(4,i1,j) -
     >                      lhsp(2,i1,j)*lhsp(5,i,j)
             rhs(i1,j,k,m) = rhs(i1,j,k,m) -
     >                      lhsp(2,i1,j)*rhs(i,j,k,m)
             m = 5
             fac1       = 1.d0/lhsm(3,i,j)
             lhsm(4,i,j)  = fac1*lhsm(4,i,j)
             lhsm(5,i,j)  = fac1*lhsm(5,i,j)
             rhs(i,j,k,m) = fac1*rhs(i,j,k,m)
             lhsm(3,i1,j) = lhsm(3,i1,j) -
     >                      lhsm(2,i1,j)*lhsm(4,i,j)
             lhsm(4,i1,j) = lhsm(4,i1,j) -
     >                      lhsm(2,i1,j)*lhsm(5,i,j)
             rhs(i1,j,k,m) = rhs(i1,j,k,m) -
     >                      lhsm(2,i1,j)*rhs(i,j,k,m)
!---------------------------------------------------------------------
!               Scale the last row immediately
!---------------------------------------------------------------------
             rhs(i1,j,k,4) = rhs(i1,j,k,4)/lhsp(3,i1,j)
             rhs(i1,j,k,5) = rhs(i1,j,k,5)/lhsm(3,i1,j)
          end do


!---------------------------------------------------------------------
!                         BACKSUBSTITUTION 
!---------------------------------------------------------------------
!$acc region do parallel
          do  j = 1, ny2
             i  = grid_points(1)-2
             i1 = grid_points(1)-1
             do   m = 1, 3
                rhs(i,j,k,m) = rhs(i,j,k,m) -
     >                             lhs(4,i,j)*rhs(i1,j,k,m)
             end do

             rhs(i,j,k,4) = rhs(i,j,k,4) -
     >                          lhsp(4,i,j)*rhs(i1,j,k,4)
             rhs(i,j,k,5) = rhs(i,j,k,5) -
     >                          lhsm(4,i,j)*rhs(i1,j,k,5)
          end do

!---------------------------------------------------------------------
!      The first three factors
!---------------------------------------------------------------------
!$acc region do parallel
          do  j = 1, ny2
             do    i = grid_points(1)-3, 0, -1
                i1 = i  + 1
                i2 = i  + 2
                do   m = 1, 3
                   rhs(i,j,k,m) = rhs(i,j,k,m) -
     >                          lhs(4,i,j)*rhs(i1,j,k,m) -
     >                          lhs(5,i,j)*rhs(i2,j,k,m)
                end do

!---------------------------------------------------------------------
!      And the remaining two
!---------------------------------------------------------------------
                rhs(i,j,k,4) = rhs(i,j,k,4) -
     >                          lhsp(4,i,j)*rhs(i1,j,k,4) -
     >                          lhsp(5,i,j)*rhs(i2,j,k,4)
                rhs(i,j,k,5) = rhs(i,j,k,5) -
     >                          lhsm(4,i,j)*rhs(i1,j,k,5) -
     >                          lhsm(5,i,j)*rhs(i2,j,k,5)
             end do
          end do
       end do
!$acc end data region
       return
       end subroutine


(Output from compile)
x_solve:
     54, Generating copyin(speed(:,:,:))
         Generating copyin(grid_points(:))
         Generating copyin(us(:,:,:))
         Generating copyin(rho_i(:,:,:))
         Generating copy(lhsp(:,:,:))
         Generating copy(lhsm(:,:,:))
     82, Generating copy(lhs(:,:,:))
         Generating compute capability 2.0 binary
     83, Loop is parallelizable
     84, Loop is parallelizable
         Accelerator kernel generated
         83, !$acc do parallel ! blockidx%y
         84, !$acc do parallel, vector(256) ! blockidx%x threadidx%x
             CC 2.0 : 19 registers; 8 shared, 184 constant, 0 local memory bytes; 100% occupancy
     93, Loop is parallelizable
         Accelerator kernel generated
         83, !$acc do parallel ! blockidx%y
             Non-stride-1 accesses for array 'lhs'
             Cached references to size [34] block of 'rhon'
             Cached references to size [34] block of 'cv'
         93, !$acc do parallel, vector(32) ! blockidx%x threadidx%x
             CC 2.0 : 19 registers; 552 shared, 96 constant, 0 local memory bytes; 16% occupancy
    104, Generating copy(lhs(2:5,1:2,1:ny2))
         Generating compute capability 2.0 binary
    105, Loop is parallelizable
         Accelerator kernel generated
        105, !$acc do parallel, vector(32) ! blockidx%x threadidx%x
             Non-stride-1 accesses for array 'lhs'
             CC 2.0 : 20 registers; 4 shared, 88 constant, 0 local memory bytes; 16% occupancy
    127, Generating copy(lhs(1:4,grid_points-3:grid_points-2,1:ny2))
         Generating compute capability 2.0 binary
    128, Loop is parallelizable
         Accelerator kernel generated
        128, !$acc do parallel, vector(32) ! blockidx%x threadidx%x
             Non-stride-1 accesses for array 'lhs'
             CC 2.0 : 15 registers; 4 shared, 96 constant, 0 local memory bytes; 16% occupancy
    144, Generating copyin(lhs(1:5,1:nx2,1:ny2))
         Generating compute capability 2.0 binary
    145, Loop is parallelizable
    146, Loop is parallelizable
         Accelerator kernel generated
        145, !$acc do parallel ! blockidx%y
             Non-stride-1 accesses for array 'lhs'
             Non-stride-1 accesses for array 'lhsp'
             Cached references to size [34] block of 'speed'
             Non-stride-1 accesses for array 'lhsm'
        146, !$acc do parallel, vector(32) ! blockidx%x threadidx%x
             CC 2.0 : 27 registers; 280 shared, 116 constant, 0 local memory bytes; 16% occupancy
    171, Generating copy(lhs(1:5,0:grid_points-1,1:ny2))
         Generating compute capability 2.0 binary
    172, Loop is parallelizable
    173, Loop carried dependence of 'lhs' prevents parallelization
         Loop carried backward dependence of 'lhs' prevents vectorization
         Loop carried dependence of 'rhs' prevents parallelization
         Loop carried backward dependence of 'rhs' prevents vectorization
         Accelerator kernel generated
        172, !$acc do parallel, vector(32) ! blockidx%x threadidx%x
        173, !$acc do seq
             Non-stride-1 accesses for array 'lhs'
             CC 2.0 : 23 registers; 4 shared, 120 constant, 0 local memory bytes; 16% occupancy
    179, Loop is parallelizable
    186, Loop is parallelizable
    194, Loop is parallelizable
    206, Generating copy(lhs(2:5,grid_points-2:grid_points-1,1:ny2))
         Generating copy(rhs(grid_points-2:grid_points-1,1:ny2,k,1:3))
         Generating compute capability 2.0 binary
    207, Loop is parallelizable
         Accelerator kernel generated
        207, !$acc do parallel, vector(32) ! blockidx%x threadidx%x
             Non-stride-1 accesses for array 'lhs'
             CC 2.0 : 20 registers; 4 shared, 96 constant, 0 local memory bytes; 16% occupancy
    213, Loop is parallelizable
    220, Loop is parallelizable
    228, Loop is parallelizable
    236, Generating copy(rhs(0:grid_points-1,1:ny2,k,4:5))
         Generating compute capability 2.0 binary
    237, Loop is parallelizable
    238, Loop carried dependence of 'lhsp' prevents parallelization
         Loop carried backward dependence of 'lhsp' prevents vectorization
         Loop carried dependence of 'rhs' prevents parallelization
         Loop carried backward dependence of 'rhs' prevents vectorization
         Loop carried dependence of 'lhsm' prevents parallelization
         Loop carried backward dependence of 'lhsm' prevents vectorization
         Inner sequential loop scheduled on accelerator
         Accelerator kernel generated
        237, !$acc do parallel, vector(32) ! blockidx%x threadidx%x
        238, !$acc do seq
             Non-stride-1 accesses for array 'lhsp'
             Non-stride-1 accesses for array 'rhs'
             Non-stride-1 accesses for array 'lhsm'
             CC 2.0 : 26 registers; 4 shared, 92 constant, 0 local memory bytes; 16% occupancy
    281, Generating copy(rhs(grid_points-2:grid_points-1,1:ny2,k,4:5))
         Generating compute capability 2.0 binary
    282, Loop is parallelizable
         Accelerator kernel generated
        282, !$acc do parallel, vector(32) ! blockidx%x threadidx%x
             Non-stride-1 accesses for array 'lhsp'
             Non-stride-1 accesses for array 'rhs'
             Non-stride-1 accesses for array 'lhsm'
             CC 2.0 : 23 registers; 4 shared, 104 constant, 0 local memory bytes; 16% occupancy
    318, Generating copyin(lhs(4,grid_points-2,1:ny2))
         Generating copyin(rhs(grid_points-2:grid_points-1,1:ny2,k,1:5))
         Generating copyout(rhs(grid_points-2,1:ny2,k,1:5))
         Generating compute capability 2.0 binary
    319, Loop is parallelizable
         Accelerator kernel generated
        319, !$acc do parallel, vector(32) ! blockidx%x threadidx%x
             Non-stride-1 accesses for array 'lhsp'
             Non-stride-1 accesses for array 'rhs'
             Non-stride-1 accesses for array 'lhsm'
             Non-stride-1 accesses for array 'lhs'
             CC 2.0 : 20 registers; 4 shared, 96 constant, 0 local memory bytes; 16% occupancy
    322, Loop is parallelizable
    336, Generating copy(rhs(0:grid_points-1,1:ny2,k,1:5))
         Generating copyin(lhs(4:5,0:grid_points-3,1:ny2))
         Generating compute capability 2.0 binary
    337, Loop is parallelizable
    338, Loop carried dependence of 'rhs' prevents parallelization
         Loop carried backward dependence of 'rhs' prevents vectorization
         Accelerator kernel generated
        337, !$acc do parallel, vector(32) ! blockidx%x threadidx%x
        338, !$acc do seq
             Non-stride-1 accesses for array 'lhsp'
             Non-stride-1 accesses for array 'rhs'
             Non-stride-1 accesses for array 'lhsm'
             Non-stride-1 accesses for array 'lhs'
             CC 2.0 : 26 registers; 4 shared, 96 constant, 0 local memory bytes; 16% occupancy
    341, Loop is parallelizable

Hi Morris,

The wrong answers are most likely due to how you are handling the data initialization. For example, you use a data region to copy the lhsm and lhsp variables, but then initialize them on the host, so the device copies never see those values. You'll need to either push the arrays to the device with an update directive after the host initialization, or move the initialization loop into an accelerator compute region.
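
For example, something along these lines (just a sketch, untested; it assumes the update directive is available in the accelerator model of your compiler release):

! Option 1: initialize on the host as you do now, then push the new
! values out to the device copies before the first compute region
! that reads them:
       do j = 1, ny2
          do m = 1, 5
             lhsp(m,0,j) = 0.0d0
             lhsm(m,0,j) = 0.0d0
          end do
!         ... remaining boundary terms as in your loop ...
       end do
!$acc update device(lhsp, lhsm)

! Option 2: run the same loop on the device inside a compute region,
! so the host copies never need to be transferred at all:
!$acc region do parallel
       do j = 1, ny2
          do m = 1, 5
             lhsp(m,0,j) = 0.0d0
             lhsm(m,0,j) = 0.0d0
          end do
!         ... remaining boundary terms as in your loop ...
       end do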

Also, you put rhs into a private clause. Private means that every thread gets its own complete copy of the array. In other words, each thread carries a distinct copy of the whole rhs array, which wastes a lot of memory. Worse, the lifetime of a private array ends at the end of the compute region, so when rhs is used again later, none of the values set inside the region are kept.

Fixing these two issues will get you closer to correct answers. Performance, though, will still be a challenge for several reasons. First, several loops carry dependences in their inner loops that prevent parallelization, so there simply isn't much parallelism to exploit. Also, note the low occupancy (~16%). Since the register usage is reasonable, the low occupancy is most likely due to too few thread blocks being available. Finally, several arrays have non-contiguous accesses (i.e., non-stride-1).

Fixing the performance issues will require rewriting the code to expose more parallelism and rearranging the data structures. I'm not sure whether that is allowed by NAS, and even if it is, whether the algorithm can be changed.

Hope this helps,
Mat

I have made the modifications you mentioned: rhs is no longer marked private, and lhs, lhsm and lhsp are copyout only. Here is the complete ‘data’ region:

!$acc data region copyin(rho_i, us, grid_points, speed)
!$acc&  copyout(lhs, lhsm, lhsp)
!$acc&  copy(rhs)

This still produces incorrect results and is also much slower than the CPU version. For our purposes we are OK with modifying the SP code or algorithm. Can you please give me some ideas for this problem?
I would even consider reworking this into CUDA Fortran if necessary.

Hi Morris,

I took a closer look at the code and think I see a better way. There simply isn't enough parallelism in the inner loops to get a speed-up on a GPU. Also, since you're iterating over the inner loops many times, you end up with a high overhead from the sheer number of kernel launches.

Instead, we need to parallelize the outer "k" loop as well as the "j" loops (see below). Granted, we're still not there: I see this version running about 2X slower on the GPU (~45 seconds) than a 4-thread OpenMP run (~25 seconds) using Class A.

My next step would be to look at reorganizing the "lhs" arrays so that "j" is the first dimension. Generally, you want the first dimension to correspond to the vector index, in this case "j", so that consecutive threads get contiguous (stride-1) accesses to memory.
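
Roughly what I mean, sketched from the declarations in your x_solve (the real arrays live in header.h, and every reference to them would of course need the same index reordering):

! current layout: the small "m" index comes first, so two threads
! working on adjacent "j" values are far apart in memory
       double precision lhs(5,0:IMAXP,0:IMAXP)
       lhs(3,i,j) = lhs(3,i,j) + comz5

! reordered layout: "j" comes first, so adjacent threads touch
! adjacent (stride-1) memory locations
       double precision lhs(0:IMAXP,5,0:IMAXP)
       lhs(j,3,i) = lhs(j,3,i) + comz5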

Do you want to try doing this? I won't be able to get to it for a few days.

  • Mat
c---------------------------------------------------------------------
c---------------------------------------------------------------------

       subroutine x_solve

c---------------------------------------------------------------------
c---------------------------------------------------------------------

c---------------------------------------------------------------------
c this function performs the solution of the approximate factorization
c step in the x-direction for all five matrix components
c simultaneously. The Thomas algorithm is employed to solve the
c systems for the x-lines. Boundary conditions are non-periodic
c---------------------------------------------------------------------

       include 'header.h'

       integer i, j, k, ii, i1, i2, m, ni, nj
       double precision  ru1, fac1, fac2


c---------------------------------------------------------------------
c---------------------------------------------------------------------

       if (timeron) call timer_start(t_xsolve)

!$omp parallel do default(shared) private(i,j,k,i1,i2,m,
!$omp&    ru1,fac1,fac2)

       ni = nx2 + 1
       nj = ny2

!$acc region do parallel, private(lhs,lhsm,lhsp) 
!$acc&  copyin(speed,grid_points,us,rho_i)
!$acc&  copy(rhs)                 
       do  k = 1, nz2

!          call lhsinit(nx2+1, ny2)
!$acc do parallel, vector(32), kernel	
       do j = 1, nj
          do   m = 1, 5
             lhs (m,0,j) = 0.0d0
             lhsp(m,0,j) = 0.0d0
             lhsm(m,0,j) = 0.0d0
             lhs (m,ni,j) = 0.0d0
             lhsp(m,ni,j) = 0.0d0
             lhsm(m,ni,j) = 0.0d0
          end do
          lhs (3,0,j) = 1.0d0
          lhsp(3,0,j) = 1.0d0
          lhsm(3,0,j) = 1.0d0
          lhs (3,ni,j) = 1.0d0
          lhsp(3,ni,j) = 1.0d0
          lhsm(3,ni,j) = 1.0d0
       end do

c---------------------------------------------------------------------
c Computes the left hand side for the three x-factors  
c---------------------------------------------------------------------

c---------------------------------------------------------------------
c      first fill the lhs for the u-eigenvalue                   
c---------------------------------------------------------------------
!$acc do parallel, vector(32), kernel private(cv,rhon)
          do  j = 1, ny2
             do  i = 0, grid_points(1)-1
                ru1 = c3c4*rho_i(i,j,k)
                cv(i) = us(i,j,k)
                rhon(i) = dmax1(dx2+con43*ru1, 
     >                          dx5+c1c5*ru1,
     >                          dxmax+ru1,
     >                          dx1)
             end do

             do  i = 1, nx2
                lhs(1,i,j) =   0.0d0
                lhs(2,i,j) = - dttx2 * cv(i-1) - dttx1 * rhon(i-1)
                lhs(3,i,j) =   1.0d0 + c2dttx1 * rhon(i)
                lhs(4,i,j) =   dttx2 * cv(i+1) - dttx1 * rhon(i+1)
                lhs(5,i,j) =   0.0d0
             end do
          end do

c---------------------------------------------------------------------
c      add fourth order dissipation                             
c---------------------------------------------------------------------

          do  j = 1, ny2
             i = 1
             lhs(3,i,j) = lhs(3,i,j) + comz5
             lhs(4,i,j) = lhs(4,i,j) - comz4
             lhs(5,i,j) = lhs(5,i,j) + comz1
  
             lhs(2,i+1,j) = lhs(2,i+1,j) - comz4
             lhs(3,i+1,j) = lhs(3,i+1,j) + comz6
             lhs(4,i+1,j) = lhs(4,i+1,j) - comz4
             lhs(5,i+1,j) = lhs(5,i+1,j) + comz1
          end do

          do  j = 1, ny2
             do   i=3, grid_points(1)-4
                lhs(1,i,j) = lhs(1,i,j) + comz1
                lhs(2,i,j) = lhs(2,i,j) - comz4
                lhs(3,i,j) = lhs(3,i,j) + comz6
                lhs(4,i,j) = lhs(4,i,j) - comz4
                lhs(5,i,j) = lhs(5,i,j) + comz1
             end do
          end do

          do  j = 1, ny2
             i = grid_points(1)-3
             lhs(1,i,j) = lhs(1,i,j) + comz1
             lhs(2,i,j) = lhs(2,i,j) - comz4
             lhs(3,i,j) = lhs(3,i,j) + comz6
             lhs(4,i,j) = lhs(4,i,j) - comz4

             lhs(1,i+1,j) = lhs(1,i+1,j) + comz1
             lhs(2,i+1,j) = lhs(2,i+1,j) - comz4
             lhs(3,i+1,j) = lhs(3,i+1,j) + comz5
          end do

c---------------------------------------------------------------------
c      subsequently, fill the other factors (u+c), (u-c) by adding to 
c      the first  
c---------------------------------------------------------------------
          do  j = 1, ny2
             do   i = 1, nx2
                lhsp(1,i,j) = lhs(1,i,j)
                lhsp(2,i,j) = lhs(2,i,j) - 
     >                            dttx2 * speed(i-1,j,k)
                lhsp(3,i,j) = lhs(3,i,j)
                lhsp(4,i,j) = lhs(4,i,j) + 
     >                            dttx2 * speed(i+1,j,k)
                lhsp(5,i,j) = lhs(5,i,j)
                lhsm(1,i,j) = lhs(1,i,j)
                lhsm(2,i,j) = lhs(2,i,j) + 
     >                            dttx2 * speed(i-1,j,k)
                lhsm(3,i,j) = lhs(3,i,j)
                lhsm(4,i,j) = lhs(4,i,j) - 
     >                            dttx2 * speed(i+1,j,k)
                lhsm(5,i,j) = lhs(5,i,j)
             end do
          end do

c---------------------------------------------------------------------
c                          FORWARD ELIMINATION  
c---------------------------------------------------------------------

c---------------------------------------------------------------------
c      perform the Thomas algorithm; first, FORWARD ELIMINATION     
c---------------------------------------------------------------------

!$acc do parallel, vector(32), kernel
          do  j = 1, ny2
             do    i = 0, grid_points(1)-3
                i1 = i  + 1
                i2 = i  + 2
                fac1      = 1.d0/lhs(3,i,j)
                lhs(4,i,j)  = fac1*lhs(4,i,j)
                lhs(5,i,j)  = fac1*lhs(5,i,j)
                do    m = 1, 3
                   rhs(m,i,j,k) = fac1*rhs(m,i,j,k)
                end do
                lhs(3,i1,j) = lhs(3,i1,j) -
     >                         lhs(2,i1,j)*lhs(4,i,j)
                lhs(4,i1,j) = lhs(4,i1,j) -
     >                         lhs(2,i1,j)*lhs(5,i,j)
                do    m = 1, 3
                   rhs(m,i1,j,k) = rhs(m,i1,j,k) -
     >                         lhs(2,i1,j)*rhs(m,i,j,k)
                end do
                lhs(2,i2,j) = lhs(2,i2,j) -
     >                         lhs(1,i2,j)*lhs(4,i,j)
                lhs(3,i2,j) = lhs(3,i2,j) -
     >                         lhs(1,i2,j)*lhs(5,i,j)
                do    m = 1, 3
                   rhs(m,i2,j,k) = rhs(m,i2,j,k) -
     >                         lhs(1,i2,j)*rhs(m,i,j,k)
                end do
             end do
          end do

c---------------------------------------------------------------------
c      The last two rows in this grid block are a bit different, 
c      since they do not have two more rows available for the
c      elimination of off-diagonal entries
c---------------------------------------------------------------------

c        i =0
c        i1 =0
!$acc do parallel, vector(32), kernel
          do  j = 1, ny2
             i  = grid_points(1)-2
             i1 = grid_points(1)-1
             fac1      = 1.d0/lhs(3,i,j)
             lhs(4,i,j)  = fac1*lhs(4,i,j)
             lhs(5,i,j)  = fac1*lhs(5,i,j)
             do    m = 1, 3
                rhs(m,i,j,k) = fac1*rhs(m,i,j,k)
             end do
             lhs(3,i1,j) = lhs(3,i1,j) -
     >                      lhs(2,i1,j)*lhs(4,i,j)
             lhs(4,i1,j) = lhs(4,i1,j) -
     >                      lhs(2,i1,j)*lhs(5,i,j)
             do    m = 1, 3
                rhs(m,i1,j,k) = rhs(m,i1,j,k) -
     >                      lhs(2,i1,j)*rhs(m,i,j,k)
             end do
c---------------------------------------------------------------------
c            scale the last row immediately 
c---------------------------------------------------------------------
             fac2             = 1.d0/lhs(3,i1,j)
             do    m = 1, 3
                rhs(m,i1,j,k) = fac2*rhs(m,i1,j,k)
             end do
          end do
c---------------------------------------------------------------------
c      do the u+c and the u-c factors                 
c---------------------------------------------------------------------
c        i =0
c	i1=0
c	i2=0
!$acc do parallel, vector(32), kernel
          do  j = 1, ny2
            do    i = 0, grid_points(1)-3
                i1 = i  + 1
                i2 = i  + 2
                m = 4
                fac1       = 1.d0/lhsp(3,i,j)
                lhsp(4,i,j)  = fac1*lhsp(4,i,j)
                lhsp(5,i,j)  = fac1*lhsp(5,i,j)
                rhs(m,i,j,k) = fac1*rhs(m,i,j,k)
                lhsp(3,i1,j) = lhsp(3,i1,j) -
     >                        lhsp(2,i1,j)*lhsp(4,i,j)
                lhsp(4,i1,j) = lhsp(4,i1,j) -
     >                        lhsp(2,i1,j)*lhsp(5,i,j)
                rhs(m,i1,j,k) = rhs(m,i1,j,k) -
     >                        lhsp(2,i1,j)*rhs(m,i,j,k)
                lhsp(2,i2,j) = lhsp(2,i2,j) -
     >                        lhsp(1,i2,j)*lhsp(4,i,j)
                lhsp(3,i2,j) = lhsp(3,i2,j) -
     >                        lhsp(1,i2,j)*lhsp(5,i,j)
                rhs(m,i2,j,k) = rhs(m,i2,j,k) -
     >                        lhsp(1,i2,j)*rhs(m,i,j,k)
                m = 5
                fac1       = 1.d0/lhsm(3,i,j)
                lhsm(4,i,j)  = fac1*lhsm(4,i,j)
                lhsm(5,i,j)  = fac1*lhsm(5,i,j)
                rhs(m,i,j,k) = fac1*rhs(m,i,j,k)
                lhsm(3,i1,j) = lhsm(3,i1,j) -
     >                        lhsm(2,i1,j)*lhsm(4,i,j)
                lhsm(4,i1,j) = lhsm(4,i1,j) -
     >                        lhsm(2,i1,j)*lhsm(5,i,j)
                rhs(m,i1,j,k) = rhs(m,i1,j,k) -
     >                        lhsm(2,i1,j)*rhs(m,i,j,k)
                lhsm(2,i2,j) = lhsm(2,i2,j) -
     >                        lhsm(1,i2,j)*lhsm(4,i,j)
                lhsm(3,i2,j) = lhsm(3,i2,j) -
     >                        lhsm(1,i2,j)*lhsm(5,i,j)
                rhs(m,i2,j,k) = rhs(m,i2,j,k) -
     >                        lhsm(1,i2,j)*rhs(m,i,j,k)
             end do
          end do

c---------------------------------------------------------------------
c         And again the last two rows separately
c---------------------------------------------------------------------
          do  j = 1, ny2
             i  = grid_points(1)-2
             i1 = grid_points(1)-1
             m = 4
             fac1       = 1.d0/lhsp(3,i,j)
             lhsp(4,i,j)  = fac1*lhsp(4,i,j)
             lhsp(5,i,j)  = fac1*lhsp(5,i,j)
             rhs(m,i,j,k) = fac1*rhs(m,i,j,k)
             lhsp(3,i1,j) = lhsp(3,i1,j) -
     >                      lhsp(2,i1,j)*lhsp(4,i,j)
             lhsp(4,i1,j) = lhsp(4,i1,j) -
     >                      lhsp(2,i1,j)*lhsp(5,i,j)
             rhs(m,i1,j,k) = rhs(m,i1,j,k) -
     >                      lhsp(2,i1,j)*rhs(m,i,j,k)
             m = 5
             fac1       = 1.d0/lhsm(3,i,j)
             lhsm(4,i,j)  = fac1*lhsm(4,i,j)
             lhsm(5,i,j)  = fac1*lhsm(5,i,j)
             rhs(m,i,j,k) = fac1*rhs(m,i,j,k)
             lhsm(3,i1,j) = lhsm(3,i1,j) -
     >                      lhsm(2,i1,j)*lhsm(4,i,j)
             lhsm(4,i1,j) = lhsm(4,i1,j) -
     >                      lhsm(2,i1,j)*lhsm(5,i,j)
             rhs(m,i1,j,k) = rhs(m,i1,j,k) -
     >                      lhsm(2,i1,j)*rhs(m,i,j,k)
c---------------------------------------------------------------------
c               Scale the last row immediately
c---------------------------------------------------------------------
             rhs(4,i1,j,k) = rhs(4,i1,j,k)/lhsp(3,i1,j)
             rhs(5,i1,j,k) = rhs(5,i1,j,k)/lhsm(3,i1,j)
          end do


c---------------------------------------------------------------------
c                         BACKSUBSTITUTION 
c---------------------------------------------------------------------

          do  j = 1, ny2
             i  = grid_points(1)-2
             i1 = grid_points(1)-1
             do   m = 1, 3
                rhs(m,i,j,k) = rhs(m,i,j,k) -
     >                             lhs(4,i,j)*rhs(m,i1,j,k)
             end do

             rhs(4,i,j,k) = rhs(4,i,j,k) -
     >                          lhsp(4,i,j)*rhs(4,i1,j,k)
             rhs(5,i,j,k) = rhs(5,i,j,k) -
     >                          lhsm(4,i,j)*rhs(5,i1,j,k)
          end do

c---------------------------------------------------------------------
c      The first three factors
c---------------------------------------------------------------------
!$acc do parallel, vector(32), kernel
          do  j = 1, ny2
             do    i = grid_points(1)-3, 0, -1
                i1 = i  + 1
                i2 = i  + 2
                do   m = 1, 3
                   rhs(m,i,j,k) = rhs(m,i,j,k) - 
     >                          lhs(4,i,j)*rhs(m,i1,j,k) -
     >                          lhs(5,i,j)*rhs(m,i2,j,k)
                end do

c---------------------------------------------------------------------
c      And the remaining two
c---------------------------------------------------------------------
                rhs(4,i,j,k) = rhs(4,i,j,k) - 
     >                          lhsp(4,i,j)*rhs(4,i1,j,k) -
     >                          lhsp(5,i,j)*rhs(4,i2,j,k)
                rhs(5,i,j,k) = rhs(5,i,j,k) - 
     >                          lhsm(4,i,j)*rhs(5,i1,j,k) -
     >                          lhsm(5,i,j)*rhs(5,i2,j,k)
             end do
          end do

       end do
       if (timeron) call timer_stop(t_xsolve)

c---------------------------------------------------------------------
c      Do the block-diagonal inversion          
c---------------------------------------------------------------------
       call ninvr

       return
       end

I will try the code you posted.
I have already converted compute_rhs, txinvr and a few other subroutines in SP to work with the accelerator directives. I see execution improvements ranging from 5X down to 2X, measured on kernel execution time only, exclusive of data transfer. If I can get the solvers onto the GPU, then I expect I can move the data once before the sequence of calls (in the adi module) and once after, and so remove the data-transfer overhead from these other routines.
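
Roughly what I have in mind for adi is the sketch below. The array names are just the ones I remember from the common blocks, and I assume the called routines would also need matching directives (reflected/mirror) so the compiler reuses these device copies instead of copying again; the exact copyin/copy lists would still have to be worked out from what each routine reads and writes.

       subroutine adi
!      keep the main arrays resident on the GPU across the whole
!      solver sequence, so the individual routines only pay for
!      kernel launches, not for host<->device transfers
!$acc data region copy(u, rhs)
!$acc&  copyin(rho_i, us, vs, ws, qs, speed, square)
       call compute_rhs
       call txinvr
       call x_solve
       call y_solve
       call z_solve
       call add
!$acc end data region
       return
       end
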
If it would help, I can send you the project as it stands now. I see there is no provision for attachments here, so I would need an email address for you or some other means.

I implemented the changes from your last post and this is much better; large parts of x_solve now execute on the GPU and the results pass the validation step.

I am sending you an archive of my project so that we can work on the same code. I have modified the dimension order of most of the common arrays to put 'j' or 'm' in the last position to provide more sequential access to memory, and I have also moved things into modules, etc. There is a build script in the root directory of the project (named "build"), and I have parameterized the makefile (in the "config" directory) to allow easier selection of different compilers and other options.

I will post again when I have sent the archive to trs/@/pgroup.com.

Your email server is rejecting emails with ZIP file attachments, even if the ZIP file is renamed.

I will instead email you a link where the archive file may be downloaded.

Thanks Morris. I got your link from PGI Customer Support and will look over the code in a bit.

  • Mat