We are trying to accelerate the NPB-3.3 version of the SP benchmark as a precursor to doing the same to other similar (and longer) codes. I have had success with about half of the code. A good portion of the remainder looks like the subroutine below. When I compile it with -ta=nvidia,cc20,cuda4.0 I receive these messages in the compiler output:
171, Generating copy(lhs(1:5,0:grid_points-1,1:ny2))
Generating compute capability 2.0 binary
172, Loop is parallelizable
173, Loop carried dependence of 'lhs' prevents parallelization
Loop carried backward dependence of 'lhs' prevents vectorization
Loop carried dependence of 'rhs' prevents parallelization
Loop carried backward dependence of 'rhs' prevents vectorization
Accelerator kernel generated
172, !$acc do parallel, vector(32) ! blockidx%x threadidx%x
173, !$acc do seq
Non-stride-1 accesses for array 'lhs'
and
236, Generating copy(rhs(0:grid_points-1,1:ny2,k,4:5))
Generating compute capability 2.0 binary
237, Loop is parallelizable
238, Loop carried dependence of 'lhsp' prevents parallelization
Loop carried backward dependence of 'lhsp' prevents vectorization
Loop carried dependence of 'rhs' prevents parallelization
Loop carried backward dependence of 'rhs' prevents vectorization
Loop carried dependence of 'lhsm' prevents parallelization
Loop carried backward dependence of 'lhsm' prevents vectorization
Inner sequential loop scheduled on accelerator
Accelerator kernel generated
237, !$acc do parallel, vector(32) ! blockidx%x threadidx%x
238, !$acc do seq
Non-stride-1 accesses for array 'lhsp'
Non-stride-1 accesses for array 'rhs'
Non-stride-1 accesses for array 'lhsm'
CC 2.0 : 26 registers; 4 shared, 92 constant, 0 local memory bytes; 16% occupancy
The code runs very slowly; pgprof claims the max accelerator region time is 876 sec. (which cannot possibly be correct, since the entire program runs its 400 iterations in about 450 sec.). The corresponding max kernel time is 29 sec.
In addition, the results of this function are not correct: the overall validation of the program output fails.
As you can see, this version was already parallelized with OpenMP. In other cases the OpenMP directives furnished useful clues for writing the PGI directives, but in this case I don't know how, for instance, to implement the threadprivate attribute for the 'cv' array.
I have tried several variations of the directives, including designating the entire function as one region. I have also been able to designate just the first two loops inside the initial 'k' loop for acceleration, which gives correct output but again very slow execution. I don't know how to interpret the compiler output, and apparently I do not understand the profiler data either. Any assistance is appreciated.
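For concreteness, here is a minimal sketch of what I have been trying for the loop that fills 'cv'; the private(cv, rhon) clause is my guess at the accelerator-side analogue of threadprivate, so that each parallel 'j' iteration gets its own copy of the two scratch arrays (I am assuming 'private' is the right clause for this):

!     this sits inside the host-side k loop, as in the full listing below.
!     under OpenMP, cv and rhon were threadprivate and the k loop was the
!     parallel one; here the j loop is parallel and each iteration gets
!     its own private copies of the scratch arrays
!$acc region do parallel private(cv, rhon) copy(lhs)
      do j = 1, ny2
         do i = 0, grid_points(1)-1
            ru1     = c3c4*rho_i(i,j,k)
            cv(i)   = us(i,j,k)
            rhon(i) = dmax1(dx2+con43*ru1, dx5+c1c5*ru1,
     >                      dxmax+ru1, dx1)
         end do
         do i = 1, nx2
            lhs(1,i,j) = 0.0d0
            lhs(2,i,j) = - dttx2*cv(i-1) - dttx1*rhon(i-1)
            lhs(3,i,j) = 1.0d0 + c2dttx1*rhon(i)
            lhs(4,i,j) =   dttx2*cv(i+1) - dttx1*rhon(i+1)
            lhs(5,i,j) = 0.0d0
         end do
      end do

Whether this really reproduces the threadprivate behaviour is exactly the part I am unsure about. The full subroutine follows: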
subroutine x_solve
#undef USE_OMP
!---------------------------------------------------------------------
!---------------------------------------------------------------------
!---------------------------------------------------------------------
! this function performs the solution of the approximate factorization
! step in the x-direction for all five matrix components
! simultaneously. The Thomas algorithm is employed to solve the
! systems for the x-lines. Boundary conditions are non-periodic
!---------------------------------------------------------------------
implicit none
integer problem_size
parameter (problem_size=36)
integer IMAX, JMAX, KMAX, IMAXP, JMAXP
parameter (IMAX=problem_size,JMAX=problem_size,KMAX=problem_size)
parameter (IMAXP=IMAX/2*2,JMAXP=JMAX/2*2)
! externally defined constants, initialized elsewhere
double precision c3c4, c1c5, c2dttx1, con43,
> comz1, comz4, comz5, comz6,
> dttx1, dttx2, dx1, dx2, dx5, dxmax
! included here for completeness; this _is_ allocated before entry
double precision, dimension (:,:,:,:), allocatable::rhs
double precision, dimension ( :,:,:), allocatable::rho_i
double precision, dimension ( :,:,:), allocatable::us
double precision, dimension ( :,:,:), allocatable::speed
double precision, target :: rhon(0:problem_size-1)
double precision
> lhs (5,0:IMAXP,0:IMAXP),
> lhsp(5,0:IMAXP,0:IMAXP),
> lhsm(5,0:IMAXP,0:IMAXP)
integer grid_points(3), nx2, ny2, nz2
integer i, j, k, i1, i2, m, ni
double precision ru1, fac1, fac2
double precision,save :: cv(0:problem_size-1)
#if USE_OMP
!$omp threadprivate(cv)
#endif
!---------------------------------------------------------------------
!---------------------------------------------------------------------
#if USE_OMP
!$omp parallel do default(shared) private(i,j,k,i1,i2,m,
!$omp& ru1,fac1,fac2)
#endif
!$acc data region copyin(rho_i, us, grid_points, speed)
!$acc& copy(lhsm, lhsp)
ni = nx2 + 1
do k = 1, nz2
do j = 1, ny2
do m = 1, 5
lhs (m,0,j) = 0.0d0
lhsp(m,0,j) = 0.0d0
lhsm(m,0,j) = 0.0d0
lhs (m,ni,j) = 0.0d0
lhsp(m,ni,j) = 0.0d0
lhsm(m,ni,j) = 0.0d0
end do
lhs (3,0,j) = 1.0d0
lhsp(3,0,j) = 1.0d0
lhsm(3,0,j) = 1.0d0
lhs (3,ni,j) = 1.0d0
lhsp(3,ni,j) = 1.0d0
lhsm(3,ni,j) = 1.0d0
end do
!---------------------------------------------------------------------
! Computes the left hand side for the three x-factors
!---------------------------------------------------------------------
!---------------------------------------------------------------------
! first fill the lhs for the u-eigenvalue
!---------------------------------------------------------------------
!$acc region do parallel private(cv, rhon) copy(lhs)
do j = 1, ny2
do i = 0, grid_points(1)-1
ru1 = c3c4*rho_i(i,j,k)
cv(i) = us(i,j,k)
rhon(i) = dmax1(dx2+con43*ru1,
> dx5+c1c5*ru1,
> dxmax+ru1,
> dx1)
end do
do i = 1, nx2
lhs(1,i,j) = 0.0d0
lhs(2,i,j) = - dttx2 * cv(i-1) - dttx1 * rhon(i-1)
lhs(3,i,j) = 1.0d0 + c2dttx1 * rhon(i)
lhs(4,i,j) = dttx2 * cv(i+1) - dttx1 * rhon(i+1)
lhs(5,i,j) = 0.0d0
end do
end do
!---------------------------------------------------------------------
! add fourth order dissipation
!---------------------------------------------------------------------
!$acc region do parallel
do j = 1, ny2
i = 1
lhs(3,i,j) = lhs(3,i,j) + comz5
lhs(4,i,j) = lhs(4,i,j) - comz4
lhs(5,i,j) = lhs(5,i,j) + comz1
lhs(2,i+1,j) = lhs(2,i+1,j) - comz4
lhs(3,i+1,j) = lhs(3,i+1,j) + comz6
lhs(4,i+1,j) = lhs(4,i+1,j) - comz4
lhs(5,i+1,j) = lhs(5,i+1,j) + comz1
end do
do j = 1, ny2
do i=3, grid_points(1)-4
lhs(1,i,j) = lhs(1,i,j) + comz1
lhs(2,i,j) = lhs(2,i,j) - comz4
lhs(3,i,j) = lhs(3,i,j) + comz6
lhs(4,i,j) = lhs(4,i,j) - comz4
lhs(5,i,j) = lhs(5,i,j) + comz1
end do
end do
!$acc region do parallel
do j = 1, ny2
i = grid_points(1)-3
lhs(1,i,j) = lhs(1,i,j) + comz1
lhs(2,i,j) = lhs(2,i,j) - comz4
lhs(3,i,j) = lhs(3,i,j) + comz6
lhs(4,i,j) = lhs(4,i,j) - comz4
lhs(1,i+1,j) = lhs(1,i+1,j) + comz1
lhs(2,i+1,j) = lhs(2,i+1,j) - comz4
lhs(3,i+1,j) = lhs(3,i+1,j) + comz5
end do
!---------------------------------------------------------------------
! subsequently, fill the other factors (u+c), (u-c) by adding to
! the first
!---------------------------------------------------------------------
!$acc region do parallel
do j = 1, ny2
do i = 1, nx2
lhsp(1,i,j) = lhs(1,i,j)
lhsp(2,i,j) = lhs(2,i,j) -
> dttx2 * speed(i-1,j,k)
lhsp(3,i,j) = lhs(3,i,j)
lhsp(4,i,j) = lhs(4,i,j) +
> dttx2 * speed(i+1,j,k)
lhsp(5,i,j) = lhs(5,i,j)
lhsm(1,i,j) = lhs(1,i,j)
lhsm(2,i,j) = lhs(2,i,j) +
> dttx2 * speed(i-1,j,k)
lhsm(3,i,j) = lhs(3,i,j)
lhsm(4,i,j) = lhs(4,i,j) -
> dttx2 * speed(i+1,j,k)
lhsm(5,i,j) = lhs(5,i,j)
end do
end do
!---------------------------------------------------------------------
! FORWARD ELIMINATION
!---------------------------------------------------------------------
!---------------------------------------------------------------------
! perform the Thomas algorithm; first, FORWARD ELIMINATION
!---------------------------------------------------------------------
!$acc region do parallel private(rhs)
do j = 1, ny2
do i = 0, grid_points(1)-3
i1 = i + 1
i2 = i + 2
fac1 = 1.d0/lhs(3,i,j)
lhs(4,i,j) = fac1*lhs(4,i,j)
lhs(5,i,j) = fac1*lhs(5,i,j)
do m = 1, 3
rhs(i,j,k,m) = fac1*rhs(i,j,k,m)
end do
lhs(3,i1,j) = lhs(3,i1,j) -
> lhs(2,i1,j)*lhs(4,i,j)
lhs(4,i1,j) = lhs(4,i1,j) -
> lhs(2,i1,j)*lhs(5,i,j)
do m = 1, 3
rhs(i1,j,k,m) = rhs(i1,j,k,m) -
> lhs(2,i1,j)*rhs(i,j,k,m)
end do
lhs(2,i2,j) = lhs(2,i2,j) -
> lhs(1,i2,j)*lhs(4,i,j)
lhs(3,i2,j) = lhs(3,i2,j) -
> lhs(1,i2,j)*lhs(5,i,j)
do m = 1, 3
rhs(i2,j,k,m) = rhs(i2,j,k,m) -
> lhs(1,i2,j)*rhs(i,j,k,m)
end do
end do
end do
!---------------------------------------------------------------------
! The last two rows in this grid block are a bit different,
! since they do not have two more rows available for the
! elimination of off-diagonal entries
!---------------------------------------------------------------------
!$acc region do parallel
do j = 1, ny2
i = grid_points(1)-2
i1 = grid_points(1)-1
fac1 = 1.d0/lhs(3,i,j)
lhs(4,i,j) = fac1*lhs(4,i,j)
lhs(5,i,j) = fac1*lhs(5,i,j)
do m = 1, 3
rhs(i,j,k,m) = fac1*rhs(i,j,k,m)
end do
lhs(3,i1,j) = lhs(3,i1,j) -
> lhs(2,i1,j)*lhs(4,i,j)
lhs(4,i1,j) = lhs(4,i1,j) -
> lhs(2,i1,j)*lhs(5,i,j)
do m = 1, 3
rhs(i1,j,k,m) = rhs(i1,j,k,m) -
> lhs(2,i1,j)*rhs(i,j,k,m)
end do
!---------------------------------------------------------------------
! scale the last row immediately
!---------------------------------------------------------------------
fac2 = 1.d0/lhs(3,i1,j)
do m = 1, 3
rhs(i1,j,k,m) = fac2*rhs(i1,j,k,m)
end do
end do
!---------------------------------------------------------------------
! do the u+c and the u-c factors
!---------------------------------------------------------------------
!$acc region do parallel
do j = 1, ny2
do i = 0, grid_points(1)-3
i1 = i + 1
i2 = i + 2
m = 4
fac1 = 1.d0/lhsp(3,i,j)
lhsp(4,i,j) = fac1*lhsp(4,i,j)
lhsp(5,i,j) = fac1*lhsp(5,i,j)
rhs(i,j,k,m) = fac1*rhs(i,j,k,m)
lhsp(3,i1,j) = lhsp(3,i1,j) -
> lhsp(2,i1,j)*lhsp(4,i,j)
lhsp(4,i1,j) = lhsp(4,i1,j) -
> lhsp(2,i1,j)*lhsp(5,i,j)
rhs(i1,j,k,m) = rhs(i1,j,k,m) -
> lhsp(2,i1,j)*rhs(i,j,k,m)
lhsp(2,i2,j) = lhsp(2,i2,j) -
> lhsp(1,i2,j)*lhsp(4,i,j)
lhsp(3,i2,j) = lhsp(3,i2,j) -
> lhsp(1,i2,j)*lhsp(5,i,j)
rhs(i2,j,k,m) = rhs(i2,j,k,m) -
> lhsp(1,i2,j)*rhs(i,j,k,m)
m = 5
fac1 = 1.d0/lhsm(3,i,j)
lhsm(4,i,j) = fac1*lhsm(4,i,j)
lhsm(5,i,j) = fac1*lhsm(5,i,j)
rhs(i,j,k,m) = fac1*rhs(i,j,k,m)
lhsm(3,i1,j) = lhsm(3,i1,j) -
> lhsm(2,i1,j)*lhsm(4,i,j)
lhsm(4,i1,j) = lhsm(4,i1,j) -
> lhsm(2,i1,j)*lhsm(5,i,j)
rhs(i1,j,k,m) = rhs(i1,j,k,m) -
> lhsm(2,i1,j)*rhs(i,j,k,m)
lhsm(2,i2,j) = lhsm(2,i2,j) -
> lhsm(1,i2,j)*lhsm(4,i,j)
lhsm(3,i2,j) = lhsm(3,i2,j) -
> lhsm(1,i2,j)*lhsm(5,i,j)
rhs(i2,j,k,m) = rhs(i2,j,k,m) -
> lhsm(1,i2,j)*rhs(i,j,k,m)
end do
end do
!---------------------------------------------------------------------
! And again the last two rows separately
!---------------------------------------------------------------------
!$acc region do parallel
do j = 1, ny2
i = grid_points(1)-2
i1 = grid_points(1)-1
m = 4
fac1 = 1.d0/lhsp(3,i,j)
lhsp(4,i,j) = fac1*lhsp(4,i,j)
lhsp(5,i,j) = fac1*lhsp(5,i,j)
rhs(i,j,k,m) = fac1*rhs(i,j,k,m)
lhsp(3,i1,j) = lhsp(3,i1,j) -
> lhsp(2,i1,j)*lhsp(4,i,j)
lhsp(4,i1,j) = lhsp(4,i1,j) -
> lhsp(2,i1,j)*lhsp(5,i,j)
rhs(i1,j,k,m) = rhs(i1,j,k,m) -
> lhsp(2,i1,j)*rhs(i,j,k,m)
m = 5
fac1 = 1.d0/lhsm(3,i,j)
lhsm(4,i,j) = fac1*lhsm(4,i,j)
lhsm(5,i,j) = fac1*lhsm(5,i,j)
rhs(i,j,k,m) = fac1*rhs(i,j,k,m)
lhsm(3,i1,j) = lhsm(3,i1,j) -
> lhsm(2,i1,j)*lhsm(4,i,j)
lhsm(4,i1,j) = lhsm(4,i1,j) -
> lhsm(2,i1,j)*lhsm(5,i,j)
rhs(i1,j,k,m) = rhs(i1,j,k,m) -
> lhsm(2,i1,j)*rhs(i,j,k,m)
!---------------------------------------------------------------------
! Scale the last row immediately
!---------------------------------------------------------------------
rhs(i1,j,k,4) = rhs(i1,j,k,4)/lhsp(3,i1,j)
rhs(i1,j,k,5) = rhs(i1,j,k,5)/lhsm(3,i1,j)
end do
!---------------------------------------------------------------------
! BACKSUBSTITUTION
!---------------------------------------------------------------------
!$acc region do parallel
do j = 1, ny2
i = grid_points(1)-2
i1 = grid_points(1)-1
do m = 1, 3
rhs(i,j,k,m) = rhs(i,j,k,m) -
> lhs(4,i,j)*rhs(i1,j,k,m)
end do
rhs(i,j,k,4) = rhs(i,j,k,4) -
> lhsp(4,i,j)*rhs(i1,j,k,4)
rhs(i,j,k,5) = rhs(i,j,k,5) -
> lhsm(4,i,j)*rhs(i1,j,k,5)
end do
!---------------------------------------------------------------------
! The first three factors
!---------------------------------------------------------------------
!$acc region do parallel
do j = 1, ny2
do i = grid_points(1)-3, 0, -1
i1 = i + 1
i2 = i + 2
do m = 1, 3
rhs(i,j,k,m) = rhs(i,j,k,m) -
> lhs(4,i,j)*rhs(i1,j,k,m) -
> lhs(5,i,j)*rhs(i2,j,k,m)
end do
!---------------------------------------------------------------------
! And the remaining two
!---------------------------------------------------------------------
rhs(i,j,k,4) = rhs(i,j,k,4) -
> lhsp(4,i,j)*rhs(i1,j,k,4) -
> lhsp(5,i,j)*rhs(i2,j,k,4)
rhs(i,j,k,5) = rhs(i,j,k,5) -
> lhsm(4,i,j)*rhs(i1,j,k,5) -
> lhsm(5,i,j)*rhs(i2,j,k,5)
end do
end do
end do
!$acc end data region
return
end subroutine
(Output from the compile)
x_solve:
54, Generating copyin(speed(:,:,:))
Generating copyin(grid_points(:))
Generating copyin(us(:,:,:))
Generating copyin(rho_i(:,:,:))
Generating copy(lhsp(:,:,:))
Generating copy(lhsm(:,:,:))
82, Generating copy(lhs(:,:,:))
Generating compute capability 2.0 binary
83, Loop is parallelizable
84, Loop is parallelizable
Accelerator kernel generated
83, !$acc do parallel ! blockidx%y
84, !$acc do parallel, vector(256) ! blockidx%x threadidx%x
CC 2.0 : 19 registers; 8 shared, 184 constant, 0 local memory bytes; 100% occupancy
93, Loop is parallelizable
Accelerator kernel generated
83, !$acc do parallel ! blockidx%y
Non-stride-1 accesses for array 'lhs'
Cached references to size [34] block of 'rhon'
Cached references to size [34] block of 'cv'
93, !$acc do parallel, vector(32) ! blockidx%x threadidx%x
CC 2.0 : 19 registers; 552 shared, 96 constant, 0 local memory bytes; 16% occupancy
104, Generating copy(lhs(2:5,1:2,1:ny2))
Generating compute capability 2.0 binary
105, Loop is parallelizable
Accelerator kernel generated
105, !$acc do parallel, vector(32) ! blockidx%x threadidx%x
Non-stride-1 accesses for array 'lhs'
CC 2.0 : 20 registers; 4 shared, 88 constant, 0 local memory bytes; 16% occupancy
127, Generating copy(lhs(1:4,grid_points-3:grid_points-2,1:ny2))
Generating compute capability 2.0 binary
128, Loop is parallelizable
Accelerator kernel generated
128, !$acc do parallel, vector(32) ! blockidx%x threadidx%x
Non-stride-1 accesses for array 'lhs'
CC 2.0 : 15 registers; 4 shared, 96 constant, 0 local memory bytes; 16% occupancy
144, Generating copyin(lhs(1:5,1:nx2,1:ny2))
Generating compute capability 2.0 binary
145, Loop is parallelizable
146, Loop is parallelizable
Accelerator kernel generated
145, !$acc do parallel ! blockidx%y
Non-stride-1 accesses for array 'lhs'
Non-stride-1 accesses for array 'lhsp'
Cached references to size [34] block of 'speed'
Non-stride-1 accesses for array 'lhsm'
146, !$acc do parallel, vector(32) ! blockidx%x threadidx%x
CC 2.0 : 27 registers; 280 shared, 116 constant, 0 local memory bytes; 16% occupancy
171, Generating copy(lhs(1:5,0:grid_points-1,1:ny2))
Generating compute capability 2.0 binary
172, Loop is parallelizable
173, Loop carried dependence of 'lhs' prevents parallelization
Loop carried backward dependence of 'lhs' prevents vectorization
Loop carried dependence of 'rhs' prevents parallelization
Loop carried backward dependence of 'rhs' prevents vectorization
Accelerator kernel generated
172, !$acc do parallel, vector(32) ! blockidx%x threadidx%x
173, !$acc do seq
Non-stride-1 accesses for array 'lhs'
CC 2.0 : 23 registers; 4 shared, 120 constant, 0 local memory bytes; 16% occupancy
179, Loop is parallelizable
186, Loop is parallelizable
194, Loop is parallelizable
206, Generating copy(lhs(2:5,grid_points-2:grid_points-1,1:ny2))
Generating copy(rhs(grid_points-2:grid_points-1,1:ny2,k,1:3))
Generating compute capability 2.0 binary
207, Loop is parallelizable
Accelerator kernel generated
207, !$acc do parallel, vector(32) ! blockidx%x threadidx%x
Non-stride-1 accesses for array 'lhs'
CC 2.0 : 20 registers; 4 shared, 96 constant, 0 local memory bytes; 16% occupancy
213, Loop is parallelizable
220, Loop is parallelizable
228, Loop is parallelizable
236, Generating copy(rhs(0:grid_points-1,1:ny2,k,4:5))
Generating compute capability 2.0 binary
237, Loop is parallelizable
238, Loop carried dependence of 'lhsp' prevents parallelization
Loop carried backward dependence of 'lhsp' prevents vectorization
Loop carried dependence of 'rhs' prevents parallelization
Loop carried backward dependence of 'rhs' prevents vectorization
Loop carried dependence of 'lhsm' prevents parallelization
Loop carried backward dependence of 'lhsm' prevents vectorization
Inner sequential loop scheduled on accelerator
Accelerator kernel generated
237, !$acc do parallel, vector(32) ! blockidx%x threadidx%x
238, !$acc do seq
Non-stride-1 accesses for array 'lhsp'
Non-stride-1 accesses for array 'rhs'
Non-stride-1 accesses for array 'lhsm'
CC 2.0 : 26 registers; 4 shared, 92 constant, 0 local memory bytes; 16% occupancy
281, Generating copy(rhs(grid_points-2:grid_points-1,1:ny2,k,4:5))
Generating compute capability 2.0 binary
282, Loop is parallelizable
Accelerator kernel generated
282, !$acc do parallel, vector(32) ! blockidx%x threadidx%x
Non-stride-1 accesses for array 'lhsp'
Non-stride-1 accesses for array 'rhs'
Non-stride-1 accesses for array 'lhsm'
CC 2.0 : 23 registers; 4 shared, 104 constant, 0 local memory bytes; 16% occupancy
318, Generating copyin(lhs(4,grid_points-2,1:ny2))
Generating copyin(rhs(grid_points-2:grid_points-1,1:ny2,k,1:5))
Generating copyout(rhs(grid_points-2,1:ny2,k,1:5))
Generating compute capability 2.0 binary
319, Loop is parallelizable
Accelerator kernel generated
319, !$acc do parallel, vector(32) ! blockidx%x threadidx%x
Non-stride-1 accesses for array 'lhsp'
Non-stride-1 accesses for array 'rhs'
Non-stride-1 accesses for array 'lhsm'
Non-stride-1 accesses for array 'lhs'
CC 2.0 : 20 registers; 4 shared, 96 constant, 0 local memory bytes; 16% occupancy
322, Loop is parallelizable
336, Generating copy(rhs(0:grid_points-1,1:ny2,k,1:5))
Generating copyin(lhs(4:5,0:grid_points-3,1:ny2))
Generating compute capability 2.0 binary
337, Loop is parallelizable
338, Loop carried dependence of 'rhs' prevents parallelization
Loop carried backward dependence of 'rhs' prevents vectorization
Accelerator kernel generated
337, !$acc do parallel, vector(32) ! blockidx%x threadidx%x
338, !$acc do seq
Non-stride-1 accesses for array 'lhsp'
Non-stride-1 accesses for array 'rhs'
Non-stride-1 accesses for array 'lhsm'
Non-stride-1 accesses for array 'lhs'
CC 2.0 : 26 registers; 4 shared, 96 constant, 0 local memory bytes; 16% occupancy
341, Loop is parallelizable