I have some OpenACC code that runs and gives the correct result for a production science code on the Titan supercomputer. The issue is that the performance is very bad. In fact, it’s much worse than the performance I get on my local workstation with a commodity gamer GPU. I’m hoping the forum may be able to help me understand what’s going on.

Being a production science code, it’s quite dense. I’m going to try providing some snippets and see if that’s enough to make progress. If not, I can try showing more of the code.

Here’s the primary compute region/loop:

```
!$acc data copyin(sold(lo(1):hi(1),lo(2):hi(2),lo(3):hi(3),:)) &
!$acc copyin(tempbar_init(0:hi(3)), ldt) &
!$acc copyout(snew(lo(1):hi(1),lo(2):hi(2),lo(3):hi(3),:)) &
!$acc copyout(rho_omegadot(lo(1):hi(1),lo(2):hi(2),lo(3):hi(3),:) ) &
!$acc copyout(rho_Hnuc(lo(1):hi(1),lo(2):hi(2),lo(3):hi(3))) &
!$acc copyout(rho_Hext(lo(1):hi(1),lo(2):hi(2),lo(3):hi(3)))
!$acc parallel loop gang vector collapse(3) private(rho,x_in,T_in,x_test,x_out) &
!$acc private(rhowdot,rhoH,sumX,n)
do k = lo(3), hi(3)
do j = lo(2), hi(2)
do i = lo(1), hi(1)
rho = sold(i,j,k,rho_comp)
x_in = sold(i,j,k,spec_comp:spec_comp+nspec-1) / rho
if (drive_initial_convection) then
T_in = tempbar_init(k)
else
T_in = sold(i,j,k,temp_comp)
endif
if (ispec_threshold > 0) then
x_test = x_in(ispec_threshold)
else
x_test = ZERO
endif
if (rho > burning_cutoff_density .and. &
( ispec_threshold < 0 .or. &
(ispec_threshold > 0 .and. &
x_test > burner_threshold_cutoff))) then
call burner(rho, T_in, x_in, ldt, x_out, rhowdot, rhoH)
else
x_out = x_in
rhowdot = 0.d0
rhoH = 0.d0
endif
! check if sum{X_k} = 1
sumX = ZERO
do n = 1, nspec
sumX = sumX + x_out(n)
enddo
snew(i,j,k,rho_comp) = sold(i,j,k,rho_comp)
snew(i,j,k,pi_comp) = sold(i,j,k,pi_comp)
snew(i,j,k,spec_comp:spec_comp+nspec-1) = x_out(1:nspec) * rho
rho_omegadot(i,j,k,1:nspec) = rhowdot(1:nspec)
rho_Hnuc(i,j,k) = rhoH
snew(i,j,k,rhoh_comp) = sold(i,j,k,rhoh_comp) &
+ ldt*rho_Hnuc(i,j,k) + ldt*rho_Hext(i,j,k)
snew(i,j,k,trac_comp:trac_comp+ntrac-1) = &
sold(i,j,k,trac_comp:trac_comp+ntrac-1)
enddo
enddo
enddo
!$acc end parallel
!$acc end data
```

The subroutine `burner`

is marked up as `!$acc routine seq`

, as are all routines it calls. Here’s the loop that ultimately does most of the work in each sequential thread:

```
subroutine bdf_advance(ts, neq, npt, y0, t0, y1, t1, dt0, reset, reuse, ierr, initial_call)
!$acc routine seq
type(bdf_ts), intent(inout) :: ts
integer, intent(in ) :: neq, npt
real(dp_t), intent(in ) :: y0(neq,npt), t0, t1, dt0
real(dp_t), intent( out) :: y1(neq,npt)
logical, intent(in ) :: reset, reuse
integer, intent( out) :: ierr
logical, intent(in ) :: initial_call
integer :: k, p, m, n
logical :: retry, linitial
if (reset) call bdf_reset(ts, y0, dt0, reuse)
ierr = BDF_ERR_SUCCESS
ts%t1 = t1; ts%t = t0; ts%ncse = 0; ts%ncdtmin = 0;
do k = 1, bdf_max_iters + 1
call bdf_update(ts) ! update various coeffs (l, tq) based on time-step history
call bdf_predict(ts) ! predict nordsieck array using pascal matrix
call bdf_solve(ts) ! solve for y_n based on predicted y and yd
call bdf_check(ts, retry, ierr) ! check for solver errors and test error estimate
if (.not. retry) then
call bdf_correct(ts) ! new solution looks good, correct history and advance
call bdf_adjust(ts) ! adjust step-size/order
endif
if (ts%t >= t1 .or. ierr /= BDF_ERR_SUCCESS) exit
end do
if (ts%n > ts%max_steps .or. k > bdf_max_iters) then
ierr = BDF_ERR_MAXSTEPS
end if
do p = 1, ts%npt
do m = 1, ts%neq
y1(m,p) = ts%z0(m,p,0)
end do
end do
end subroutine bdf_advance
```

The compilation command and output for the primary OpenACC region looks like this:

```
ftn -module t/Linux.PGI.acc/m -It/Linux.PGI.acc/m -acc -Minfo=acc -I/ccs/home/ajacobs/Codebase/Microphysics/eos/helmholtz -c -o t/Linux.PGI.acc/o/react_state.o /ccs/home/ajacobs/Codebase/MAESTRO/Source/react_state.f90
burner_loop_3d:
764, Generating copyin(sold(lo:hi,lo:hi,lo:hi,:),tempbar_init(0:hi),ldt,rho_hext(lo:hi,lo:hi,lo:hi))
Generating copyout(snew(lo:hi,lo:hi,lo:hi,:),rho_omegadot(lo:hi,lo:hi,lo:hi,:),rho_hnuc(lo:hi,lo:hi,lo:hi))
771, Accelerator kernel generated
Generating Tesla code
775, !$acc loop gang, vector(128) collapse(3) ! blockidx%x threadidx%x
776, ! blockidx%x threadidx%x collapsed
777, ! blockidx%x threadidx%x collapsed
785, !$acc loop seq
821, !$acc loop seq
836, !$acc loop seq
842, !$acc loop seq
852, !$acc loop seq
785, Loop is parallelizable
821, Loop is parallelizable
836, Loop carried reuse of snew prevents parallelization
842, Loop is parallelizable
852, Loop carried reuse of snew prevents parallelization
```

Even when this triple loop is 64**3, this code runs slower than the version compiled without OpenACC and run in serial. Does anyone have insight into why it runs so slow? From PGI_ACC_NOTIFY=3 I know there’s not any data thrashing. There’s just some upload before the loop and download after, and only done with data that is needed. The compute just seems to be running very slow and it’s very hard to profile as to why.