I am learning OpenACC for Fortran and started with these slides, since I wanted to begin with an elliptic PDE solver. Slide 39, titled “Laplace Solver (OpenACC in Fortran, version 3)”, has an incomplete Fortran code, which I’ve completed below:
program laplace_gpu
  implicit none
  integer, parameter :: rows = 10000, columns = 10000, max_iterations = 10000
  real, parameter :: max_temp_error = 1.0e-6
  integer :: i, j, iteration
  real :: start_time, end_time, dt
  real :: A(rows + 2, columns + 2), A_new(rows + 2, columns + 2)

  ! Initialize the interior, then set the fixed boundary temperatures
  A = 0.0
  do i = 2, rows + 1
     A(i, 1) = 1.0
     A(i, columns + 2) = 2.0
  end do
  do j = 2, columns + 1
     A(1, j) = 3.0
     A(rows + 2, j) = 4.0
  end do

  dt = 10.0*max_temp_error
  iteration = 0
  call cpu_time(start_time)

  !$acc data copy(A) create(A_new)
  do while (dt > max_temp_error .and. iteration <= max_iterations)
     ! Jacobi update: average the four neighbors
     !$acc parallel loop
     do j = 2, columns + 1
        do i = 2, rows + 1
           A_new(i, j) = 0.25*(A(i+1, j) + A(i-1, j) + A(i, j+1) + A(i, j-1))
        end do
     end do
     !$acc end parallel loop

     ! Track the largest change and copy the new values back
     dt = 0.0
     !$acc parallel loop
     do j = 2, columns + 1
        do i = 2, rows + 1
           dt = max(abs(A_new(i, j) - A(i, j)), dt)
           A(i, j) = A_new(i, j)
        end do
     end do
     !$acc end parallel loop

     iteration = iteration + 1
  end do
  !$acc end data

  call cpu_time(end_time)
  print *, "Final iteration number:", iteration - 1
  print *, "Elapsed time (s):", end_time - start_time
end program laplace_gpu
I compile this with nvfortran -acc -fast -Minfo=all -Mneginfo=all laplace_gpu.f90. The run time printed at the end is 163.9991 seconds on my older NVIDIA GPU. The info printout says Generating implicit reduction(max:dt), so I suppose nvfortran is smart enough to figure out what is needed.
Note that on slide 39 the presenter has a side note: “Explicitly specify the reduction operator and variable.” However, the associated parallel region has no explicitly specified reduction, which I guess was an oversight. If I change the second !$acc parallel loop line to !$acc parallel loop reduction(max:dt), recompile, and run, the code takes 243.6209 seconds. The only difference in the info printout is that it now says Generating reduction(max:dt) (without the word “implicit”), which makes sense since the reduction is now explicit.
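To be concrete, the only change between the two timed runs is the directive on the second loop nest; everything else in the program is identical:

```fortran
! Version 1 (163.9991 s): nvfortran infers the max reduction on dt
!$acc parallel loop

! Version 2 (243.6209 s): the reduction clause from the slide's side note
!$acc parallel loop reduction(max:dt)
```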
Why is my explicit reduction slower than my implicit reduction? I think I must be doing something wrong.
Side note: I believe this issue is also present in version 25.9, but I’m using 25.7. I can’t use nvfortran 25.9 because I have an older Pascal GPU, and the release notes for 25.9 say “Maxwell, Pascal, and Volta GPUs are no longer supported starting with CUDA 13.0.” If I try 25.9, I just get a somewhat vague compiler error: error: -arch=compute_61 is an unsupported option.