Fortran OpenACC code performance with nvfortran

Given the code versions described below, do you know why the ikj loop version is so much faster (by a factor of 2.3 on an A100 GPU) than the kji loop version?


I have a code whose main program contains the loop nest described below.
The da2_v1 version of the computing function executes the loops in kji order (k outermost, i innermost), while the da2_v2 version executes them in ikj order. In both cases the i loop is vectorized (it is the loop that accesses memory contiguously, which is the good practice for memory coalescing).

In the kji (da2_v1) version, I let nvfortran (23.11) choose which loop to vectorize, and the compiler listing tells me it applies:
247, !$acc loop seq
Generating implicit reduction(+:da2)
250, !$acc loop seq
Generating implicit reduction(+:da2)
253, !$acc loop vector ! threadidx%x
Generating implicit reduction(+:da2)
Vector barrier inserted for vector loop reduction
247, Loop is parallelizable
250, Loop is parallelizable
253, Loop is parallelizable
Generated vector simd code for the loop containing reductions
256, FMA (fused multiply-add) instruction(s) generated

In the ikj (da2_v2) version of the computing function, I apply “!$acc loop vector reduction(+:da2)” to the i loop, which I have made the outermost loop. The compiler tells me:
264, Generating NVIDIA GPU code
280, !$acc loop vector ! threadidx%x
Generating reduction(+:da2)
283, !$acc loop seq
286, !$acc loop seq
291, Vector barrier inserted for vector loop reduction
280, Loop is parallelizable
283, Loop is parallelizable
286, Loop is parallelizable
Generated vector simd code for the loop containing reductions
289, FMA (fused multiply-add) instruction(s) generated

  • Main program:
    !$acc parallel loop gang collapse(3) copyin(phi, r, steps, nr, cnt, n_phi) copyout(result3d)
    do kr = 1, nr
    do jr = 1, nr
    do ir = 1, nr
    result3d(ir, jr, kr) = da2_v1(phi, r(ir:ir), r(jr:jr), r(kr:kr), steps, n_phi, cnt)
    !result3d(ir, jr, kr) = da2_v2(phi, r(ir:ir), r(jr:jr), r(kr:kr), steps, n_phi, cnt)
    end do
    end do
    end do
    !$acc end parallel loop
    !!

  • da2_v1 (kji) version of the computing function:
    !!
    pure function da2_v1(phi, iip, jjp, kkp, iim, jjm, kkm, n_phi, n_red, cnt) result(da2)
    !$acc routine vector
    implicit none

      integer, intent(in) :: n_red(3), n_phi(3)
      real(kind=rp), intent(in) :: phi(n_phi(1), n_phi(2), n_phi(3))
      integer, intent(in) :: iim(n_red(1)), jjm(n_red(2)), kkm(n_red(3))
      integer, intent(in) :: iip(n_red(1)), jjp(n_red(2)), kkp(n_red(3))
      integer, intent(in) :: cnt
      real(kind=rp) :: da2
      integer :: i, j, k, ip, jp, kp, im, jm, km
    
      da2 = 0
      do k = 1, n_red(3)
          km = kkm(k)
          kp = kkp(k)
          do j = 1, n_red(2)
              jm = jjm(j)
              jp = jjp(j)
              do i = 1, n_red(1)
                  im = iim(i)
                  ip = iip(i)
                  da2 = da2 + (phi(ip, jp, kp) - phi(im, jm, km))**2
              end do
          end do
      end do
      da2 = da2/cnt
    

    end function da2_v1
    !!

  • da2_v2 (ikj) version of the computing function:
    !!
    pure function da2_v2(phi, iip, jjp, kkp, iim, jjm, kkm, n_phi, n_red, cnt) result(da2)
    !$acc routine vector
    implicit none

      integer, intent(in) :: n_red(3), n_phi(3)
      real(kind=rp), intent(in) :: phi(n_phi(1), n_phi(2), n_phi(3))
      integer, intent(in) :: iim(n_red(1)), jjm(n_red(2)), kkm(n_red(3))
      integer, intent(in) :: iip(n_red(1)), jjp(n_red(2)), kkp(n_red(3))
      integer, intent(in) :: cnt
      real(kind=rp) :: da2
      integer :: i, j, k, ip, jp, kp, im, jm, km
    
      da2 = 0
      !$acc loop vector reduction(+:da2)
      do i = 1, n_red(1)
          im = iim(i)
          ip = iip(i)
          do k = 1, n_red(3)
              km = kkm(k)
              kp = kkp(k)
              do j = 1, n_red(2)
                  jm = jjm(j)
                  jp = jjp(j)
                  da2 = da2 + (phi(ip, jp, kp) - phi(im, jm, km))**2
              end do
          end do
      end do
      da2 = da2/cnt
    

    end function da2_v2
    !!

Hi pbm,

In the “kji” case, the outer two loops are run sequentially by one thread until execution goes parallel in the inner loop. Also, a barrier is encountered after each inner loop, which can slow things down.

With “ikj”, the inner k and j loops are still sequential per thread, but the threads are being run in parallel so none are waiting. Also now there’s only a single barrier.
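
Sketching the execution structure of the “ikj” nest (my annotations, not compiler output), it looks roughly like:

      da2 = 0
! Go parallel here: the "i" iterations are spread across the threads
!$acc loop vector reduction(+:da2)
      do i = 1, n_red(1)
          im = iim(i)
          ip = iip(i)
! the k and j loops are sequential, but every thread runs them on its own "i" values
          do k = 1, n_red(3)
              km = kkm(k)
              kp = kkp(k)
              do j = 1, n_red(2)
                  jm = jjm(j)
                  jp = jjp(j)
                  da2 = da2 + (phi(ip, jp, kp) - phi(im, jm, km))**2
              end do
          end do
      end do
! end parallel: one final reduction of the partial sums, and one barrier, per call
      da2 = da2/cnt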

While I can't be sure that this accounts for the entirety of the difference without a reproducer, this would be my best guess.

What’s the value of “n_red(1)”? If it’s relatively small (like < 64), then it may be better to make “da2” a sequential routine so that “vector” is applied to the outer “ir” loop.
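
For reference, a minimal sketch of that alternative (the name “da2_seq” is just for illustration, and this is untested):

    pure function da2_seq(phi, iip, jjp, kkp, iim, jjm, kkm, n_phi, n_red, cnt) result(da2)
    !$acc routine seq
    implicit none

      integer, intent(in) :: n_red(3), n_phi(3)
      real(kind=rp), intent(in) :: phi(n_phi(1), n_phi(2), n_phi(3))
      integer, intent(in) :: iim(n_red(1)), jjm(n_red(2)), kkm(n_red(3))
      integer, intent(in) :: iip(n_red(1)), jjp(n_red(2)), kkp(n_red(3))
      integer, intent(in) :: cnt
      real(kind=rp) :: da2
      integer :: i, j, k, ip, jp, kp, im, jm, km

      ! whole nest runs on a single thread: no loop directives, no reduction epilogue
      da2 = 0
      do k = 1, n_red(3)
          km = kkm(k)
          kp = kkp(k)
          do j = 1, n_red(2)
              jm = jjm(j)
              jp = jjp(j)
              do i = 1, n_red(1)
                  im = iim(i)
                  ip = iip(i)
                  da2 = da2 + (phi(ip, jp, kp) - phi(im, jm, km))**2
              end do
          end do
      end do
      da2 = da2/cnt

    end function da2_seq

In the main program, the combined directive would then become something like “!$acc parallel loop gang vector collapse(3) …” so that “vector” is applied across the ir/jr/kr iterations rather than inside the routine.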

-Mat

Hi Mat,

Thanks for your interesting reply!
n_red(1) (= n_red(2) = n_red(3)) corresponds to the 3D mesh size, and we focus on 128^3.
I have attached a reproducer (the shell command “make && ./a.out 128” builds and runs it).

increment_4_forum.zip (3.4 KB)

  • ./a.out 128
    Mesh size: 128
    … Initialize with a sphere
    … n_red = 128 128 128
    … Compute da2_v1
    Walltime elapsed 18.00823180000000 s
    … Compute da2_v2
    Walltime elapsed 7.814525800000000 s
    … Done

Hi pbm,

What I’d suggest is that you profile the code using Nsight Compute. This will give you an in-depth, hardware-counter-based profile for each kernel. While it takes significantly longer, I usually use “ncu --set=full” to collect more data.
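For this reproducer that would look something like “ncu --set=full -o da2_report ./a.out 128” (the report name is just an example), and the two kernels can then be compared in the Nsight Compute UI.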

I did this for your code and the results were interesting.

v1 uses fewer registers (71 vs 95) and thus has a higher occupancy (46% vs 31%). Both have an L2 cache hit rate of about 100%, though v1 has a 95% L1 hit rate versus v2’s 30%. v1 has very little main-memory traffic, 35MB, while v2 has about 3.5x that, at 160MB in each direction. The warp stalls are a bit higher for v1 (mostly waiting for L1/L2 memory), but neither is significant. The theoretical and achieved occupancy are about the same for both, meaning stalls aren’t an issue.

From this data, v1 looks better than v2. However, when you look at the instruction counts, v1 executes about 8 trillion instructions versus 3.2 trillion for v2. Worse, v1 has significantly more memory loads, address calculations, branching, and barriers.

Why? Let’s look at the loop:

         da2 = 0
! SEQ section, only one thread is executing
         do k = 1, n_red(3)
             km = kkm(k)
             kp = kkp(k)
             do j = 1, n_red(2)
                 jm = jjm(j)
                 jp = jjp(j)
! needs a barrier here
! end SEQ
! Go parallel here
!$acc loop vector reduction(+:da2)
                 do i = 1, n_red(1)
                     im = iim(i)
                     ip = iip(i)
                     da2 = da2 + (phi(ip, jp, kp) - phi(im, jm, km))**2
                 end do
! end parallel
! Need to perform the final reduction for the "i" loop
! includes another barrier
! back to SEQ execution
             end do
         end do
         da2 = da2/cnt

Parallel reductions can be costly. Each thread keeps a partial reduction and then code needs to be inserted after the loop to gather the partial reductions into a single final reduction. In v1, the final reduction code needs to be executed 16,384 times per call. In v2, it’s done one time per call.
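
To make the 16,384 concrete: in v1 the reduction epilogue sits inside the j and k loops, so it runs once per (j, k) pair, i.e. 128 × 128 = 16,384 times per call, whereas in v2 it sits after the entire nest and runs exactly once per call.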

While v1 has better memory access, it seems that the reductions, and to a lesser degree the barriers, are hurting the performance.

-Mat
