Fortran OpenACC code performance with nvfortran

Given the code versions described below, do you know why the ikj loop version is so much faster (by a factor of 2.3 on an A100 GPU) than the kji loop version?


I have a code whose main program contains the loop nest described below.
The da2_v1 version of the computing function executes the loops in kji order (k outermost, i innermost), while the da2_v2 version executes them in ikj order. In both cases the i loop is vectorized (it is the loop that accesses memory contiguously, which is the good practice for memory coalescing).

In the kji (da2_v1) version, I let nvfortran (23.11) choose which loop to vectorize, and the compiler listing tells me it applies:
247, !$acc loop seq
Generating implicit reduction(+:da2)
250, !$acc loop seq
Generating implicit reduction(+:da2)
253, !$acc loop vector ! threadidx%x
Generating implicit reduction(+:da2)
Vector barrier inserted for vector loop reduction
247, Loop is parallelizable
250, Loop is parallelizable
253, Loop is parallelizable
Generated vector simd code for the loop containing reductions
256, FMA (fused multiply-add) instruction(s) generated

In the ikj (da2_v2) version of the computing function, I apply “!$acc loop vector reduction(+:da2)” to the i loop, which I have made the outermost loop. The compiler tells me:
264, Generating NVIDIA GPU code
280, !$acc loop vector ! threadidx%x
Generating reduction(+:da2)
283, !$acc loop seq
286, !$acc loop seq
291, Vector barrier inserted for vector loop reduction
280, Loop is parallelizable
283, Loop is parallelizable
286, Loop is parallelizable
Generated vector simd code for the loop containing reductions
289, FMA (fused multiply-add) instruction(s) generated

  • Main program:
    !$acc parallel loop gang collapse(3) copyin(phi, r, steps, nr, cnt, n_phi) copyout(result3d)
    do kr = 1, nr
    do jr = 1, nr
    do ir = 1, nr
    result3d(ir, jr, kr) = da2_v1(phi, r(ir:ir), r(jr:jr), r(kr:kr), steps, n_phi, cnt)
    !result3d(ir, jr, kr) = da2_v2(phi, r(ir:ir), r(jr:jr), r(kr:kr), steps, n_phi, cnt)
    end do
    end do
    end do
    !$acc end parallel loop
    !!

  • da2_v1 (kji) version of the computing function:
    !!
    pure function da2_v1(phi, iip, jjp, kkp, iim, jjm, kkm, n_phi, n_red, cnt) result(da2)
    !$acc routine vector
    implicit none

      integer, intent(in) :: n_red(3), n_phi(3)
      real(kind=rp), intent(in) :: phi(n_phi(1), n_phi(2), n_phi(3))
      integer, intent(in) :: iim(n_red(1)), jjm(n_red(2)), kkm(n_red(3))
      integer, intent(in) :: iip(n_red(1)), jjp(n_red(2)), kkp(n_red(3))
      integer, intent(in) :: cnt
      real(kind=rp) :: da2
      integer :: i, j, k, ip, jp, kp, im, jm, km
    
      da2 = 0
      do k = 1, n_red(3)
          km = kkm(k)
          kp = kkp(k)
          do j = 1, n_red(2)
              jm = jjm(j)
              jp = jjp(j)
              do i = 1, n_red(1)
                  im = iim(i)
                  ip = iip(i)
                  da2 = da2 + (phi(ip, jp, kp) - phi(im, jm, km))**2
              end do
          end do
      end do
      da2 = da2/cnt
    

    end function da2_v1
    !!

  • da2_v2 (ikj) version of the computing function:
    !!
    pure function da2_v2(phi, iip, jjp, kkp, iim, jjm, kkm, n_phi, n_red, cnt) result(da2)
    !$acc routine vector
    implicit none

      integer, intent(in) :: n_red(3), n_phi(3)
      real(kind=rp), intent(in) :: phi(n_phi(1), n_phi(2), n_phi(3))
      integer, intent(in) :: iim(n_red(1)), jjm(n_red(2)), kkm(n_red(3))
      integer, intent(in) :: iip(n_red(1)), jjp(n_red(2)), kkp(n_red(3))
      integer, intent(in) :: cnt
      real(kind=rp) :: da2
      integer :: i, j, k, ip, jp, kp, im, jm, km
    
      da2 = 0
      !$acc loop vector reduction(+:da2)
      do i = 1, n_red(1)
          im = iim(i)
          ip = iip(i)
          do k = 1, n_red(3)
              km = kkm(k)
              kp = kkp(k)
              do j = 1, n_red(2)
                  jm = jjm(j)
                  jp = jjp(j)
                  da2 = da2 + (phi(ip, jp, kp) - phi(im, jm, km))**2
              end do
          end do
      end do
      da2 = da2/cnt
    

    end function da2_v2
    !!

Hi pbm,

In the “kji” case, the outer two loops are run sequentially by one thread until execution goes parallel in the inner loop. Also, a barrier is encountered after each inner loop, which can slow things down.

With “ikj”, the inner k and j loops are still sequential per thread, but the threads are being run in parallel so none are waiting. Also now there’s only a single barrier.
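
Sketching the execution structure of the “ikj” nest (my annotations, not compiler output), it looks roughly like:

      da2 = 0
! Go parallel here: the "i" iterations are spread across the threads
!$acc loop vector reduction(+:da2)
      do i = 1, n_red(1)
          im = iim(i)
          ip = iip(i)
! the k and j loops are sequential, but every thread runs them on its own "i" values
          do k = 1, n_red(3)
              km = kkm(k)
              kp = kkp(k)
              do j = 1, n_red(2)
                  jm = jjm(j)
                  jp = jjp(j)
                  da2 = da2 + (phi(ip, jp, kp) - phi(im, jm, km))**2
              end do
          end do
      end do
! end parallel: one final reduction of the partial sums, and one barrier, per call
      da2 = da2/cnt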

While I can't be sure that this accounts for the entirety of the difference without a reproducer, this would be my best guess.

What’s the value of “n_red(1)”? If it’s relatively small (like < 64), then it may be better to make “da2” a sequential routine so that “vector” is applied to the outer “ir” loop.
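
For reference, a minimal sketch of that alternative (the name “da2_seq” is just for illustration, and this is untested):

    pure function da2_seq(phi, iip, jjp, kkp, iim, jjm, kkm, n_phi, n_red, cnt) result(da2)
    !$acc routine seq
    implicit none

      integer, intent(in) :: n_red(3), n_phi(3)
      real(kind=rp), intent(in) :: phi(n_phi(1), n_phi(2), n_phi(3))
      integer, intent(in) :: iim(n_red(1)), jjm(n_red(2)), kkm(n_red(3))
      integer, intent(in) :: iip(n_red(1)), jjp(n_red(2)), kkp(n_red(3))
      integer, intent(in) :: cnt
      real(kind=rp) :: da2
      integer :: i, j, k, ip, jp, kp, im, jm, km

      ! whole nest runs on a single thread: no loop directives, no reduction epilogue
      da2 = 0
      do k = 1, n_red(3)
          km = kkm(k)
          kp = kkp(k)
          do j = 1, n_red(2)
              jm = jjm(j)
              jp = jjp(j)
              do i = 1, n_red(1)
                  im = iim(i)
                  ip = iip(i)
                  da2 = da2 + (phi(ip, jp, kp) - phi(im, jm, km))**2
              end do
          end do
      end do
      da2 = da2/cnt

    end function da2_seq

In the main program, the combined directive would then become something like “!$acc parallel loop gang vector collapse(3) …” so that “vector” is applied across the ir/jr/kr iterations rather than inside the routine.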

-Mat

Hi Mat,

Thanks for your interesting reply!
n_red(1) (= n_red(2) = n_red(3)) corresponds to the 3D mesh size, and we focus on 128^3.
I have attached a reproducer (the shell command “make && ./a.out 128” builds and runs it).

increment_4_forum.zip (3.4 KB)

  • ./a.out 128
    Mesh size: 128
    … Initialize with a sphere
    … n_red = 128 128 128
    … Compute da2_v1
    Walltime elapsed 18.00823180000000 s
    … Compute da2_v2
    Walltime elapsed 7.814525800000000 s
    … Done

Hi pbm,

What I’d suggest is that you profile the code using Nsight Compute. This will give you an in-depth, hardware-counter-based profile for each kernel. While it takes significantly longer, I usually use “ncu --set=full” to collect more data.
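For this reproducer that would look something like “ncu --set=full -o da2_report ./a.out 128” (the report name is just an example), and the two kernels can then be compared in the Nsight Compute UI.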

I did this for your code and the results were interesting.

v1 uses fewer registers (71 vs 95) and thus has a higher occupancy (46% vs 31%). Both have an L2 cache hit rate of about 100%, though v1 has a 95% L1 hit rate versus v2’s 30%. v1 has very little main-memory traffic, 35MB, while v2 has about 3.5x that, at 160MB in each direction. The warp stalls are a bit higher for v1 (mostly waiting for L1/L2 memory), but neither is significant. The theoretical and achieved occupancy are about the same for both, meaning stalls aren’t an issue.

From this data, v1 looks better than v2. However, when you look at the instruction counts, v1 executes about 8 trillion instructions versus 3.2 trillion for v2. Worse, v1 has significantly more memory loads, address calculations, branching, and barriers.

Why? Let’s look at the loop:

         da2 = 0
! SEQ section, only one thread is executing
         do k = 1, n_red(3)
             km = kkm(k)
             kp = kkp(k)
             do j = 1, n_red(2)
                 jm = jjm(j)
                 jp = jjp(j)
! needs a barrier here
! end SEQ
! Go parallel here
!$acc loop vector reduction(+:da2)
                 do i = 1, n_red(1)
                     im = iim(i)
                     ip = iip(i)
                     da2 = da2 + (phi(ip, jp, kp) - phi(im, jm, km))**2
                 end do
! end parallel
! Need to perform the final reduction for the "i" loop
! includes another barrier
! back to SEQ execution
             end do
         end do
         da2 = da2/cnt

Parallel reductions can be costly. Each thread keeps a partial reduction and then code needs to be inserted after the loop to gather the partial reductions into a single final reduction. In v1, the final reduction code needs to be executed 16,384 times per call. In v2, it’s done one time per call.
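
To make the 16,384 concrete: in v1 the reduction epilogue sits inside the j and k loops, so it runs once per (j, k) pair, i.e. 128 × 128 = 16,384 times per call, whereas in v2 it sits after the entire nest and runs exactly once per call.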

While v1 has better memory access, it seems that the reductions, and to a lesser degree the barriers, are hurting the performance.

-Mat
