Given the code below, does anyone know why the ikj loop version is much faster (by a factor of 2.3 on an A100 GPU) than the kji loop version?
The main program of my code contains the loop nest shown below.
The da2_v1 version of the computing function runs its loops in kji order; the da2_v2 version runs them in ikj order. In both cases the i loop is vectorized, because it is the loop that accesses memory contiguously, which is good practice for memory coalescing.
In the kji (da2_v1) version, I let nvfortran (23.11) choose which loop to vectorize. The compiler listing reports:
247, !$acc loop seq
Generating implicit reduction(+:da2)
250, !$acc loop seq
Generating implicit reduction(+:da2)
253, !$acc loop vector ! threadidx%x
Generating implicit reduction(+:da2)
Vector barrier inserted for vector loop reduction
247, Loop is parallelizable
250, Loop is parallelizable
253, Loop is parallelizable
Generated vector simd code for the loop containing reductions
256, FMA (fused multiply-add) instruction(s) generated
In the ikj (da2_v2) version of the computing function, I apply “!$acc loop vector reduction(+:da2)” to the i loop, which I have made the outermost loop. The compiler reports:
264, Generating NVIDIA GPU code
280, !$acc loop vector ! threadidx%x
Generating reduction(+:da2)
283, !$acc loop seq
286, !$acc loop seq
291, Vector barrier inserted for vector loop reduction
280, Loop is parallelizable
283, Loop is parallelizable
286, Loop is parallelizable
Generated vector simd code for the loop containing reductions
289, FMA (fused multiply-add) instruction(s) generated
Main program:
!$acc parallel loop gang collapse(3) copyin(phi, r, steps, nr, cnt, n_phi) copyout(result3d)
do kr = 1, nr
  do jr = 1, nr
    do ir = 1, nr
      result3d(ir, jr, kr) = da2_v1(phi, r(ir:ir), r(jr:jr), r(kr:kr), steps, n_phi, cnt)
      !result3d(ir, jr, kr) = da2_v2(phi, r(ir:ir), r(jr:jr), r(kr:kr), steps, n_phi, cnt)
    end do
  end do
end do
!$acc end parallel loop
!! -
da2_v1 (kji) version of the computing function:
!!
pure function da2_v1(phi, iip, jjp, kkp, iim, jjm, kkm, n_phi, n_red, cnt) result(da2)
  !$acc routine vector
  implicit none
  integer, intent(in) :: n_red(3), n_phi(3)
  real(kind=rp), intent(in) :: phi(n_phi(1), n_phi(2), n_phi(3))
  integer, intent(in) :: iim(n_red(1)), jjm(n_red(2)), kkm(n_red(3))
  integer, intent(in) :: iip(n_red(1)), jjp(n_red(2)), kkp(n_red(3))
  integer, intent(in) :: cnt
  real(kind=rp) :: da2
  integer :: i, j, k, ip, jp, kp, im, jm, km

  da2 = 0
  do k = 1, n_red(3)
    km = kkm(k)
    kp = kkp(k)
    do j = 1, n_red(2)
      jm = jjm(j)
      jp = jjp(j)
      do i = 1, n_red(1)
        im = iim(i)
        ip = iip(i)
        da2 = da2 + (phi(ip, jp, kp) - phi(im, jm, km))**2
      end do
    end do
  end do
  da2 = da2/cnt
end function da2_v1
!! -
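For a like-for-like comparison, the schedule that the listing reports for da2_v1 (k and j sequential, i as the vector loop with a reduction) can be written out with explicit directives instead of being left to the compiler. The following is only a sketch under assumptions: the module name da2_explicit_mod and the function name da2_v1_explicit are hypothetical, and rp is assumed to be double precision because its definition is not shown here. The loop body is the same as in da2_v1.

```fortran
module da2_explicit_mod
  implicit none
  ! Assumption: rp is double precision; its definition is not shown in the post.
  integer, parameter :: rp = kind(1.0d0)
contains
  pure function da2_v1_explicit(phi, iip, jjp, kkp, iim, jjm, kkm, n_phi, n_red, cnt) result(da2)
    !$acc routine vector
    integer, intent(in) :: n_red(3), n_phi(3)
    real(kind=rp), intent(in) :: phi(n_phi(1), n_phi(2), n_phi(3))
    integer, intent(in) :: iim(n_red(1)), jjm(n_red(2)), kkm(n_red(3))
    integer, intent(in) :: iip(n_red(1)), jjp(n_red(2)), kkp(n_red(3))
    integer, intent(in) :: cnt
    real(kind=rp) :: da2
    integer :: i, j, k, ip, jp, kp, im, jm, km

    da2 = 0
    ! k and j are marked sequential, matching "loop seq" in the listing.
    !$acc loop seq
    do k = 1, n_red(3)
      km = kkm(k)
      kp = kkp(k)
      !$acc loop seq
      do j = 1, n_red(2)
        jm = jjm(j)
        jp = jjp(j)
        ! Only the innermost i loop is spread across the vector lanes.
        !$acc loop vector reduction(+:da2)
        do i = 1, n_red(1)
          im = iim(i)
          ip = iip(i)
          da2 = da2 + (phi(ip, jp, kp) - phi(im, jm, km))**2
        end do
      end do
    end do
    da2 = da2/cnt
  end function da2_v1_explicit
end module da2_explicit_mod
```

Without OpenACC enabled, the directives are plain comments and the function compiles as ordinary serial Fortran, so the result can be checked on the CPU first.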
da2_v2 (ikj) version of the computing function:
!!
pure function da2_v2(phi, iip, jjp, kkp, iim, jjm, kkm, n_phi, n_red, cnt) result(da2)
  !$acc routine vector
  implicit none
  integer, intent(in) :: n_red(3), n_phi(3)
  real(kind=rp), intent(in) :: phi(n_phi(1), n_phi(2), n_phi(3))
  integer, intent(in) :: iim(n_red(1)), jjm(n_red(2)), kkm(n_red(3))
  integer, intent(in) :: iip(n_red(1)), jjp(n_red(2)), kkp(n_red(3))
  integer, intent(in) :: cnt
  real(kind=rp) :: da2
  integer :: i, j, k, ip, jp, kp, im, jm, km

  da2 = 0
  !$acc loop vector reduction(+:da2)
  do i = 1, n_red(1)
    im = iim(i)
    ip = iip(i)
    do k = 1, n_red(3)
      km = kkm(k)
      kp = kkp(k)
      do j = 1, n_red(2)
        jm = jjm(j)
        jp = jjp(j)
        da2 = da2 + (phi(ip, jp, kp) - phi(im, jm, km))**2
      end do
    end do
  end do
  da2 = da2/cnt
end function da2_v2
!!
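Since the two versions differ only in loop order, they should return the same value up to floating-point rounding. A small serial check along the following lines can confirm that before timing anything on the GPU. It is a sketch, not the actual test case: the sizes n and m, the index arrays idx_m and idx_p, and the random phi are all made up for illustration, and rp is again assumed to be double precision.

```fortran
program check_loop_orders
  implicit none
  integer, parameter :: rp = kind(1.0d0)  ! assumption: rp is not shown in the post
  integer, parameter :: n = 8             ! hypothetical extent of phi in each dimension
  integer, parameter :: m = 4             ! hypothetical number of index pairs per dimension
  real(kind=rp) :: phi(n, n, n)
  integer :: idx_m(m), idx_p(m)
  real(kind=rp) :: s_kji, s_ikj
  integer :: i, j, k

  call random_number(phi)

  ! Hypothetical index sets: each "plus" point sits 2 cells above its "minus" point.
  do i = 1, m
    idx_m(i) = i
    idx_p(i) = i + 2
  end do

  ! kji order, as in da2_v1 (i innermost).
  s_kji = 0
  do k = 1, m
    do j = 1, m
      do i = 1, m
        s_kji = s_kji + (phi(idx_p(i), idx_p(j), idx_p(k)) &
                       - phi(idx_m(i), idx_m(j), idx_m(k)))**2
      end do
    end do
  end do

  ! ikj order, as in da2_v2 (i outermost).
  s_ikj = 0
  do i = 1, m
    do k = 1, m
      do j = 1, m
        s_ikj = s_ikj + (phi(idx_p(i), idx_p(j), idx_p(k)) &
                       - phi(idx_m(i), idx_m(j), idx_m(k)))**2
      end do
    end do
  end do

  print *, 'kji sum:', s_kji
  print *, 'ikj sum:', s_ikj
  print *, 'abs diff:', abs(s_kji - s_ikj)
end program check_loop_orders
```

Any difference printed should be at the level of rounding error, because both loop nests add the same terms in a different order.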