How to capture positions within nested loops?

Hi everyone,

I’m using OpenACC to compute blockwise correlation coefficients between three input arrays (BT1, BT2, BT3) and store the maximum correlation and its offset position for each output pixel. Here’s the relevant Fortran/OpenACC snippet:

!$acc data copyin(BT1,BT2,BT3) copyout(CC11,CC22)
!$acc parallel loop collapse(2) gang vector                     &
!$acc    private(ist2,js2,px,py,E_t,s_t,E_s,s_s,E_ts)             &
!$acc    reduction(max:corr1,corr2)
 do i = 1, amv_n
   do j = 1, amv_n

    ist2 = (i-1)*(sbox_sz/(1+ov_flag)) + 1
    js2 = (j-1)*(sbox_sz/(1+ov_flag)) + 1
    if (ist2<1 .or. ist2+sbox_sz-1>line_l1b) cycle
    if (js2<1 .or. js2+sbox_sz-1>line_l1b) cycle
    if (BT2(ist2+sbox_sz/2, js2+sbox_sz/2)<-500.0) cycle

    E_t = 0.0
    do dx=0,sbox_sz-1; do dy=0,sbox_sz-1
      E_t = E_t + BT2(ist2+dx, js2+dy)
    end do; end do
    E_t = E_t/(sbox_sz*sbox_sz)

    s_t = 0.0
    do dx=0,sbox_sz-1; do dy=0,sbox_sz-1
      s_t = s_t + (BT2(ist2+dx, js2+dy)-E_t)**2
    end do; end do
    s_t = sqrt(s_t/(sbox_sz*sbox_sz))


    corr1 = -1.0
    corr2 = -1.0


    !$acc loop collapse(2) vector private(tx1,ty1, tx2,ty2)
    do x = 0, bbox_sz-sbox_sz
      do y = 0, bbox_sz-sbox_sz
        px = ist2 + x - (bbox_sz-sbox_sz)/(1+ov_flag)
        py = js2 + y - (bbox_sz-sbox_sz)/(1+ov_flag)
        if (px<1 .or. py<1) cycle
        if (px+sbox_sz-1>line_l1b .or. py+sbox_sz-1>line_l1b) cycle

        E_s = 0.0
        !$acc loop reduction(+:E_s)
        do dx=0,sbox_sz-1; do dy=0,sbox_sz-1
          E_s = E_s + BT1(px+dx,py+dy)
        end do; end do
        E_s = E_s/(sbox_sz*sbox_sz)

        s_s  = 0.0
        E_ts = 0.0
        !$acc loop collapse(2) reduction(+:s_s,E_ts)
        do dx=0,sbox_sz-1; do dy=0,sbox_sz-1
          s_s  = s_s  + (BT1(px+dx,py+dy)-E_s)**2
          E_ts = E_ts + (BT2(ist2+dx,js2+dy)-E_t)*(BT1(px+dx,py+dy)-E_s)
        end do; end do
        s_s = sqrt(s_s/(sbox_sz*sbox_sz))
        E_ts = E_ts/(sbox_sz*sbox_sz)

        if (s_t>0.0 .and. s_s>0.0) then
          corr1 = max(corr1, E_ts/(s_t*s_s))
          tx1=x; ty1=y
        end if
        !================================
        E_s  = 0.0
        !$acc loop reduction(+:E_s)
        do dx=0,sbox_sz-1; do dy=0,sbox_sz-1
          E_s = E_s + BT3(px+dx,py+dy)
        end do; end do
        E_s = E_s/(sbox_sz*sbox_sz)

        s_s  = 0.0
        E_ts = 0.0
        !$acc loop collapse(2) reduction(+:s_s,E_ts)
        do dx=0,sbox_sz-1; do dy=0,sbox_sz-1
          s_s  = s_s  + (BT3(px+dx,py+dy)-E_s)**2
          E_ts = E_ts + (BT2(ist2+dx,js2+dy)-E_t)*(BT3(px+dx,py+dy)-E_s)
        end do; end do
        s_s  = sqrt(s_s/(sbox_sz*sbox_sz))
        E_ts = E_ts/(sbox_sz*sbox_sz)

        if (s_t>0.0 .and. s_s>0.0) then
          corr2 = max(corr2, E_ts/(s_t*s_s))
          tx2 = x;ty2 = y
        end if

      end do
    end do

    if (i==amv_n/4 .and. j==amv_n/4) then
      print *, 'DEBUG device: corr1=',corr1,' corr2=',corr2
    end if

    CC11(1,i,j)=corr1; CC22(1,i,j)=corr2
    CC11(2,i,j)=tx1; CC22(2,i,j)=ty1
    CC11(3,i,j)=tx2; CC22(3,i,j)=ty2


  end do
end do
!$acc end parallel
!$acc end data

I’m using reduction(max:corr1,corr2) in these nested OpenACC loops to compute the maximum correlation values. While corr1/corr2 come back correct, the associated index variables tx1/ty1 remain zero on the host, because the built-in max reduction only tracks values, not positions. How can I capture the argmax coordinates (x,y) within these loops? Must I resort to a two-pass scan or an atomic capture, or is there an OpenACC pattern that handles an index-aware reduction in one go? To illustrate the problem more clearly, I’ve uploaded a minimal test code and sample data to GitHub - random01203/maxcc_loc_test (this code is used for the max correlation location test). Please feel free to clone the repo and reproduce the issue.

Unfortunately, there’s no way to capture this in a single pass. Each thread is working on a partial reduction and then the compiler needs to insert a second kernel call later to obtain the final reduction. It can’t track which location the max is coming from.

Instead, you need to perform a second pass to find the location.
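Roughly along these lines (a minimal sketch of the two-pass pattern for one of the correlations; block_corr, loc1, and the tolerance are my own placeholders, not part of your code, and block_corr stands for whatever recomputes the normalized correlation for offset (x,y)):

    ! Pass 1: value-only max reduction, as in the original code.
    corr1 = -1.0
    !$acc loop collapse(2) vector reduction(max:corr1)
    do x = 0, bbox_sz-sbox_sz
      do y = 0, bbox_sz-sbox_sz
        corr1 = max(corr1, block_corr(x, y))   ! block_corr is hypothetical
      end do
    end do

    ! Pass 2: rescan and record an offset whose correlation matches the max.
    ! Encoding (x,y) as one integer lets a plain max reduction pick a location;
    ! if several offsets tie, the largest encoded index wins.
    loc1 = -1
    !$acc loop collapse(2) vector reduction(max:loc1)
    do x = 0, bbox_sz-sbox_sz
      do y = 0, bbox_sz-sbox_sz
        if (block_corr(x, y) >= corr1 - 1.0e-6) then
          loc1 = max(loc1, x*(bbox_sz-sbox_sz+1) + y)
        end if
      end do
    end do
    tx1 = loc1 / (bbox_sz-sbox_sz+1)
    ty1 = mod(loc1, bbox_sz-sbox_sz+1)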

Thank you very much for your reply!

Hello! I have another question I’d like to ask you. My current code takes quite a long time to run. I tried using the cache directive to reduce redundant reads and speed up the computation, but the runtime increased instead. Could you take a look and tell me what might be wrong with my approach? Is there any part of the code that could be causing such a significant slowdown? Thank you again for your help!

subroutine calculate_VD_acc_tiling(BT1, BT2, BT3, lon, lat,         &
                                 sbox_sz, bbox_sz, ov_flag,       &
                                 amv_n, line_l1b, dt,             &
                                 CC11, CC22)
implicit none
real, dimension(:,:), intent(in)        :: BT1, BT2, BT3, lon, lat
integer, intent(in)                     :: sbox_sz, bbox_sz, amv_n, line_l1b
integer(kind=1), intent(in)             :: ov_flag
real, intent(in)                        :: dt
real, dimension(4, amv_n, amv_n), intent(out) :: CC11, CC22

integer :: i, j, x, y, ist2, js2, px, py, boundary
real    :: E_t, s_t, corr1, corr2, E_s, s_s, E_ts
integer :: dx, dy, xi, yj
real    :: a1, a2

real, dimension(sbox_sz, sbox_sz) :: tile_BT2
real, dimension(bbox_sz, bbox_sz) :: box_BT1, box_BT3

boundary = line_l1b

!$acc parallel loop collapse(2) gang vector                             &
!$acc     private(ist2,js2,px,py,E_t,s_t,E_s,s_s,E_ts,dx,dy,xi,yj,a1,a2) &  
!$acc     private(tile_BT2,box_BT1,box_BT3)                             &  
!$acc     reduction(max:corr1,corr2)                                    &
!$acc     present(BT1,BT2,BT3,lon,lat,CC11,CC22)
do i = 1, amv_n
  do j = 1, amv_n

    ist2 = (i-1)*(sbox_sz/(1+ov_flag)) + 1
    js2 = (j-1)*(sbox_sz/(1+ov_flag)) + 1
    if (ist2 < 1 .or. ist2+sbox_sz-1 > line_l1b) cycle
    if (js2 < 1 .or. js2+sbox_sz-1 > line_l1b) cycle

    corr1 = -1.0
    corr2 = -1.0

    !$acc cache(tile_BT2)
    !$acc loop collapse(2) vector
    do dx = 0, sbox_sz-1
      do dy = 0, sbox_sz-1
        tile_BT2(dx+1,dy+1) = BT2(ist2+dx, js2+dy)
      end do
    end do

    px = ist2 - (bbox_sz - sbox_sz)/(1+ov_flag)
    py = js2 - (bbox_sz - sbox_sz)/(1+ov_flag)
    !$acc cache(box_BT1,box_BT3)
    !$acc loop collapse(2) vector
    do dx = 0, bbox_sz-1
      do dy = 0, bbox_sz-1
        xi = max(1, min(px+dx, line_l1b-sbox_sz+1))
        yj = max(1, min(py+dy, line_l1b-sbox_sz+1))
        box_BT1(dx+1,dy+1) = BT1(xi, yj)
        box_BT3(dx+1,dy+1) = BT3(xi, yj)
      end do
    end do

    E_t = 0.0
    !$acc loop vector reduction(+:E_t)
    do dx = 1, sbox_sz
      do dy = 1, sbox_sz
        E_t = E_t + tile_BT2(dx,dy)
      end do
    end do
    E_t = E_t/(sbox_sz*sbox_sz)

    s_t = 0.0
    !$acc loop vector collapse(2) reduction(+:s_t)
    do dx = 1, sbox_sz
      do dy = 1, sbox_sz
        s_t = s_t + (tile_BT2(dx,dy)-E_t)**2
      end do
    end do
    s_t = sqrt(s_t/(sbox_sz*sbox_sz))

    !$acc loop collapse(2) vector reduction(max:corr1,corr2)
    do x = 1, bbox_sz-sbox_sz+1
      do y = 1, bbox_sz-sbox_sz+1

        ! correlate vs BT1
        E_s = 0.0
        do dx = 0, sbox_sz-1
          do dy = 0, sbox_sz-1
            E_s = E_s + box_BT1(x+dx, y+dy)
          end do
        end do
        E_s = E_s/(sbox_sz*sbox_sz)

        s_s  = 0.0
        E_ts = 0.0
        do dx = 0, sbox_sz-1
          do dy = 0, sbox_sz-1
            a1 = box_BT1(x+dx, y+dy) - E_s
            a2 = tile_BT2(dx+1,dy+1)  - E_t
            s_s  = s_s  + a1*a1
            E_ts = E_ts + a1*a2
          end do
        end do
        s_s  = sqrt(s_s/(sbox_sz*sbox_sz))
        E_ts = E_ts/(sbox_sz*sbox_sz)
        if (s_t>0 .and. s_s>0) corr1 = max(corr1, E_ts/(s_t*s_s))

        ! correlate vs BT3
        E_s = 0.0
        do dx = 0, sbox_sz-1
          do dy = 0, sbox_sz-1
            E_s = E_s + box_BT3(x+dx, y+dy)
          end do
        end do
        E_s = E_s/(sbox_sz*sbox_sz)

        s_s  = 0.0
        E_ts = 0.0
        do dx = 0, sbox_sz-1
          do dy = 0, sbox_sz-1
            a1 = box_BT3(x+dx, y+dy) - E_s
            a2 = tile_BT2(dx+1,dy+1)  - E_t
            s_s  = s_s  + a1*a1
            E_ts = E_ts + a1*a2
          end do
        end do
        s_s  = sqrt(s_s/(sbox_sz*sbox_sz))
        E_ts = E_ts/(sbox_sz*sbox_sz)
        if (s_t>0 .and. s_s>0) corr2 = max(corr2, E_ts/(s_t*s_s))

      end do
    end do

    CC11(1,i,j)=corr1; CC22(1,i,j)=corr2
    CC11(2,i,j)=real(px);  CC22(2,i,j)=real(px)
    CC11(3,i,j)=real(py);  CC22(3,i,j)=real(py)
    CC11(4,i,j)=real(ist2+sbox_sz/2)
    CC22(4,i,j)=real(js2+sbox_sz/2)

  end do
end do
!$acc end parallel

end subroutine

It’s difficult to determine performance issues from just a code snippet, but since I can compile the code, let’s look at the compiler feedback messages:

% nvfortran -c test2.F90 -acc -Minfo=accel
NVFORTRAN-W-0155-Cached array section must have fixed size: tile_bt2 (test2.F90: 38)
NVFORTRAN-W-0155-Cached array section must have fixed size: box_bt3 (test2.F90: 48)
NVFORTRAN-W-0155-Cached array section must have fixed size: box_bt1 (test2.F90: 48)
NVFORTRAN-W-1042-acc loop vector clause ignored because of inner vector loop. (test2.F90: 22)
calculate_vd_acc_tiling:
     22, Generating present(bt1(:,:),bt2(:,:),lat(:,:),lon(:,:),cc22(:,:,:),bt3(:,:),cc11(:,:,:))
         Generating implicit firstprivate(bbox_sz,amv_n,sbox_sz)
         Generating NVIDIA GPU code
         27, !$acc loop gang collapse(2) ! blockidx%x
             Generating reduction(max:corr2,corr1)
         28,   ! blockidx%x collapsed
         40, !$acc loop vector(128) collapse(2) ! threadidx%x
         41,   ! threadidx%x collapsed
         50, !$acc loop vector(128) collapse(2) ! threadidx%x
         51,   ! threadidx%x collapsed
         61, !$acc loop vector(128) ! threadidx%x
             Generating reduction(+:e_t)
         62, !$acc loop seq
         70, !$acc loop vector(128) collapse(2) ! threadidx%x
             Generating reduction(+:s_t)
         71,   ! threadidx%x collapsed
         78, !$acc loop vector(128) collapse(2) ! threadidx%x
             Generating reduction(max:corr2,corr1)
         79,   ! threadidx%x collapsed
         83, !$acc loop seq
         84, !$acc loop seq
             Generating implicit reduction(+:e_s)
         92, !$acc loop seq
         93, !$acc loop seq
             Generating implicit reduction(+:s_s,e_ts)
        106, !$acc loop seq
        107, !$acc loop seq
             Generating implicit reduction(+:e_s)
        115, !$acc loop seq
        116, !$acc loop seq
             Generating implicit reduction(+:s_s,e_ts)
     22, Generating implicit copy(corr2,corr1) [if not already present]
     28, Generating implicit firstprivate(ov_flag,line_l1b)
     40, Loop is parallelizable
     41, Loop is parallelizable
     50, Loop is parallelizable
     51, Loop is parallelizable
     61, Loop is parallelizable
     62, Loop is parallelizable
     70, Loop is parallelizable
     71, Loop is parallelizable
     78, Loop is parallelizable
     79, Loop is parallelizable
     83, Loop is parallelizable
     84, Loop is parallelizable
     92, Loop is parallelizable
     93, Loop is parallelizable
    106, Loop is parallelizable
    107, Loop is parallelizable
    115, Loop is parallelizable
    116, Loop is parallelizable
  0 inform,   4 warnings,   0 severes, 0 fatal for calculate_vd_acc_tiling

As you can see from the warnings, the size of the cached arrays needs to be known at compile time. The problem is that shared memory on the device is a fixed resource and exceeding it causes runtime errors, so the compiler needs to know whether the arrays will fit.
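If you do want to keep the cache directive, one option is to give the tiles compile-time constant bounds, for example via parameters. A sketch only; SBOX and BBOX are assumed placeholder constants (24 and 54), not part of your code:

    integer, parameter :: SBOX = 24, BBOX = 54
    real, dimension(SBOX, SBOX) :: tile_BT2
    real, dimension(BBOX, BBOX) :: box_BT1, box_BT3

    ! ... inside the i/j loops, the fixed-size tile can then be cached:
    !$acc cache(tile_BT2)
    !$acc loop collapse(2) vector
    do dx = 0, SBOX-1
      do dy = 0, SBOX-1
        tile_BT2(dx+1,dy+1) = BT2(ist2+dx, js2+dy)
      end do
    end do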

As a side note, the compiler will implicitly use shared memory when it can for gang-private arrays, so the cache directive typically isn’t needed.

Next, the outer “vector” clause is being ignored since you also use vector parallelization on the inner loops. If you really do want three levels of parallelism, then you’d use the middle “worker” dimension in place of “vector” on the outer loop. I doubt it would help here and more likely slow things down, so this is just an FYI rather than a recommendation.

Where you could employ worker is on the inner loops, which would let you expose the vector parallelism on the dx/dy loops inside the x/y loop (like the one at line 83). That is, put “loop worker collapse(2)” on the x/y loops and “loop vector” on the inner dx/dy loops. I’m not sure it would help, but it’s easy to try; see the sketch below.
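Something like this (a sketch of the directive placement only; the outer i/j loop would then carry just gang, the other clauses stay as in your current version, and the loop bodies are unchanged):

    !$acc parallel loop collapse(2) gang
    do i = 1, amv_n
      do j = 1, amv_n
        ! ... tile setup and E_t/s_t as before ...

        ! search loops: spread across workers
        !$acc loop worker collapse(2) reduction(max:corr1,corr2)
        do x = 1, bbox_sz-sbox_sz+1
          do y = 1, bbox_sz-sbox_sz+1

            ! inner sums: spread across the vector lanes of each worker
            E_s = 0.0
            !$acc loop vector collapse(2) reduction(+:E_s)
            do dx = 0, sbox_sz-1
              do dy = 0, sbox_sz-1
                E_s = E_s + box_BT1(x+dx, y+dy)
              end do
            end do
            E_s = E_s/(sbox_sz*sbox_sz)

            ! ... the remaining dx/dy loops get the same "loop vector" treatment ...
          end do
        end do
      end do
    end do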

My question to you is what are the values for “amv_n”, “sbox_sz” and “bbox_sz”?

If the “box” values are small, then you might be better off not parallelizing these loops and keeping “vector” just on the outermost loops. The reason is that inner-loop reductions have a significant overhead, so if the loops are small, the benefit from the parallelism may not outweigh the reduction overhead.

It’s easy to try: comment out the inner directives and parallelize only the outer i/j loops, as sketched below. The caveat is that you’ll use more memory, since each vector lane will now get a private copy of the temp arrays.
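For the outer-only variant, the directives would look like this (a sketch; the clauses mirror your current version):

    !$acc parallel loop collapse(2) gang vector                              &
    !$acc     private(ist2,js2,px,py,E_t,s_t,E_s,s_s,E_ts,dx,dy,xi,yj,a1,a2) &
    !$acc     private(tile_BT2,box_BT1,box_BT3)                              &
    !$acc     reduction(max:corr1,corr2)                                     &
    !$acc     present(BT1,BT2,BT3,lon,lat,CC11,CC22)
    do i = 1, amv_n
      do j = 1, amv_n
        ! same body as before, but with the inner !$acc loop directives
        ! commented out, so the x/y search and dx/dy sums run sequentially
        ! within each thread
      end do
    end do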

You may also consider profiling the code with Nsight Compute. It does a hardware profile of each kernel and can provide detailed performance information such as achieved occupancy, cache utilization, warp stalls, and other useful metrics.
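For example (the exact flags depend on your Nsight Compute version; treat this as a starting point rather than a definitive invocation):

% ncu --set full -o my_profile ./my_app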

Thank you! amv_n, sbox_sz, and bbox_sz represent the number of tracked blocks in the x and y directions, the size of each tracking block, and the size of the corresponding search region, respectively. Their typical values are 229, 24, and 54 or larger; maybe that’s big? The tracking blocks are evenly spaced and extracted from BT2, and the search for the maximum correlation coefficient is performed within the search region in BT1 and BT3.

24x24 is OK, but not really that big. With a vector_length of 128 (the default), each vector lane does a partial reduction over only 4 or 5 elements (24x24 = 576 elements spread over 128 lanes is about 4.5); then, in the final reduction inserted after the loop, the 128 partial results are combined.

You might try reducing the vector length to 64 to see if it improves, i.e.:

!$acc parallel loop collapse(2) gang vector_length(64)

Then each vector lane reduces 9 elements (576 / 64), and the size of the final reduction is halved.


Thank you for your help!
