How to capture positions within nested loops?

Hi everyone,

I’m using OpenACC to compute blockwise correlation coefficients between three input arrays (BT1, BT2, BT3) and store the maximum correlation and its offset position for each output pixel. Here’s the relevant Fortran/OpenACC snippet:

!$acc data copyin(BT1,BT2,BT3) copyout(CC11,CC22)
!$acc parallel loop collapse(2) gang vector                     &
!$acc    private(ist2,js2,px,py,E_t,s_t,E_s,s_s,E_ts)             &
!$acc    reduction(max:corr1,corr2)
 do i = 1, amv_n
   do j = 1, amv_n

    ist2 = (i-1)*(sbox_sz/(1+ov_flag)) + 1
    js2 = (j-1)*(sbox_sz/(1+ov_flag)) + 1
    if (ist2<1 .or. ist2+sbox_sz-1>line_l1b) cycle
    if (js2<1 .or. js2+sbox_sz-1>line_l1b) cycle
    if (BT2(ist2+sbox_sz/2, js2+sbox_sz/2)<-500.0) cycle

    E_t = 0.0
    do dx=0,sbox_sz-1; do dy=0,sbox_sz-1
      E_t = E_t + BT2(ist2+dx, js2+dy)
    end do; end do
    E_t = E_t/(sbox_sz*sbox_sz)

    s_t = 0.0
    do dx=0,sbox_sz-1; do dy=0,sbox_sz-1
      s_t = s_t + (BT2(ist2+dx, js2+dy)-E_t)**2
    end do; end do
    s_t = sqrt(s_t/(sbox_sz*sbox_sz))


    corr1 = -1.0
    corr2 = -1.0


    !$acc loop collapse(2) vector private(tx1,ty1, tx2,ty2)
    do x = 0, bbox_sz-sbox_sz
      do y = 0, bbox_sz-sbox_sz
        px = ist2 + x - (bbox_sz-sbox_sz)/(1+ov_flag)
        py = js2 + y - (bbox_sz-sbox_sz)/(1+ov_flag)
        if (px<1 .or. py<1) cycle
        if (px+sbox_sz-1>line_l1b .or. py+sbox_sz-1>line_l1b) cycle

        E_s = 0.0
        !$acc loop reduction(+:E_s)
        do dx=0,sbox_sz-1; do dy=0,sbox_sz-1
          E_s = E_s + BT1(px+dx,py+dy)
        end do; end do
        E_s = E_s/(sbox_sz*sbox_sz)

        s_s  = 0.0
        E_ts = 0.0
        !$acc loop collapse(2) reduction(+:s_s,E_ts)
        do dx=0,sbox_sz-1; do dy=0,sbox_sz-1
          s_s  = s_s  + (BT1(px+dx,py+dy)-E_s)**2
          E_ts = E_ts + (BT2(ist2+dx,js2+dy)-E_t)*(BT1(px+dx,py+dy)-E_s)
        end do; end do
        s_s = sqrt(s_s/(sbox_sz*sbox_sz))
        E_ts = E_ts/(sbox_sz*sbox_sz)

        if (s_t>0.0 .and. s_s>0.0) then
          corr1 = max(corr1, E_ts/(s_t*s_s))
          tx1=x; ty1=y
        end if
        !================================
        E_s  = 0.0
        !$acc loop reduction(+:E_s)
        do dx=0,sbox_sz-1; do dy=0,sbox_sz-1
          E_s = E_s + BT3(px+dx,py+dy)
        end do; end do
        E_s = E_s/(sbox_sz*sbox_sz)

        s_s  = 0.0
        E_ts = 0.0
        !$acc loop collapse(2) reduction(+:s_s,E_ts)
        do dx=0,sbox_sz-1; do dy=0,sbox_sz-1
          s_s  = s_s  + (BT3(px+dx,py+dy)-E_s)**2
          E_ts = E_ts + (BT2(ist2+dx,js2+dy)-E_t)*(BT3(px+dx,py+dy)-E_s)
        end do; end do
        s_s  = sqrt(s_s/(sbox_sz*sbox_sz))
        E_ts = E_ts/(sbox_sz*sbox_sz)

        if (s_t>0.0 .and. s_s>0.0) then
          corr2 = max(corr2, E_ts/(s_t*s_s))
          tx2 = x;ty2 = y
        end if

      end do
    end do

    if (i==amv_n/4 .and. j==amv_n/4) then
      print *, 'DEBUG device: corr1=',corr1,' corr2=',corr2
    end if

    CC11(1,i,j)=corr1; CC22(1,i,j)=corr2
    CC11(2,i,j)=tx1; CC22(2,i,j)=ty1
    CC11(3,i,j)=tx2; CC22(3,i,j)=ty2


  end do
end do
!$acc end parallel
!$acc end data

I’m using reduction(max:corr1,corr2) in these nested OpenACC loops to compute the maximum correlation values. While corr1/corr2 come back correct, the associated index variables tx1/ty1 remain zero on the host, because the built-in max reduction only tracks values, not positions. How can I capture the argmax coordinates (x,y) within these loops? Must I resort to a two-pass scan or an atomic capture, or is there an OpenACC pattern that handles an index-aware reduction in one go? To illustrate the problem more clearly, I’ve uploaded a minimal test code and sample data to GitHub - random01203/maxcc_loc_test (this code is used for the max correlation location test). Please feel free to clone the repo and reproduce the issue.

Unfortunately, there’s no way to capture this in a single pass. Each thread is working on a partial reduction and then the compiler needs to insert a second kernel call later to obtain the final reduction. It can’t track which location the max is coming from.

Instead, you need to perform a second pass to find the location.
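Roughly along these lines (a minimal sketch of the two-pass pattern for one of the correlations; block_corr, loc1, and the tolerance are my own placeholders, not part of your code, and block_corr stands for whatever recomputes the normalized correlation for offset (x,y)):

    ! Pass 1: value-only max reduction, as in the original code.
    corr1 = -1.0
    !$acc loop collapse(2) vector reduction(max:corr1)
    do x = 0, bbox_sz-sbox_sz
      do y = 0, bbox_sz-sbox_sz
        corr1 = max(corr1, block_corr(x, y))   ! block_corr is hypothetical
      end do
    end do

    ! Pass 2: rescan and record an offset whose correlation matches the max.
    ! Encoding (x,y) as one integer lets a plain max reduction pick a location;
    ! if several offsets tie, the largest encoded index wins.
    loc1 = -1
    !$acc loop collapse(2) vector reduction(max:loc1)
    do x = 0, bbox_sz-sbox_sz
      do y = 0, bbox_sz-sbox_sz
        if (block_corr(x, y) >= corr1 - 1.0e-6) then
          loc1 = max(loc1, x*(bbox_sz-sbox_sz+1) + y)
        end if
      end do
    end do
    tx1 = loc1 / (bbox_sz-sbox_sz+1)
    ty1 = mod(loc1, bbox_sz-sbox_sz+1)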

Thank you very much for your reply!

Hello! I have another question I’d like to ask you. My current code takes quite a long time to run. I tried using the cache directive to reduce redundant reads and speed up the computation, but the runtime increased instead. Could you take a look and tell me what might be wrong with my approach? Is there any part of the code that could be causing such a significant slowdown? Thank you again for your help!

subroutine calculate_VD_acc_tiling(BT1, BT2, BT3, lon, lat,         &
                                 sbox_sz, bbox_sz, ov_flag,       &
                                 amv_n, line_l1b, dt,             &
                                 CC11, CC22)
implicit none
real, dimension(:,:), intent(in)        :: BT1, BT2, BT3, lon, lat
integer, intent(in)                     :: sbox_sz, bbox_sz, amv_n, line_l1b
integer(kind=1), intent(in)             :: ov_flag
real, intent(in)                        :: dt
real, dimension(4, amv_n, amv_n), intent(out) :: CC11, CC22

integer :: i, j, x, y, ist2, js2, px, py, boundary
real    :: E_t, s_t, corr1, corr2, E_s, s_s, E_ts
integer :: dx, dy, xi, yj
real    :: a1, a2

real, dimension(sbox_sz, sbox_sz) :: tile_BT2
real, dimension(bbox_sz, bbox_sz) :: box_BT1, box_BT3

boundary = line_l1b

!$acc parallel loop collapse(2) gang vector                             &
!$acc     private(ist2,js2,px,py,E_t,s_t,E_s,s_s,E_ts,dx,dy,xi,yj,a1,a2) &  
!$acc     private(tile_BT2,box_BT1,box_BT3)                             &  
!$acc     reduction(max:corr1,corr2)                                    &
!$acc     present(BT1,BT2,BT3,lon,lat,CC11,CC22)
do i = 1, amv_n
  do j = 1, amv_n

    ist2 = (i-1)*(sbox_sz/(1+ov_flag)) + 1
    js2 = (j-1)*(sbox_sz/(1+ov_flag)) + 1
    if (ist2 < 1 .or. ist2+sbox_sz-1 > line_l1b) cycle
    if (js2 < 1 .or. js2+sbox_sz-1 > line_l1b) cycle

    corr1 = -1.0
    corr2 = -1.0

    !$acc cache(tile_BT2)
    !$acc loop collapse(2) vector
    do dx = 0, sbox_sz-1
      do dy = 0, sbox_sz-1
        tile_BT2(dx+1,dy+1) = BT2(ist2+dx, js2+dy)
      end do
    end do

    px = ist2 - (bbox_sz - sbox_sz)/(1+ov_flag)
    py = js2 - (bbox_sz - sbox_sz)/(1+ov_flag)
    !$acc cache(box_BT1,box_BT3)
    !$acc loop collapse(2) vector
    do dx = 0, bbox_sz-1
      do dy = 0, bbox_sz-1
        xi = max(1, min(px+dx, line_l1b-sbox_sz+1))
        yj = max(1, min(py+dy, line_l1b-sbox_sz+1))
        box_BT1(dx+1,dy+1) = BT1(xi, yj)
        box_BT3(dx+1,dy+1) = BT3(xi, yj)
      end do
    end do

    E_t = 0.0
    !$acc loop vector reduction(+:E_t)
    do dx = 1, sbox_sz
      do dy = 1, sbox_sz
        E_t = E_t + tile_BT2(dx,dy)
      end do
    end do
    E_t = E_t/(sbox_sz*sbox_sz)

    s_t = 0.0
    !$acc loop vector collapse(2) reduction(+:s_t)
    do dx = 1, sbox_sz
      do dy = 1, sbox_sz
        s_t = s_t + (tile_BT2(dx,dy)-E_t)**2
      end do
    end do
    s_t = sqrt(s_t/(sbox_sz*sbox_sz))

    !$acc loop collapse(2) vector reduction(max:corr1,corr2)
    do x = 1, bbox_sz-sbox_sz+1
      do y = 1, bbox_sz-sbox_sz+1

        ! correlate vs BT1
        E_s = 0.0
        do dx = 0, sbox_sz-1
          do dy = 0, sbox_sz-1
            E_s = E_s + box_BT1(x+dx, y+dy)
          end do
        end do
        E_s = E_s/(sbox_sz*sbox_sz)

        s_s  = 0.0
        E_ts = 0.0
        do dx = 0, sbox_sz-1
          do dy = 0, sbox_sz-1
            a1 = box_BT1(x+dx, y+dy) - E_s
            a2 = tile_BT2(dx+1,dy+1)  - E_t
            s_s  = s_s  + a1*a1
            E_ts = E_ts + a1*a2
          end do
        end do
        s_s  = sqrt(s_s/(sbox_sz*sbox_sz))
        E_ts = E_ts/(sbox_sz*sbox_sz)
        if (s_t>0 .and. s_s>0) corr1 = max(corr1, E_ts/(s_t*s_s))

        ! correlate vs BT3
        E_s = 0.0
        do dx = 0, sbox_sz-1
          do dy = 0, sbox_sz-1
            E_s = E_s + box_BT3(x+dx, y+dy)
          end do
        end do
        E_s = E_s/(sbox_sz*sbox_sz)

        s_s  = 0.0
        E_ts = 0.0
        do dx = 0, sbox_sz-1
          do dy = 0, sbox_sz-1
            a1 = box_BT3(x+dx, y+dy) - E_s
            a2 = tile_BT2(dx+1,dy+1)  - E_t
            s_s  = s_s  + a1*a1
            E_ts = E_ts + a1*a2
          end do
        end do
        s_s  = sqrt(s_s/(sbox_sz*sbox_sz))
        E_ts = E_ts/(sbox_sz*sbox_sz)
        if (s_t>0 .and. s_s>0) corr2 = max(corr2, E_ts/(s_t*s_s))

      end do
    end do

    CC11(1,i,j)=corr1; CC22(1,i,j)=corr2
    CC11(2,i,j)=real(px);  CC22(2,i,j)=real(px)
    CC11(3,i,j)=real(py);  CC22(3,i,j)=real(py)
    CC11(4,i,j)=real(ist2+sbox_sz/2)
    CC22(4,i,j)=real(js2+sbox_sz/2)

  end do
end do
!$acc end parallel

end subroutine

It’s difficult to determine performance issues from just a code snippet, but since I can compile the code, let’s look at the compiler feedback messages:

% nvfortran -c test2.F90 -acc -Minfo=accel
NVFORTRAN-W-0155-Cached array section must have fixed size: tile_bt2 (test2.F90: 38)
NVFORTRAN-W-0155-Cached array section must have fixed size: box_bt3 (test2.F90: 48)
NVFORTRAN-W-0155-Cached array section must have fixed size: box_bt1 (test2.F90: 48)
NVFORTRAN-W-1042-acc loop vector clause ignored because of inner vector loop. (test2.F90: 22)
calculate_vd_acc_tiling:
     22, Generating present(bt1(:,:),bt2(:,:),lat(:,:),lon(:,:),cc22(:,:,:),bt3(:,:),cc11(:,:,:))
         Generating implicit firstprivate(bbox_sz,amv_n,sbox_sz)
         Generating NVIDIA GPU code
         27, !$acc loop gang collapse(2) ! blockidx%x
             Generating reduction(max:corr2,corr1)
         28,   ! blockidx%x collapsed
         40, !$acc loop vector(128) collapse(2) ! threadidx%x
         41,   ! threadidx%x collapsed
         50, !$acc loop vector(128) collapse(2) ! threadidx%x
         51,   ! threadidx%x collapsed
         61, !$acc loop vector(128) ! threadidx%x
             Generating reduction(+:e_t)
         62, !$acc loop seq
         70, !$acc loop vector(128) collapse(2) ! threadidx%x
             Generating reduction(+:s_t)
         71,   ! threadidx%x collapsed
         78, !$acc loop vector(128) collapse(2) ! threadidx%x
             Generating reduction(max:corr2,corr1)
         79,   ! threadidx%x collapsed
         83, !$acc loop seq
         84, !$acc loop seq
             Generating implicit reduction(+:e_s)
         92, !$acc loop seq
         93, !$acc loop seq
             Generating implicit reduction(+:s_s,e_ts)
        106, !$acc loop seq
        107, !$acc loop seq
             Generating implicit reduction(+:e_s)
        115, !$acc loop seq
        116, !$acc loop seq
             Generating implicit reduction(+:s_s,e_ts)
     22, Generating implicit copy(corr2,corr1) [if not already present]
     28, Generating implicit firstprivate(ov_flag,line_l1b)
     40, Loop is parallelizable
     41, Loop is parallelizable
     50, Loop is parallelizable
     51, Loop is parallelizable
     61, Loop is parallelizable
     62, Loop is parallelizable
     70, Loop is parallelizable
     71, Loop is parallelizable
     78, Loop is parallelizable
     79, Loop is parallelizable
     83, Loop is parallelizable
     84, Loop is parallelizable
     92, Loop is parallelizable
     93, Loop is parallelizable
    106, Loop is parallelizable
    107, Loop is parallelizable
    115, Loop is parallelizable
    116, Loop is parallelizable
  0 inform,   4 warnings,   0 severes, 0 fatal for calculate_vd_acc_tiling

As you can see from the warnings, the size of the cached arrays needs to be known at compile time. The problem is that shared memory on the device is a fixed resource and exceeding it causes runtime errors, so the compiler needs to know whether the arrays will fit.
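If you do want to keep the cache directive, one option is to give the tiles compile-time constant bounds, for example via parameters. A sketch only; SBOX and BBOX are assumed placeholder constants (24 and 54), not part of your code:

    integer, parameter :: SBOX = 24, BBOX = 54
    real, dimension(SBOX, SBOX) :: tile_BT2
    real, dimension(BBOX, BBOX) :: box_BT1, box_BT3

    ! ... inside the i/j loops, the fixed-size tile can then be cached:
    !$acc cache(tile_BT2)
    !$acc loop collapse(2) vector
    do dx = 0, SBOX-1
      do dy = 0, SBOX-1
        tile_BT2(dx+1,dy+1) = BT2(ist2+dx, js2+dy)
      end do
    end do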

As a side note, the compiler will implicitly use shared memory when it can for gang-private arrays, so the cache directive typically isn’t needed.

Next, the outer “vector” clause is being ignored since you also use vector parallelization on the inner loops. If you really do want three levels of parallelism, then you’d use the middle “worker” dimension in place of “vector” on the outer loop. I doubt it would help here and more likely slow things down, so this is just an FYI rather than a recommendation.

Where you could employ worker is on the inner loops, which would let you expose the vector parallelism on the dx/dy loops inside the x/y loop (like the one at line 83). That is, put “loop worker collapse(2)” on the x/y loops and “loop vector” on the inner dx/dy loops. I’m not sure it would help, but it’s easy to try; see the sketch below.
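Something like this (a sketch of the directive placement only; the outer i/j loop would then carry just gang, the other clauses stay as in your current version, and the loop bodies are unchanged):

    !$acc parallel loop collapse(2) gang
    do i = 1, amv_n
      do j = 1, amv_n
        ! ... tile setup and E_t/s_t as before ...

        ! search loops: spread across workers
        !$acc loop worker collapse(2) reduction(max:corr1,corr2)
        do x = 1, bbox_sz-sbox_sz+1
          do y = 1, bbox_sz-sbox_sz+1

            ! inner sums: spread across the vector lanes of each worker
            E_s = 0.0
            !$acc loop vector collapse(2) reduction(+:E_s)
            do dx = 0, sbox_sz-1
              do dy = 0, sbox_sz-1
                E_s = E_s + box_BT1(x+dx, y+dy)
              end do
            end do
            E_s = E_s/(sbox_sz*sbox_sz)

            ! ... the remaining dx/dy loops get the same "loop vector" treatment ...
          end do
        end do
      end do
    end do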

My question to you is what are the values for “amv_n”, “sbox_sz” and “bbox_sz”?

If the “box” values are small, then you might be better off not parallelizing these loops and keeping “vector” just on the outermost loops. The reason is that inner-loop reductions have a significant overhead, so if the loops are small, the benefit from the parallelism may not outweigh the reduction overhead.

It’s easy to try: comment out the inner directives and parallelize only the outer i/j loops, as sketched below. The caveat is that you’ll use more memory, since each vector lane will now get a private copy of the temp arrays.
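For the outer-only variant, the directives would look like this (a sketch; the clauses mirror your current version):

    !$acc parallel loop collapse(2) gang vector                              &
    !$acc     private(ist2,js2,px,py,E_t,s_t,E_s,s_s,E_ts,dx,dy,xi,yj,a1,a2) &
    !$acc     private(tile_BT2,box_BT1,box_BT3)                              &
    !$acc     reduction(max:corr1,corr2)                                     &
    !$acc     present(BT1,BT2,BT3,lon,lat,CC11,CC22)
    do i = 1, amv_n
      do j = 1, amv_n
        ! same body as before, but with the inner !$acc loop directives
        ! commented out, so the x/y search and dx/dy sums run sequentially
        ! within each thread
      end do
    end do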

You may also consider profiling the code with Nsight Compute. It does a hardware profile of each kernel and can provide detailed performance information such as achieved occupancy, cache utilization, warp stalls, and other useful metrics.
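For example (the exact flags depend on your Nsight Compute version; treat this as a starting point rather than a definitive invocation):

% ncu --set full -o my_profile ./my_app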

Thank you! amv_n, sbox_sz, and bbox_sz represent the number of tracked blocks in the x and y directions, the size of each tracking block, and the size of the corresponding search region, respectively. Their typical values are 229, 24, and 54 or larger; maybe that’s big? The tracking blocks are evenly spaced and extracted from BT2, and the search for the maximum correlation coefficient is performed within the search region in BT1 and BT3.

24x24 is OK, but not really that big. With a vector_length of 128 (the default), each vector lane does a partial reduction over only 4 or 5 elements (24x24 = 576 elements spread over 128 lanes is about 4.5); then, in the final reduction inserted after the loop, the 128 partial results are combined.

You might try reducing the vector length to 64 to see if it improves, i.e.:

!$acc parallel loop collapse(2) gang vector_length(64)

Then each vector lane reduces 9 elements (576 / 64), and the size of the final reduction is halved.


Thank you for your help!
