Hi everyone,
I’m using OpenACC to compute blockwise correlation coefficients between three input arrays (BT1
, BT2
, BT3
) and store the maximum correlation and its offset position for each output pixel. Here’s the relevant Fortran/OpenACC snippet:
!$acc data copyin(BT1,BT2,BT3) copyout(CC11,CC22)
!$acc parallel loop collapse(2) gang vector &
!$acc private(ist2,js2,px,py,E_t,s_t,E_s,s_s,E_ts) &
!$acc reduction(max:corr1,corr2)
do i = 1, amv_n
do j = 1, amv_n
ist2 = (i-1)*(sbox_sz/(1+ov_flag)) + 1
js2 = (j-1)*(sbox_sz/(1+ov_flag)) + 1
if (ist2<1 .or. ist2+sbox_sz-1>line_l1b) cycle
if (js2<1 .or. js2+sbox_sz-1>line_l1b) cycle
if (BT2(ist2+sbox_sz/2, js2+sbox_sz/2)<-500.0) cycle
E_t = 0.0
do dx=0,sbox_sz-1; do dy=0,sbox_sz-1
E_t = E_t + BT2(ist2+dx, js2+dy)
end do; end do
E_t = E_t/(sbox_sz*sbox_sz)
s_t = 0.0
do dx=0,sbox_sz-1; do dy=0,sbox_sz-1
s_t = s_t + (BT2(ist2+dx, js2+dy)-E_t)**2
end do; end do
s_t = sqrt(s_t/(sbox_sz*sbox_sz))
corr1 = -1.0
corr2 = -1.0
!$acc loop collapse(2) vector private(tx1,ty1, tx2,ty2)
do x = 0, bbox_sz-sbox_sz
do y = 0, bbox_sz-sbox_sz
px = ist2 + x - (bbox_sz-sbox_sz)/(1+ov_flag)
py = js2 + y - (bbox_sz-sbox_sz)/(1+ov_flag)
if (px<1 .or. py<1) cycle
if (px+sbox_sz-1>line_l1b .or. py+sbox_sz-1>line_l1b) cycle
E_s = 0.0
!$acc loop reduction(+:E_s)
do dx=0,sbox_sz-1; do dy=0,sbox_sz-1
E_s = E_s + BT1(px+dx,py+dy)
end do; end do
E_s = E_s/(sbox_sz*sbox_sz)
s_s = 0.0
E_ts = 0.0
!$acc loop collapse(2) reduction(+:s_s,E_ts)
do dx=0,sbox_sz-1; do dy=0,sbox_sz-1
s_s = s_s + (BT1(px+dx,py+dy)-E_s)**2
E_ts = E_ts + (BT2(ist2+dx,js2+dy)-E_t)*(BT1(px+dx,py+dy)-E_s)
end do; end do
s_s = sqrt(s_s/(sbox_sz*sbox_sz))
E_ts = E_ts/(sbox_sz*sbox_sz)
if (s_t>0.0 .and. s_s>0.0) then
corr1 = max(corr1, E_ts/(s_t*s_s))
tx1=x; ty1=y
end if
!================================
E_s = 0.0
!$acc loop reduction(+:E_s)
do dx=0,sbox_sz-1; do dy=0,sbox_sz-1
E_s = E_s + BT3(px+dx,py+dy)
end do; end do
E_s = E_s/(sbox_sz*sbox_sz)
s_s = 0.0
E_ts = 0.0
!$acc loop collapse(2) reduction(+:s_s,E_ts)
do dx=0,sbox_sz-1; do dy=0,sbox_sz-1
s_s = s_s + (BT3(px+dx,py+dy)-E_s)**2
E_ts = E_ts + (BT2(ist2+dx,js2+dy)-E_t)*(BT3(px+dx,py+dy)-E_s)
end do; end do
s_s = sqrt(s_s/(sbox_sz*sbox_sz))
E_ts = E_ts/(sbox_sz*sbox_sz)
if (s_t>0.0 .and. s_s>0.0) then
corr2 = max(corr2, E_ts/(s_t*s_s))
tx2 = x;ty2 = y
end if
end do
end do
if (i==amv_n/4 .and. j==amv_n/4) then
print *, 'DEBUG device: corr1=',corr1,' corr2=',corr2
end if
CC11(1,i,j)=corr1; CC22(1,i,j)=corr2
CC11(2,i,j)=tx1; CC22(2,i,j)=ty1
CC11(3,i,j)=tx2; CC22(3,i,j)=ty2
end do
end do
!$acc end parallel
!$acc end data
I’m using reduction(max:corr1,corr2)
in nested OpenACC loops to compute maximum correlation values, and while corr1
/corr2
are correct, the associated index variables tx1
/ty1
remain zero on the host because the built-in max reduction only tracks values, not positions. How can I capture the argmax coordinates (x,y)
within these loops—must I resort to a two-pass scan or atomic capture
, or is there an OpenACC pattern that handles an index-aware reduction in one go?To illustrate the problem more clearly, I’ve uploaded the minimal test code and sample data to GitHub - random01203/maxcc_loc_test: this code is used for max correlation location test .please feel free to clone the repo and reproduce the issue.