Hi everyone,
I’m working on accelerating a Fortran code using OpenACC, and I’ve encountered an issue while trying to compute the maximum correlation coefficient in a four-level nested loop structure. The outer two loops iterate over i and j, while the inner two loops iterate over x and y. The results seem incorrect when parallelizing the code with OpenACC.
Here’s a simplified version of the code:
!$acc data copyin(data1, data2, data3) copyout(result1, result2)
!$acc parallel loop collapse(2) private(sub2, istart2, jstart2)
do i = 1, ni
  do j = 1, nj
    max_cc1 = -1.0
    max_cc2 = -1.0
    istart2 = (i-1) * sbox_size/2 + 1
    jstart2 = (j-1) * sbox_size/2 + 1
    sub2 = data2(istart2:istart2+sbox_size-1, jstart2:jstart2+sbox_size-1)
    do x = 1, bbox_size - sbox_size + 1
      do y = 1, bbox_size - sbox_size + 1
        istart1 = max(1, istart2 - (bbox_size-sbox_size)/2 + x - 1)
        jstart1 = max(1, jstart2 - (bbox_size-sbox_size)/2 + y - 1)
        if (istart1 + sbox_size - 1 <= boundaryi .and. jstart1 + sbox_size - 1 <= boundaryj) then
          sub1 = data1(istart1:istart1+sbox_size-1, jstart1:jstart1+sbox_size-1)
          sub3 = data3(istart1:istart1+sbox_size-1, jstart1:jstart1+sbox_size-1)
          correlation1 = get_matrix_correlation_coef(sbox_size, sbox_size, sub2, sub1)
          correlation2 = get_matrix_correlation_coef(sbox_size, sbox_size, sub2, sub3)
          if (max_cc1 < correlation1) max_cc1 = correlation1
          if (max_cc2 < correlation2) max_cc2 = correlation2
        end if
      end do
    end do
    result1(i, j) = max_cc1
    result2(i, j) = max_cc2
  end do
end do
!$acc end parallel
!$acc end data
Although the code runs without errors, the computed values of max_cc1 and max_cc2 are incorrect when the loops are parallelized using OpenACC. The results don’t match the expected output from the sequential (non-parallel) version of the code.
Here is the compilation output with OpenACC-related details:
compute_correlation:
94, Generating copyin(data1(:,:),data2(:,:),data3(:,:)) [if not already present]
Generating copyout(result1(:,:),result2(:,:)) [if not already present]
95, Generating NVIDIA GPU code
96, !$acc loop gang collapse(2) ! blockidx%x
97, ! blockidx%x collapsed
103, !$acc loop seq
!$acc loop vector(32) ! threadidx%x
105, !$acc loop seq
Generating implicit reduction(max:max_cc2,max_cc1)
106, !$acc loop seq
Generating implicit reduction(max:max_cc2,max_cc1)
112, !$acc loop vector(32) ! threadidx%x
!$acc loop seq
95, Generating implicit copy(sub3(:,:),sub1(:,:)) [if not already present]
97, Generating implicit firstprivate(bbox_size,max_cc1,max_cc2,x,sbox_size)
103, Loop is parallelizable
105, Loop carried dependence of sub1 prevents parallelization
Loop carried backward dependence of sub1 prevents vectorization
Complex loop carried dependence of sub1 prevents parallelization
Loop carried dependence of sub3 prevents parallelization
Loop carried backward dependence of sub3 prevents vectorization
Complex loop carried dependence of sub3,sub1 prevents parallelization
Generating implicit firstprivate(y)
Loop carried dependence of sub3 prevents parallelization
106, Loop carried dependence of sub1 prevents parallelization
Loop carried backward dependence of sub1 prevents vectorization
Loop carried dependence of sub3,sub1 prevents parallelization
Loop carried backward dependence of sub3 prevents vectorization
Generating implicit firstprivate(correlation2,jstart1,istart1,correlation1)
112, Loop is parallelizable
115, Reference argument passing prevents parallelization: sbox_size
116, Reference argument passing prevents parallelization: sbox_size
get_matrix_correlation_coef:
160, Generating implicit acc routine seq
Generating acc routine seq
Generating NVIDIA GPU code
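One detail in this output may be relevant: the line "Generating implicit copy(sub3(:,:),sub1(:,:))" suggests that sub1 and sub3 are treated as shared device arrays, since only sub2 is listed in the private clause. If that is the case, different (i, j) iterations could overwrite each other's sub1 and sub3, which would be consistent with wrong maxima. Below is a minimal, self-contained sketch of the same loop pattern with the scratch array listed in private. The program name, the sizes, and the mean-based score are placeholders chosen for illustration; they stand in for the real data and for get_matrix_correlation_coef and are not taken from the original code.

```fortran
! Minimal sketch with placeholder sizes and a stand-in score function.
! It is not the original code; every name below is hypothetical.
! Builds with or without OpenACC (without it, the directives are comments).
program private_scratch_sketch
  implicit none
  integer, parameter :: ni = 8, nj = 8      ! outer loop extents
  integer, parameter :: sbox = 4            ! scratch window size
  integer, parameter :: nshift = 5          ! shifts tried per (i, j)
  integer, parameter :: n = 16              ! field size
  real :: field(n, n)
  real :: res_acc(ni, nj), res_ref(ni, nj)
  real :: sub1(sbox, sbox)                  ! scratch window
  real :: max_cc, score
  integer :: i, j, x, y, i0, j0

  call random_number(field)

  ! Offloaded version: sub1 is listed in private, so each gang works on
  ! its own copy instead of all iterations sharing one device array.
  !$acc data copyin(field) copyout(res_acc)
  !$acc parallel loop collapse(2) private(sub1)
  do i = 1, ni
    do j = 1, nj
      max_cc = -huge(1.0)
      !$acc loop seq
      do x = 1, nshift
        !$acc loop seq
        do y = 1, nshift
          i0 = i + x - 1
          j0 = j + y - 1
          sub1 = field(i0:i0+sbox-1, j0:j0+sbox-1)
          score = sum(sub1) / real(sbox*sbox)   ! stand-in for the correlation
          max_cc = max(max_cc, score)
        end do
      end do
      res_acc(i, j) = max_cc
    end do
  end do
  !$acc end data

  ! Sequential reference computed on the host with the same logic.
  do i = 1, ni
    do j = 1, nj
      max_cc = -huge(1.0)
      do x = 1, nshift
        do y = 1, nshift
          i0 = i + x - 1
          j0 = j + y - 1
          sub1 = field(i0:i0+sbox-1, j0:j0+sbox-1)
          score = sum(sub1) / real(sbox*sbox)
          max_cc = max(max_cc, score)
        end do
      end do
      res_ref(i, j) = max_cc
    end do
  end do

  ! Small tolerance: the summation order on the device may differ.
  print *, 'max abs difference:', maxval(abs(res_acc - res_ref))
  if (maxval(abs(res_acc - res_ref)) > 1.0e-5) error stop 'results differ'
end program private_scratch_sketch
```

Applied to the loop in question, the analogous change would be private(sub1, sub2, sub3, istart2, jstart2) on the parallel loop directive. Whether that alone resolves the mismatch has not been verified here; comparing result1 and result2 against the sequential run, as the sketch does with res_acc and res_ref, would confirm it.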