Problem with Finding Maximum in Four-Level Nested Loops in OpenACC

Hi everyone,

I’m working on accelerating a Fortran code using OpenACC, and I’ve encountered an issue while trying to compute the maximum correlation coefficient in a four-level nested loop structure. The outer two loops iterate over i and j, while the inner two loops iterate over x and y. The results seem incorrect when parallelizing the code with OpenACC.

Here’s a simplified version of the code:

!$acc data copyin(data1, data2, data3) copyout(result1, result2)
!$acc parallel loop collapse(2) private(sub2, istart2, jstart2)
do i = 1, ni
    do j = 1, nj
        max_cc1 = -1.0
        max_cc2 = -1.0
        istart2 = (i-1) * sbox_size/2 + 1
        jstart2 = (j-1) * sbox_size/2 + 1

        sub2 = data2(istart2:istart2+sbox_size-1, jstart2:jstart2+sbox_size-1)

        do x = 1, bbox_size - sbox_size + 1
            do y = 1, bbox_size - sbox_size + 1
                istart1 = max(1, istart2 - (bbox_size-sbox_size)/2 + x - 1)
                jstart1 = max(1, jstart2 - (bbox_size-sbox_size)/2 + y - 1)

                if (istart1 + sbox_size - 1 <= boundaryi .and. jstart1 + sbox_size - 1 <= boundaryj) then
                    sub1 = data1(istart1:istart1+sbox_size-1, jstart1:jstart1+sbox_size-1)
                    sub3 = data3(istart1:istart1+sbox_size-1, jstart1:jstart1+sbox_size-1)

                    correlation1 = get_matrix_correlation_coef(sbox_size, sbox_size, sub2, sub1)
                    correlation2 = get_matrix_correlation_coef(sbox_size, sbox_size, sub2, sub3)

                    if (max_cc1 < correlation1) max_cc1 = correlation1
                    if (max_cc2 < correlation2) max_cc2 = correlation2
                end if
            end do
        end do

        result1(i, j) = max_cc1
        result2(i, j) = max_cc2
    end do
end do
!$acc end parallel
!$acc end data

Although the code runs without errors, the computed values of max_cc1 and max_cc2 are incorrect when the loops are parallelized using OpenACC. The results don’t match the expected output from the sequential (non-parallel) version of the code.

Here is the compilation output with OpenACC-related details:

compute_correlation:
     94, Generating copyin(data1(:,:),data2(:,:),data3(:,:)) [if not already present]
         Generating copyout(result1(:,:),result2(:,:)) [if not already present]
     95, Generating NVIDIA GPU code
         96, !$acc loop gang collapse(2) ! blockidx%x
         97,   ! blockidx%x collapsed
        103, !$acc loop seq
             !$acc loop vector(32) ! threadidx%x
        105, !$acc loop seq
             Generating implicit reduction(max:max_cc2,max_cc1)
        106, !$acc loop seq
             Generating implicit reduction(max:max_cc2,max_cc1)
        112, !$acc loop vector(32) ! threadidx%x
             !$acc loop seq
     95, Generating implicit copy(sub3(:,:),sub1(:,:)) [if not already present]
     97, Generating implicit firstprivate(bbox_size,max_cc1,max_cc2,x,sbox_size)
    103, Loop is parallelizable
    105, Loop carried dependence of sub1 prevents parallelization
         Loop carried backward dependence of sub1 prevents vectorization
         Complex loop carried dependence of sub1 prevents parallelization
         Loop carried dependence of sub3 prevents parallelization
         Loop carried backward dependence of sub3 prevents vectorization
         Complex loop carried dependence of sub3,sub1 prevents parallelization
         Generating implicit firstprivate(y)
         Loop carried dependence of sub3 prevents parallelization
    106, Loop carried dependence of sub1 prevents parallelization
         Loop carried backward dependence of sub1 prevents vectorization
         Loop carried dependence of sub3,sub1 prevents parallelization
         Loop carried backward dependence of sub3 prevents vectorization
         Generating implicit firstprivate(correlation2,jstart1,istart1,correlation1)
    112, Loop is parallelizable
    115, Reference argument passing prevents parallelization: sbox_size
    116, Reference argument passing prevents parallelization: sbox_size
get_matrix_correlation_coef:
    160, Generating implicit acc routine seq
         Generating acc routine seq
         Generating NVIDIA GPU code

Hi and welcome!

This is likely a race condition: arrays are shared by default, so “sub1” and “sub3” need to be made private to the inner loops.

Try something like the following:

!$acc parallel loop collapse(2) private(sub2, istart2, jstart2) firstprivate(sbox_size)
do i = 1, ni
    do j = 1, nj
        max_cc1 = -1.0
... cut ...
        sub2 = data2(istart2:istart2+sbox_size-1, jstart2:jstart2+sbox_size-1)

!$acc loop collapse(2) private(sub1,sub3) reduction(max:max_cc1,max_cc2)
        do x = 1, bbox_size - sbox_size + 1
            do y = 1, bbox_size - sbox_size + 1

Inner loop reductions can be costly performance-wise, so depending on the loop trip counts, you might also want to compare the performance against parallelizing only the outer loops:

!$acc parallel loop gang vector collapse(2) private(sub1, sub2, sub3, istart2, jstart2) firstprivate(sbox_size)
do i = 1, ni
    do j = 1, nj
        max_cc1 = -1.0
... cut ...
        sub2 = data2(istart2:istart2+sbox_size-1, jstart2:jstart2+sbox_size-1)

        do x = 1, bbox_size - sbox_size + 1
            do y = 1, bbox_size - sbox_size + 1

Also note the following compiler feedback message:

115, Reference argument passing prevents parallelization: sbox_size

The problem here is that in Fortran, arguments are passed by reference by default. This can cause issues because the compiler can’t tell whether the argument’s address gets taken inside the subroutine, so it must assume it does, which causes the scalar to get globalized. It’s not a correctness issue here, since “sbox_size” is never assigned to and using the shared global value is fine, but I still prefer to fix these. There are two ways: use “firstprivate(sbox_size)” as I did above, or, better yet, add the “value” attribute to the “sbox_size” declaration in “get_matrix_correlation_coef” so it’s passed by value.
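For example, the declaration in “get_matrix_correlation_coef” could look something like the sketch below (the argument names are placeholders; your actual routine may differ, and the body is elided):

```fortran
!$acc routine seq
function get_matrix_correlation_coef(n, m, a, b) result(cc)
    implicit none
    ! "value" makes n and m pass-by-value, so the compiler knows
    ! their addresses can't be taken and won't globalize them.
    integer, value     :: n, m
    real, intent(in)   :: a(n, m), b(n, m)
    real               :: cc
    ! ... correlation computation unchanged ...
end function get_matrix_correlation_coef
```

With the “value” attribute in place, the “Reference argument passing prevents parallelization” feedback messages should go away without needing “firstprivate(sbox_size)” on the loop directive.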

Hope this helps,
Mat
