OpenACC loop / OpenMP target teams loop optimization error

Hello!

I have a loop that looks as follows:

        !$acc parallel loop collapse(2) &
        !$acc   present(out, G, G%scale) &
        !$acc   private(du, ddu, err, deriv, uadj, f, df, live, k, itt, i, j)
        do j = G%jsc, G%jec
            do i = G%isc-1, G%iec
                du = 0.0_dp; err = 1.0_dp; deriv = 1.0e6_dp
                do itt = 1, max_itt
                    ddu = -err / deriv
                    du = du + ddu 
                    if (abs(ddu) < 1.0e-15_dp * abs(du)) exit
                    err = 0.0_dp; deriv = 0.0_dp
                    do k = 1, nk
                        uadj = 0.01_dp + du
                        ! PATTERN: live set in if/else, consumed after via G%scale
                        if (uadj > 0.0_dp) then
                            f    = G%scale(i,j) * uadj * 10.0_dp
                            live = 10.0_dp       ! set in if-branch
                        else
                            f    = -G%scale(i,j) * uadj * 10.0_dp
                            live = -10.0_dp      ! set in else-branch
                        end if
                        df    = G%scale(i,j) * live   ! live consumed after if/else
                        err   = err + f 
                        deriv = deriv + df
                    end do
                end do
                out(i,j) = du
            end do
        end do

This pattern fails with:

Accelerator Fatal Error: call to cuStreamSynchronize returned error 700 (CUDA_ERROR_ILLEGAL_ADDRESS): Illegal address during kernel execution

However, this only happens when compiling the code with -O2 or higher. It happens with both acc parallel loop and omp target teams loop, but not with omp target teams distribute parallel do. I have a small reproducer in the following repo. I have tested multiple nvhpc versions from 25.5 to 26.1, and they all show this error.
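For readers unfamiliar with the two OpenMP forms mentioned above, here is a minimal sketch of how they differ at the directive level. It is not the reproducer: the array names (gscale, out_a, out_b), the size n, the map clauses, and the trivial loop body are all assumptions made for illustration, and this simplified body is not expected to trigger the error by itself.

```fortran
program omp_loop_variants
    ! Sketch only: names, sizes and the trivial loop body are assumptions,
    ! not taken from the reproducer.
    implicit none
    integer, parameter :: dp = kind(1.0d0)
    integer, parameter :: n = 8
    real(dp) :: gscale(n,n), out_a(n,n), out_b(n,n)
    integer :: i, j

    gscale = 2.0_dp
    out_a = 0.0_dp
    out_b = 0.0_dp

    ! Directive form reported above to hit the illegal-address error at -O2
    !$omp target teams loop collapse(2) map(to: gscale) map(from: out_a)
    do j = 1, n
        do i = 1, n
            out_a(i,j) = gscale(i,j) * 10.0_dp
        end do
    end do

    ! Directive form reported above to run correctly
    !$omp target teams distribute parallel do collapse(2) &
    !$omp   map(to: gscale) map(from: out_b)
    do j = 1, n
        do i = 1, n
            out_b(i,j) = gscale(i,j) * 10.0_dp
        end do
    end do

    ! Both variants compute the same values, so the results should match
    if (all(out_a == out_b)) then
        print *, 'PASS'
    else
        print *, 'FAIL'
    end if
end program omp_loop_variants
```

With nvfortran this would be built with OpenMP offload enabled (for example -mp=gpu in place of -acc). Without OpenMP enabled, the directives are treated as comments and the loops run serially on the host.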

Thanks for the report, JorgeG94, and the nice MRE.

I filed a problem report, TPR #38301, and sent it to engineering for investigation.

My best guess is that it’s a problem with scalar replacement. “G%scale(i,j)” is used in three spots in the inner two loops, with “i” and “j” corresponding to the outer loop indices. Hence the compiler can store the array value in a scalar and then reuse this scalar in the inner loops so it doesn’t need to refetch it each time.

Hence a workaround is to manually add this scalar replacement.

#ifdef SCALAR_REPLACE
        real(dp) :: Gscaleij
#endif
        !$acc parallel loop collapse(2) &
        !$acc   present(out, G, G%scale) &
        !$acc   private(du, ddu, err, deriv, uadj, f, df, live, k, itt, i, j)
        do j = G%jsc, G%jec
            do i = G%isc-1, G%iec
                du = 0.0_dp; err = 1.0_dp; deriv = 1.0e6_dp
#ifdef SCALAR_REPLACE
                Gscaleij = G%scale(i,j)
#endif
                do itt = 1, max_itt
                    ddu = -err / deriv
                    du = du + ddu
                    if (abs(ddu) < 1.0e-15_dp * abs(du)) exit
                    err = 0.0_dp; deriv = 0.0_dp
                    do k = 1, nk
                        uadj = 0.01_dp + du
                        ! PATTERN: live set in if/else, consumed after via G%scale
                        if (uadj > 0.0_dp) then
#ifdef SCALAR_REPLACE
                            f    = Gscaleij * uadj * 10.0_dp
#else
                            f    = G%scale(i,j) * uadj * 10.0_dp
#endif
                            live = 10.0_dp       ! set in if-branch
                        else
#ifdef SCALAR_REPLACE
                            f    = -Gscaleij * uadj * 10.0_dp
#else
                            f    = -G%scale(i,j) * uadj * 10.0_dp
#endif
                            live = -10.0_dp      ! set in else-branch
                        end if
#ifdef SCALAR_REPLACE
                        df    = Gscaleij * live   ! live consumed after if/else
#else
                        df    = G%scale(i,j) * live   ! live consumed after if/else
#endif
                        err   = err + f
                        deriv = deriv + df
                    end do
                end do
                out(i,j) = du
            end do
        end do

% nvfortran -O2 -gopt mre_final.F90 -o mre_acc -acc -DSCALAR_REPLACE ; ./mre_acc
result =  -1.0000E-02
PASS

-Mat

You, sir, are a genius. Thanks so much, Mat.