Bug report for nvfortran: arrays not privatizing correctly inside procedure

Hello,

I found what I believe is a bug in nvfortran 21.2. I have a Fortran code with an outermost gang loop, inside which worker subroutines (!$acc routine worker) are called. For one of these, the compiler apparently fails to privatize the temporary arrays inside the worker subroutine, resulting in a race condition and wrong results, even though the arrays are declared private on the worker loop. When I inline this subroutine manually, the code works as intended.

The subroutine is provided below, followed by a rough sketch of how it is called; its input/output arguments are declared private on the outer gang loop in the higher-level code. Let me know if you need more information!

pure subroutine sw_source_adding_2str_nocol(ngpt, nlay, top_at_1, &
                            Rdif, Tdif, Rdir, Tdir, Tnoscat, &
                            sfc_albedo, sfc_albedo_dif, &
                            flux_up, flux_dn, flux_dn_dir)
    !$acc routine worker                                         
    integer,                           intent(in   )  :: ngpt, nlay
    logical(wl),                       intent(in   )  :: top_at_1
    real(wp), dimension(ngpt,nlay),    intent(in)     :: Rdif, Tdif, Rdir, Tdir, Tnoscat
    real(wp), dimension(ngpt        ), intent(in   )  :: sfc_albedo, sfc_albedo_dif          ! surface albedo for direct and diffuse radiation
    real(wp), dimension(ngpt, nlay+1), intent(out)    :: flux_up
    real(wp), dimension(ngpt, nlay+1), intent(inout)  :: flux_dn, flux_dn_dir ! Direct beam flux
                                                                    ! intent(inout) because top layer includes incident flux
    integer :: igpt, ilev
    ! ---------------------------------
    real(wp), dimension(nlay+1) :: albedo      ! reflectivity to diffuse radiation below this level
                                               ! (alpha in SH08)
    real(wp), dimension(nlay  ) :: denom       ! beta in SH08
    real(wp), dimension(nlay+1) :: source      ! source of diffuse upwelling radiation from emission or
                                               ! scattering of direct beam (G in SH08)
    real(wp), dimension(nlay  ) :: source_dn, source_up
    real(wp)                    :: source_sfc
    ! ---------------------------------

    ! Big parallelizable loop over first dimension of input-output arrays. Results in race condition!
    !$acc loop worker vector private(albedo,denom,source,source_dn,source_up,source_sfc,ilev)
    do igpt = 1, ngpt 
        ! Non-parallelizable computations over nlay

        !$acc loop seq
        do ilev = 1, nlay
            source_up(ilev)     =    Rdir(igpt,ilev) * flux_dn_dir(igpt,ilev)
            source_dn(ilev)     =    Tdir(igpt,ilev) * flux_dn_dir(igpt,ilev)
            flux_dn_dir(igpt,ilev+1) = Tnoscat(igpt,ilev) * flux_dn_dir(igpt,ilev)
        end do

        source_sfc = flux_dn_dir(igpt,nlay+1)*sfc_albedo(igpt)

        ilev = nlay + 1
        albedo(ilev)  = sfc_albedo_dif(igpt)
        source(ilev) = source_sfc

        !$acc loop seq
        do ilev = nlay, 1, -1
            denom(ilev) = 1._wp/(1._wp - Rdif(igpt,ilev)*albedo(ilev+1))    ! Eq 10
            albedo(ilev) = Rdif(igpt,ilev) + &
                Tdif(igpt,ilev)*Tdif(igpt,ilev) * albedo(ilev+1) * denom(ilev) ! Equation 9

            source(ilev) =  source_up( ilev) + &
                            Tdif(igpt,ilev) * denom(ilev) *       &
                            (source(ilev+1) + albedo(ilev+1)*source_dn(ilev))
        end do
        ilev = 1
        flux_up(igpt,ilev) = flux_dn(igpt,ilev) * albedo(ilev) + &  ! ... reflection of incident diffuse and
                             source(ilev)                           ! emission from below
        do ilev = 2, nlay+1
            flux_dn(igpt,ilev) = (Tdif(igpt,ilev-1)*flux_dn(igpt,ilev-1) + &  ! Equation 13
                                Rdif(igpt,ilev-1)*source(ilev) +       &
                                source_dn(ilev-1)) * denom(ilev-1)
            flux_up(igpt,ilev) = flux_dn(igpt,ilev) * albedo(ilev) + & ! Equation 12
                            source(ilev)
        end do

    end do

  end subroutine sw_source_adding_2str_nocol
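
For context, the calling structure looks roughly like the sketch below ("ncol" and the loop variable icol are placeholders rather than my actual code):

! Sketch of the higher-level gang loop: the input/output arrays are private
! per gang and the worker routine is called once per outer iteration.
!$acc parallel loop gang private(flux_up, flux_dn, flux_dn_dir)
do icol = 1, ncol
    ! ... set the incident fluxes in flux_dn and flux_dn_dir for this column ...
    call sw_source_adding_2str_nocol(ngpt, nlay, top_at_1, &
                                     Rdif, Tdif, Rdir, Tdir, Tnoscat, &
                                     sfc_albedo, sfc_albedo_dif, &
                                     flux_up, flux_dn, flux_dn_dir)
end do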

Hi peterukk1,

I took a look at the generated GPU code and do see the allocation and use of the private arrays in the routine. Hence it’s unclear if this is the problem or something else. Can you provide a minimal reproducing example that I could build and run to determine the issue?

Note that using private arrays in a device routine does require the compiler to allocate these arrays. Device-side allocation can be quite slow, and the device heap is very small, so it is easily overflowed (a possible cause of your issue). Inlining can correct this as well, since the compiler can then hoist the allocation of the private arrays so that it is performed before entering the compute region.
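
As a rough illustration of what inlining changes (the outer loop and "ncol" below are placeholders, not your code), the structure becomes:

! With the body inlined, the private copies of the temporaries belong to the
! compute construct, so their storage can be set up when the region launches
! instead of being allocated from the device heap inside the routine.
!$acc parallel loop gang
do icol = 1, ncol
    !$acc loop worker vector private(albedo, denom, source, source_dn, source_up)
    do igpt = 1, ngpt
        ! ... inlined body of sw_source_adding_2str_nocol for this igpt ...
    end do
end do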

-Mat

Thanks, that’s good to know. I ran into memory errors a lot before making the inner dimensions known at compile time. I did not realize that exhausting the device heap could result in the program running but producing wrong results.

If device-side allocation is slow, how should I avoid it with OpenACC? Should I avoid device routines altogether, or pass the temporary arrays as arguments and create them before the parallel region?

Unfortunately, this code is part of a rather big program, and I’m not immediately sure how to provide a minimal working example with verifiable results.

Besides private arrays in routines, also avoid using automatic arrays and explicit allocation inside device routines. Both are supported, just not recommended.
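
For example, here is a made-up routine purely to illustrate the two patterns ("wp" assumed to be your working-precision kind from the host module); both of these allocate from the device heap every time the routine runs on the GPU:

subroutine device_alloc_patterns(n)
    !$acc routine worker
    integer,  intent(in)  :: n
    real(wp)              :: work_auto(n)     ! automatic array -> device-side allocation
    real(wp), allocatable :: work_alloc(:)

    allocate(work_alloc(n))                   ! explicit allocation -> device heap as well
    ! ... use work_auto and work_alloc ...
    deallocate(work_alloc)
end subroutine device_alloc_patterns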

Though I just looked at the generated device code again, and I think the allocations I’m seeing are for the automatic arrays, not the private arrays. So you may be correct that there is a race condition going on. I’ll try to write a reproducer that I can send to engineering for review.

Should I avoid device routines altogether, or pass the temporary arrays as arguments and create them before the parallel region?

This is actually the first code I’ve seen that attempts to use private automatic arrays in a worker/vector subroutine. But for automatics, yes, this is what I’ve done before: pass in the array and manually privatize it.

To “manually privatize”, I add an extra dimension to the shared array so that each index has its own element. Here that may mean adding two dimensions: one for the outer gang loop and another for the “ngpt” loop. Or possibly just one dimension for the “ngpt” loop, and then privatizing the array at the gang level. The latter is what I’d try first.

Something like:

real(wp), dimension(ngpt, nlay  ) :: source_dn, source_up, denom
real(wp), dimension(ngpt, nlay+1) :: albedo, source

!$acc parallel loop gang private(albedo,denom,source,source_dn,source_up)
do j=....
...
     call sw_source_adding_2str_nocol(..., albedo, denom, source, source_dn, source_up)
end do
...
pure subroutine sw_source_adding_2str_nocol(ngpt, nlay, top_at_1, &
                            Rdif, Tdif, Rdir, Tdir, Tnoscat, &
                            sfc_albedo, sfc_albedo_dif, &
                            flux_up, flux_dn, flux_dn_dir, albedo,denom,source,source_dn,source_up )
    !$acc routine worker                                         
    integer,                           intent(in   )  :: ngpt, nlay
...
    ! Big parallelizable loop over first dimension of input-output arrays. Results in race condition!
    !$acc loop worker vector 
    do igpt = 1, ngpt 
        ! Non-parallelizable computations over nlay

        !$acc loop seq
        do ilev = 1, nlay
            source_up(igpt,ilev)     =    Rdir(igpt,ilev) * flux_dn_dir(igpt,ilev)
            source_dn(igpt,ilev)     =    Tdir(igpt,ilev) * flux_dn_dir(igpt,ilev)
            flux_dn_dir(igpt,ilev+1) = Tnoscat(igpt,ilev) * flux_dn_dir(igpt,ilev)
        end do

I’m not 100% sure this will work effectively, but hopefully it will get you around the race condition and the performance problems associated with device-side allocation.

Thanks a lot Mat, this is very useful. Your method worked and was indeed fastest!

I also tried the manual privatization but with the dimensions flipped to (nlev,ngpt), to avoid non-unit-stride access in the innermost sequential ilev loops, but that was very slow. Why is that?

By the way, my original method of writing the arrays as 1D with dimension (nlay), inlining, and privatizing with OpenACC over the worker vector loop was very slow unless I made the array bounds nlay and ngpt compile-time constants. Then it worked OK and was just a tad slower than your method.

To “manually privatize”, I add an extra dimension to the shared array so that each index has its own element. Here that may mean adding two dimensions: one for the outer gang loop and another for the “ngpt” loop.

That is closer to what the code originally looks like: 3D temporary arrays, with these big arrays passed between procedures that each run collapsed parallel loops. The code does radiative transfer, where the first and last dimensions are parallelizable but, for most computations, the middle ilev dimension is not, due to a dependency. What I was trying to do was improve performance by putting the computations inside an outermost gang loop and using 2D temporary arrays. I thought it would be faster due to reduced memory use and better locality, but it turns out to be exactly as fast as the original code (somewhat disappointingly). It seems I need to do more reading!

You want the stride-1 dimension associated with the loop index of the vector loop (ngpt) in order to get coalesced accesses.
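
To make that concrete with one of the assignments from your routine (sketch only):

!$acc loop worker vector
do igpt = 1, ngpt                  ! neighbouring threads get consecutive igpt values
    !$acc loop seq
    do ilev = 1, nlay
        ! (ngpt,nlay) layout: the threads of a warp touch consecutive addresses -> coalesced
        source_dn(igpt,ilev) = Tdir(igpt,ilev) * flux_dn_dir(igpt,ilev)
        ! with an (nlay,ngpt) layout, i.e. source_dn(ilev,igpt), consecutive threads are
        ! nlay elements apart, so the unit stride gained inside the sequential ilev loop
        ! is paid for with strided (uncoalesced) accesses across the threads
    end do
end do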

By the way, my original method of writing the arrays as 1D with dimension (nlay), inlining, and privatizing with OpenACC over the worker vector loop was very slow unless I made the array bounds nlay and ngpt compile-time constants. Then it worked OK and was just a tad slower than your method.

I can’t say for sure, but making the array sizes fixed means they are no longer allocated; however, it probably increased the register usage, thus lowering the occupancy.

Thanks!