Add OpenACC to a Fortran loop

Hello,

I am new to OpenACC programming. I am trying to start accelerating the loop below, and something is going wrong.

It is within an MPI program. It compiled OK, but after running it, the results are wrong/different from the ones without OpenACC.

Here nCells = 14548, and nVertLevels = 41.

I appreciate your help.

Wei


!$acc data copyin(EdgesOnCell,nEdgesOnCell,defc_a,defc_b,u,v), copyout(kdiff)
!$acc kernels
do iCell = 1, nCells
   do k = 1, nVertLevels
      d_diag = 0.
      d_off_diag = 0.
      do iEdge = 1, nEdgesOnCell(iCell)
         d_diag     = d_diag     + defc_a(iEdge,iCell)*u(k,EdgesOnCell(iEdge,iCell)) &
                                 - defc_b(iEdge,iCell)*v(k,EdgesOnCell(iEdge,iCell))
         d_off_diag = d_off_diag + defc_b(iEdge,iCell)*u(k,EdgesOnCell(iEdge,iCell)) &
                                 + defc_a(iEdge,iCell)*v(k,EdgesOnCell(iEdge,iCell))
      end do
      ! here is the Smagorinsky formulation,
      ! followed by imposition of an upper bound on the eddy viscosity
      kdiff(k,iCell) = (c_s * config_len_disp)**2 * sqrt(d_diag**2 + d_off_diag**2)
      !kdiff(k,iCell) = min(kdiff(k,iCell),(0.01*config_len_disp**2)/dt)
      !kdiff(k,iCell) = min(kdiff(k,iCell),config_len_disp_over_dt)
      if (kdiff(k,iCell) < config_len_disp_over_dt) then
         kdiff(k,iCell) = config_len_disp_over_dt
      end if
   end do
end do
!$acc end kernels
!$acc end data

Hi Wei,

What are the compiler feedback messages (-Minfo=accel)? This might give us some clues as to the issue.

- Mat

Hello Mat,

Here is the compile info with -Minfo=accel.

mpif90 -D_MPI -DUNDERSCORE -DCORE_ATMOSPHERE -DMPAS_NAMELIST_SUFFIX=atmosphere -DMPAS_EXE_NAME=atmosphere_model -DMPAS_GIT_VERSION=unknown -DDO_PHYSICS -acc -Minfo=accel -Mprof=time -r8 -O3 -byteswapio -Mfree -c mpas_atm_time_integration.F -I/opt/pgi/PIO/include -I/opt/pgi/include -I/opt/pgi/linux86-64/2015/netcdf/include -I/opt/pgi/PIO/include -I/opt/pgi/include -I/opt/pgi/linux86-64/2015/netcdf/include -I…/…/framework -I…/…/operators -I…/physics -I…/physics/physics_wrf -I…/…/external/esmf_time_f90
atm_compute_dyn_tend:
2667, Generating copyin(edgesoncell(:,:),nedgesoncell(:),defc_a(:,:),defc_b(:,:),u(:,:),v(:,:))
Generating copyout(kdiff(:,:))
2668, Accelerator kernel generated
Generating Tesla code
2669, !$acc loop gang ! blockidx%x
2673, !$acc loop vector(128) ! threadidx%x
2674, Sum reduction generated for d_diag
2676, Sum reduction generated for d_off_diag
2670, Complex loop carried dependence of nedgesoncell,config_len_disp,kdiff,edgesoncell,u,defc_a,v,defc_b prevents parallelization
2673, Loop is parallelizable


By the way, this is a large MPI code. Can I just apply OpenACC directives like this to individual loops?

I tried to find a complete MPI + OpenACC hybrid example, but could not find one.


Thanks,

Wei

Mat,

It compiled, but here is the error message when I run it:

nCells = 14548 nEdges = 44083
nCellsSolve = 13679 nEdgesSolve = 41062
launch CUDA kernel file=/home/whuang/pgi/srcs/MPAS-Release-4.0/src/core_atmosphere/dynamics/mpas_atm_time_integration.F function=atm_compute_dyn_tend line=2668 device=0 threadid=1 num_gangs=14548 num_workers=1 vector_length=128 grid=14548 block=128 shared memory=1024
call to cuStreamSynchronize returned error 700: Illegal address during kernel execution
call to cuMemFreeHost returned error 700: Illegal address during kernel execution


Thanks,

Wei

Can I just apply OpenACC directives like this to individual loops?

Yes. It’s one of the benefits of OpenACC. It can be added incrementally, loop by loop. The only issue is that you may see sub-optimal performance due to excessive data movement until you add higher level data management.
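
To your other question about an example: below is a minimal sketch of the hybrid pattern (the program name, array, and sizes are made up, not taken from MPAS). Each MPI rank offloads its own local loops with OpenACC, while the MPI calls themselves operate on host data outside the accelerated region. It is compiled the same way you already compile, e.g. mpif90 -acc -Minfo=accel.

program mpi_acc_sketch
   use mpi
   implicit none
   integer, parameter :: n = 100000
   integer :: ierr, rank, i
   real(8) :: a(n), local_sum, global_sum

   call MPI_Init(ierr)
   call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)

   local_sum = 0.d0
   global_sum = 0.d0

!$acc data create(a)
!$acc kernels
   ! each rank fills and sums its own local array on the device;
   ! the compiler detects the sum reduction on local_sum
   do i = 1, n
      a(i) = dble(rank) + dble(i)*1.d-6
   end do
   do i = 1, n
      local_sum = local_sum + a(i)
   end do
!$acc end kernels
!$acc end data

   ! the MPI communication stays on the host, outside the accelerated region
   call MPI_Reduce(local_sum, global_sum, 1, MPI_DOUBLE_PRECISION, MPI_SUM, 0, &
                   MPI_COMM_WORLD, ierr)
   if (rank == 0) print *, 'global sum = ', global_sum

   call MPI_Finalize(ierr)
end program mpi_acc_sketch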

2669, !$acc loop gang ! blockidx%x
2673, !$acc loop vector(128) ! threadidx%x
2674, Sum reduction generated for d_diag
2676, Sum reduction generated for d_off_diag
2670, Complex loop carried dependence of nedgesoncell,config_len_disp,kdiff,edgesoncell,u,defc_a,v,defc_b prevents parallelization

From this output we see that the compiler has parallelized the “iCell” and “iEdge” loops but runs the “k” loop sequentially due to the loop dependencies.

What I’d like you to try is changing your directives to:

!$acc kernels
!$acc loop gang vector collapse(2) independent
do iCell = 1, nCells
do k=1, nVertLevels
d_diag = 0.
d_off_diag = 0.
!$acc loop seq
do iEdge = 1, nEdgesOnCell(iCell)

The “seq” loop is probably unnecessary but I added it for illustration.

By default, the compiler attempts to parallelize all loops and vectorizes the innermost loop. However, given your algorithm and the fact that “nCells” is large, I think it’s better to parallelize just the outer two loops and run the inner loop sequentially. Reductions can be expensive, particularly for small inner loops, so it’s sometimes better to run them sequentially.

I added “independent” to assert to the compiler that the “k” loop can be parallelized. I do find it odd that the compiler complained about dependencies, unless you’re using pointers, in which case the compiler has to assume that kdiff could point to the same memory as another pointer.

I collapsed the “iCell” and “k” loops because I want “vector” to apply to the “k” loop, but since “k” is smaller than a good vector length (41 < 128) it’s better to collapse the two loops together than to give “iCell” a “gang” schedule and “k” only a “vector” schedule. “vector” loops should correspond to the stride-1 dimension to allow for coalesced memory access.
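
Putting it together, the whole loop nest would look something like this (same data clauses and loop body as your original code; only the loop directives are new):

!$acc data copyin(EdgesOnCell,nEdgesOnCell,defc_a,defc_b,u,v), copyout(kdiff)
!$acc kernels
!$acc loop gang vector collapse(2) independent
do iCell = 1, nCells
   do k = 1, nVertLevels
      d_diag = 0.
      d_off_diag = 0.
!$acc loop seq
      do iEdge = 1, nEdgesOnCell(iCell)
         d_diag     = d_diag     + defc_a(iEdge,iCell)*u(k,EdgesOnCell(iEdge,iCell)) &
                                 - defc_b(iEdge,iCell)*v(k,EdgesOnCell(iEdge,iCell))
         d_off_diag = d_off_diag + defc_b(iEdge,iCell)*u(k,EdgesOnCell(iEdge,iCell)) &
                                 + defc_a(iEdge,iCell)*v(k,EdgesOnCell(iEdge,iCell))
      end do
      kdiff(k,iCell) = (c_s * config_len_disp)**2 * sqrt(d_diag**2 + d_off_diag**2)
      if (kdiff(k,iCell) < config_len_disp_over_dt) then
         kdiff(k,iCell) = config_len_disp_over_dt
      end if
   end do
end do
!$acc end kernels
!$acc end data

The scalars d_diag and d_off_diag should be privatized automatically for each iteration of the collapsed loop, so no explicit private clause is needed.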

the results are wrong/different to the one without OpenAcc.

The wrong answers could be due to the inner loop reductions being performed in parallel since parallel reductions can lead to divergent answers. But it could be bad code generation by the compiler as well. Without a reproducer, it’s difficult for me to tell. Though hopefully the adjusted schedule above will solve the issue. If not, I may ask for a reproducing example and/or additional information.
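
As an aside, here is a small standalone sketch (unrelated to your code) of why summation order matters: floating-point addition is not associative, so adding the same values in a different order, which is exactly what a parallel reduction does, can change the last digits of the result.

program reduction_order
   implicit none
   integer, parameter :: n = 1000000
   integer :: i
   real(4) :: x(n), forward, backward
   do i = 1, n
      x(i) = 1.0/real(i)            ! values spanning several orders of magnitude
   end do
   forward  = 0.0
   backward = 0.0
   do i = 1, n                       ! sum in increasing index order
      forward = forward + x(i)
   end do
   do i = n, 1, -1                   ! sum in decreasing index order
      backward = backward + x(i)
   end do
   print *, 'forward sum:  ', forward
   print *, 'backward sum: ', backward   ! the two results typically differ slightly
end program reduction_order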

call to cuStreamSynchronize returned error 700: Illegal address during kernel execution

Typically this is caused by an out-of-bounds error, a host address being used on the device, a dynamic allocation on the device that is too large and exhausts the heap, etc.

My best guess in this case is that it’s a compiler code-generation issue. However, if the error persists after you apply the explicit schedule listed above, consider checking for out-of-bounds errors in your host code (I like to use Valgrind to find these).

- Mat

Mat,

I applied directives:
!$acc kernels
!$acc loop gang vector collapse(2) independent

to other simpler loops, and it worked.

This loop still has the error:
call to cuStreamSynchronize returned error 700: Illegal address during kernel execution

I’ll check further.

Thanks for your help.

Wei