Add OpenACC to a Fortran loop

Huang_Wei · December 2, 2015, 8:40pm

Hello,

I am new to OpenACC programming. I am trying to start accelerate a loop below, and there is something wrong.

It is within a MPI program, it compiled OK, and after run it, the results are wrong/different to the one without OpenAcc.

Here nCells = 14548, and nVertLevels = 41.

I appreciate your help.

Wei

!$acc data copyin(EdgesOnCell,nEdgesOnCell,defc_a,defc_b,u,v), copyout(kdiff)
!$acc kernels
do iCell = 1, nCells
do k=1, nVertLevels
d_diag = 0.
d_off_diag = 0.
do iEdge = 1, nEdgesOnCell(iCell)
d_diag = d_diag + defc_a(iEdge,iCell)*u(k,EdgesOnCell(iEdge,iCell)) &

defc_b(iEdge,iCell)*v(k,EdgesOnCell(iEdge,iCell))
d_off_diag = d_off_diag + defc_b(iEdge,iCell)*u(k,EdgesOnCell(iEdge,iCell)) &

defc_a(iEdge,iCell)v(k,EdgesOnCell(iEdge,iCell))
end do
! here is the Smagorinsky formulation,
! followed by imposition of an upper bound on the eddy viscosity
kdiff(k,iCell) = (c_s * config_len_disp)2 * sqrt(d_diag2 + d_off_diag**2)
!kdiff(k,iCell) = min(kdiff(k,iCell),(0.01config_len_disp**2)/dt)
!kdiff(k,iCell) = min(kdiff(k,iCell),config_len_disp_over_dt)
if(kdiff(k,iCell) < config_len_disp_over_dt) then
kdiff(k,iCell) = config_len_disp_over_dt
end if
end do
end do
!$acc end kernels
!$acc end data

MatColgrove · December 2, 2015, 9:09pm

Hi Wei,

What are the compiler feedback messages (-Minfo=accel)? This might give us some clues as to the issue.

Mat

Huang_Wei · December 3, 2015, 2:33pm

Hello Mat,

Here is the compile info with -Minfo=accel.

mpif90 -D_MPI -DUNDERSCORE -DCORE_ATMOSPHERE -DMPAS_NAMELIST_SUFFIX=atmosphere -DMPAS_EXE_NAME=atmosphere_model -DMPAS_GIT_VERSION=unknown -DDO_PHYSICS -acc -Minfo=accel -Mprof=time -r8 -O3 -byteswap
io -Mfree -c mpas_atm_time_integration.F -I/opt/pgi/PIO/include -I/opt/pgi/include -I/opt/pgi/linux86-64/2015/netcdf/include -I/opt/pgi/PIO/include -I/opt/pgi/include -I/opt/pgi/linux86-64/2015/netcd
f/include -I…/…/framework -I…/…/operators -I…/physics -I…/physics/physics_wrf -I…/…/external/esmf_time_f90
atm_compute_dyn_tend:
2667, Generating copyin(edgesoncell(:,:),nedgesoncell(:),defc_a(:,:),defc_b(:,:),u(:,:),v(:,:))
Generating copyout(kdiff(:,:))
2668, Accelerator kernel generated
Generating Tesla code
2669, !$acc loop gang ! blockidx%x
2673, !$acc loop vector(128) ! threadidx%x
2674, Sum reduction generated for d_diag
2676, Sum reduction generated for d_off_diag
2670, Complex loop carried dependence of nedgesoncell,config_len_disp,kdiff,edgesoncell,u,defc_a,v,defc_b prevents parallelization
2673, Loop is parallelizable

By the way, this is large MPI code. Can I just apply OpenACC directives like this to individual loops?

I tried to find a complete MPI OpenACC hybrid example, but did not find one.

Thanks,

Wei[quote][/quote]

Huang_Wei · December 3, 2015, 2:42pm

Mat,

After compile, here is the error message:

nCells = 14548 nEdges = 44083
nCellsSolve = 13679 nEdgesSolve = 41062
launch CUDA kernel file=/home/whuang/pgi/srcs/MPAS-Release-4.0/src/core_atmosphere/dynamics/mpas_atm_time_integration.F function=atm_compute_dyn_tend line=2668 device=0 threadid=1 num_gangs=14548 num_workers=1 vector_length=128 grid=14548 block=128 shared memory=1024
call to cuStreamSynchronize returned error 700: Illegal address during kernel execution
call to cuMemFreeHost returned error 700: Illegal address during kernel execution

Thanks,

Wei

MatColgrove · December 3, 2015, 7:14pm

Can I just apply OpenACC directives like this to individual loops?

Yes. It’s one of the benefits of OpenACC. It can be added incrementally, loop by loop. The only issue is that you may see sub-optimal performance due to excessive data movement until you add higher level data management.

2669, !$acc loop gang ! blockidx%x
2673, !$acc loop vector(128) ! threadidx%x
2674, Sum reduction generated for d_diag
2676, Sum reduction generated for d_off_diag
2670, Complex loop carried dependence of nedgesoncell,config_len_disp,kdiff,edgesoncell,u,defc_a,v,defc_b prevents parallelization

From this output we see that the compiler has parallelized the “iCell” and “iEdge” loop but runs the “k” loop sequentially due to loop dependencies.

What I’d like you try is changing your directive to:

!$acc kernels
!$acc loop gang vector collapse(2) independent
do iCell = 1, nCells
do k=1, nVertLevels
d_diag = 0.
d_off_diag = 0.
!$acc loop seq
do iEdge = 1, nEdgesOnCell(iCell)

The “seq” loop is probably unnecessary but I added it for illustration.

By default, the compiler attempts to parallelize all loops and will vectorize the innermost loop. However given your algorithm and the fact that the value of “nCells” is large, I think it’s better to just parallelize the outer two loops and run the inner loop sequentially. Reductions can be expensive, in particular for small inner loops, and it’s sometimes better to run these sequentially.

I added “independent” to assert to the compiler that the “k” loop can be parallelized. I do find it odd that the compiler complained about dependencies, unless you’re using pointers in which case the compiler has to assume that kDiff could point to the same memory as another pointer.

I collapse the “iCell” and “k” loops since I want “vector” to apply to the “k” loop but since it’s smaller than a good vector length (41<128), it’s better to collapse the loops together rather than making “iCell” use a “gang” schedule and “k” solely use “vector”. “vector” loops should correspond to the stride-1 dimension to allow for coalesced memory access.

the results are wrong/different to the one without OpenAcc.

The wrong answers could be due to the inner loop reductions being performed in parallel since parallel reductions can lead to divergent answers. But it could be bad code generation by the compiler as well. Without a reproducer, it’s difficult for me to tell. Though hopefully the adjusted schedule above will solve the issue. If not, I may ask for a reproducing example and/or additional information.

call to cuStreamSynchronize returned error 700: Illegal address during kernel execution

Typically this is caused by an out-of-bounds errors, a host address being used, too large of a dynamic allocation from the device causing the heap to be exhausted, etc.

My best guess in this case is that it’s compiler code generation issue. However if it persists after you apply the explicit schedule listed above, consider checking for out-of-bounds errors in your host code (I like to use Valgrind to find these).

Mat

Huang_Wei · December 3, 2015, 10:02pm

Mat,

I applied directives:
!$acc kernels
!$acc loop gang vector collapse(2) independent

to other simpler loops, and it worked.

This loop still have the error:
call to cuStreamSynchronize returned error 700: Illegal address during kernel execution

I’ll check further.

Thanks for your help.

Wei

Topic		Replies	Views
parallelize a fortran loop. Legacy PGI Compilers	7	6547	December 11, 2015
Parallelizing a loop Legacy PGI Compilers	9	5569	March 1, 2016
a 3 levels of loop Legacy PGI Compilers	1	2093	September 6, 2012
acc kernels / acc parallel question Legacy PGI Compilers	2	3915	September 1, 2017
How to parallelize this loop... Legacy PGI Compilers	14	7950	December 18, 2012
Couple of questions (nested loops, loop bounds, etc.) Legacy PGI Compilers	17	25243	December 11, 2014
Question on how to accelerate a complex loop Legacy PGI Compilers	3	1594	March 25, 2019
Need advice for OpenACC directives Legacy PGI Compilers	6	7374	July 5, 2016
The result changes, after adding the openacc statement Legacy PGI Compilers	11	3947	July 21, 2018
should use to "acc reduction" in an inner loop Legacy PGI Compilers	4	4255	December 6, 2012

Add OpenACC to a Fortran loop

Related topics