Complex loop carried dependence

Hi Mat,

Thanks again for your tireless replies to the questions in this forum.

A number of issues are coming up as I put OpenACC into the full application. I’ll try to bring these up as separate topics, and simplify each specific issue as far as possible. Here is the first.

The following triply-nested loop is repeated in similar form multiple times in the application.

Several levels above the kernel in the program hierarchy we have:

!$ACC DATA COPY( metrics_exner_ref_mc_d, z_exner_ex_pr_d, lots of other stuff)

:

Then:

c_startidx = GET_STARTIDX_C(rl_start,1)
c_endidx = GET_ENDIDX_C(rl_end, MAX(1,p_patch%n_childdom))

!$ACC KERNELS &
!$ACC PRESENT( metrics_exner_ref_mc_d, z_exner_ex_pr_d )
!$ACC LOOP GANG
DO jb = i_startblk, i_endblk

IF ( i_startblk == jb ) THEN; i_startidx = c_startidx; ELSE; i_startidx = 1; ENDIF
IF ( i_endblk == jb ) THEN; i_endidx = c_endidx; ELSE; i_endidx = nproma; ENDIF


!$ACC LOOP VECTOR COLLAPSE(2) !!! BUG: COLLAPSE(2) causes 12.10 to crash!
DO jc = i_startidx, i_endidx
!DIR$ VECTOR
DO jk = 1, nlev
z_exner_ex_pr_d(jc,jk,jb) = - metrics_exner_ref_mc_d(jc,jk,jb)
ENDDO
ENDDO
ENDDO
!$ACC END KERNELS

First: if “COLLAPSE(2)” is present, the 12.10 compiler crashes with the error:

pgfortran-Fatal-/apps/castor/pgi-12.10/linux86-64/12.10/bin/pgf902 TERMINATED by signal 11
Arguments to /apps/castor/pgi-12.10/linux86-64/12.10/bin/pgf902
/apps/castor/pgi-12.10/linux86-64/12.10/bin/pgf902 /tmp/pgfortranllqfHP_u9pbx.ilm -fn …/…/…/src/atm_dyn_iconam/mo_solve_nonhydro.f90 -opt 3 -terse 1 -inform warn -x 51 0x20 -x 119 0xa10000 -x 122 0x40 -x 123 0x1000 -x 127 4 -x 127 17 -x 19 0x400000 -x 28 0x40000 -x 120 0x10000000 -x 70 0x8000 -x 122 1 -x 125 0x20000 -quad -vect 56 -y 34 16 -x 34 0x8 -x 32 12582912 -y 19 8 -y 35 0 -x 42 0x30 -x 39 0x40 -x 39 0x80 -x 34 0x400000 -x 149 1 -x 150 1 -x 59 4 -x 59 4 -tp nehalem -x 120 0x1000 -x 124 0x1400 -y 15 2 -x 57 0x3b0000 -x 58 0x48000000 -x 49 0x100 -x 120 0x200 -astype 0 -x 121 1 -x 124 1 -x 9 1 -x 42 0x14200000 -x 72 0x1 -x 136 0x11 -x 80 0x800000 -quad -x 119 0x10000000 -x 129 0x40000000 -x 129 2 -x 164 0x1000 -x 186 0x80000 -x 180 0x400 -x 180 0x4000000 -x 163 0x1 -x 186 0x80000 -x 180 0x400 -x 180 0x4000000 -x 186 2 -accel nvidia -x 176 0x140000 -x 177 0x0202007f -x 0 0x1000000 -x 2 0x100000 -x 0 0x2000000 -x 161 16384 -x 162 16384 -cmdline ‘+pgfortran …/…/…/src/atm_dyn_iconam/mo_solve_nonhydro.f90 -I…/include -I…/…/…/src/include -I/apps/castor/zlib/1.2.7/gnu_463/install/include -I/apps/castor/mvapich2/1.8.1/mvapich2-pgi/include -O3 -fastsse -fast -Mvect=sse -Mscalarsse -Mcache_align -Mflushz -Mpre -I…/module -D__ICON__ -D__LOOP_EXCHANGE -DPGI_COMPILER -DNO_NETCDF -DDSL_INLINE= -acc -ta=nvidia -Minfo=accel -Mpreprocess -c -I/apps/castor/mvapich2/1.8.1/mvapich2-pgi/include -I/apps/castor/mvapich2/1.8.1/mvapich2-pgi/include’ -asm /tmp/pgfortrantlqf54gApNPC.sm

OK, the crash is obviously an issue, but it is clear from many other similar loops that the compiler does not want to parallelize the i_startidx/i_endidx loop. If you look at the IF statements just above it, you will see that the loop is almost rectangular, with truncated index ranges only in the first and last gang. If there is a way to express this that will help the compiler parallelize the loop, please let me know.
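One reformulation I have been considering (untested) is to iterate over the full rectangular range and move the index limits into a guard, much as the CUDA Fortran prototype below does:

!$ACC KERNELS &
!$ACC PRESENT( metrics_exner_ref_mc_d, z_exner_ex_pr_d )
!$ACC LOOP GANG
DO jb = i_startblk, i_endblk
!$ACC LOOP VECTOR COLLAPSE(2)
   DO jc = 1, nproma          ! fixed bounds: same trip count for every gang
      DO jk = 1, nlev
         ! the guard reproduces the variable i_startidx/i_endidx limits
         IF ( ( jb > i_startblk .OR. jc >= c_startidx ) .AND. &
              ( jb < i_endblk   .OR. jc <= c_endidx ) ) THEN
            z_exner_ex_pr_d(jc,jk,jb) = - metrics_exner_ref_mc_d(jc,jk,jb)
         ENDIF
      ENDDO
   ENDDO
ENDDO
!$ACC END KERNELS

At least that gives every gang the same trip count, at the cost of a few idle threads in the first and last gang.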

For what it is worth, the prototype CUDA Fortran implementation of the same code is:

! loop through all patch cells (and blocks)
!
jb = blockidx%x + ( i_startblk - 1 )
jc = threadidx%x
jk = threadidx%y   ! [1 … nlev]

IF ( ( i_startblk <  jb .and. jb < i_endblk    ) .or. &
     ( i_startblk == jb .and. i_startidx <= jc ) .or. &
     ( i_endblk   == jb .and. jc <= i_endidx   ) ) THEN

   ! extrapolated perturbation Exner pressure (used for horizontal gradients only)
   z_exner_ex_pr(jc,jk,jb) = - exner_ref_mc(jc,jk,jb)
ENDIF

And this works great (a pity we cannot use CUDA Fortran in the real application).
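For context, the kernel is launched with one CUDA block per jb block and an nproma × nlev thread block. Roughly, with the real argument list abbreviated and “exner_kernel” a placeholder name:

! host-side launch (sketch; inside a routine that USEs cudafor):
! one CUDA block per patch block, nproma x nlev threads covering (jc,jk)
TYPE(dim3) :: blocks, threads
blocks  = dim3( i_endblk - i_startblk + 1, 1, 1 )
threads = dim3( nproma, nlev, 1 )
CALL exner_kernel<<<blocks,threads>>>( z_exner_ex_pr, exner_ref_mc, &
                                       i_startblk, i_endblk, i_startidx, i_endidx )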

So if the COLLAPSE(2) is removed, we get the warning:

22, Complex loop carried dependence of ‘metrics_exner_ref_mc_d’ prevents parallelization
Complex loop carried dependence of ‘z_exner_ex_pr_d’ prevents parallelization
Inner sequential loop scheduled on accelerator
Loop is parallelizable

The “complex loop carried dependence” (CLCD) warning does not make sense to me. I’ve read some of your replies about CLCD when indirect addressing is involved, and I understand it in those cases, but that is not the situation here.

Of course, the actual calculation inside the loop is much more complicated, and I now have an OpenACC version (albeit with the above CLCD warning) which produces valid results. I now want to eliminate the CLCD issue.

Thanks, --Will

Hi Will,

If you could send a reproducing example for the pgf902 segv to PGI Customer Service (trs@pgroup.com), we would appreciate it. Obviously it’s a major compiler issue that we’d like to get fixed. It would also help us diagnose the complex loop dependence.
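For the report, a stripped-down, self-contained example is all we need. Something along the lines of this sketch (placeholder arrays “a” and “b”, not your actual code) would be a good starting point:

PROGRAM repro
   IMPLICIT NONE
   INTEGER, PARAMETER :: nproma = 8, nlev = 10, nblks = 4
   INTEGER :: jb, jc, jk, i_startidx, i_endidx
   INTEGER :: i_startblk, i_endblk, c_startidx, c_endidx
   REAL    :: a(nproma,nlev,nblks), b(nproma,nlev,nblks)

   i_startblk = 1;  i_endblk = nblks
   c_startidx = 3;  c_endidx = 5
   a = 0.0;  b = 1.0

!$ACC DATA COPY( a, b )
!$ACC KERNELS PRESENT( a, b )
!$ACC LOOP GANG
   DO jb = i_startblk, i_endblk
      IF ( i_startblk == jb ) THEN; i_startidx = c_startidx; ELSE; i_startidx = 1; ENDIF
      IF ( i_endblk == jb ) THEN; i_endidx = c_endidx; ELSE; i_endidx = nproma; ENDIF
!$ACC LOOP VECTOR COLLAPSE(2)   ! reportedly triggers the pgf902 segv
      DO jc = i_startidx, i_endidx
         DO jk = 1, nlev
            a(jc,jk,jb) = - b(jc,jk,jb)
         ENDDO
      ENDDO
   ENDDO
!$ACC END KERNELS
!$ACC END DATA
   PRINT *, 'sum = ', SUM(a)
END PROGRAM repro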

My best guess is that the problem stems from the fact that the vector length is uniform across all gangs, but the “jc” loop bounds are variable. This may be causing an unexpected code path to be taken in the compiler, though we can’t be sure until we can reproduce the error.

As a workaround, you may try interchanging the “jk” and “jc” loops, or pushing “jk” above the IF statements.

Something like:

!$ACC LOOP VECTOR
DO jk = 1, nlev
   DO jc = i_startidx, i_endidx
      z_exner_ex_pr_d(jc,jk,jb) = - metrics_exner_ref_mc_d(jc,jk,jb)
   ENDDO
ENDDO
ENDDO   ! closes the outer jb loop from your version
!$ACC END KERNELS

or

!$ACC KERNELS &
!$ACC PRESENT( metrics_exner_ref_mc_d, z_exner_ex_pr_d )
DO jb = i_startblk, i_endblk
   DO jk = 1, nlev

      IF ( i_startblk == jb ) THEN; i_startidx = c_startidx; ELSE; i_startidx = 1; ENDIF
      IF ( i_endblk == jb ) THEN; i_endidx = c_endidx; ELSE; i_endidx = nproma; ENDIF

      DO jc = i_startidx, i_endidx
         z_exner_ex_pr_d(jc,jk,jb) = - metrics_exner_ref_mc_d(jc,jk,jb)
      ENDDO
   ENDDO
ENDDO
!$ACC END KERNELS
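
Another option, since you know the gangs write disjoint sections of the arrays, is to assert this yourself with the “independent” clause (untested sketch):

!$ACC KERNELS &
!$ACC PRESENT( metrics_exner_ref_mc_d, z_exner_ex_pr_d )
!$ACC LOOP INDEPENDENT GANG
DO jb = i_startblk, i_endblk
   IF ( i_startblk == jb ) THEN; i_startidx = c_startidx; ELSE; i_startidx = 1; ENDIF
   IF ( i_endblk == jb ) THEN; i_endidx = c_endidx; ELSE; i_endidx = nproma; ENDIF
!$ACC LOOP INDEPENDENT VECTOR
   DO jc = i_startidx, i_endidx
      DO jk = 1, nlev
         z_exner_ex_pr_d(jc,jk,jb) = - metrics_exner_ref_mc_d(jc,jk,jb)
      ENDDO
   ENDDO
ENDDO
!$ACC END KERNELS

“independent” tells the compiler to skip its dependence analysis for that loop, so only use it when you are sure the iterations don’t conflict.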
  • Mat