We are struggling to port a Fortran2003 radiation parameterization to OpenACC (see separate correspondence), and are now plodding through compiler warnings to determine why the GPU version yields different output. The compiler is flagging a “Complex loop carried dependence”, which in my mind has no reason to be:
1941, Complex loop carried dependence of tmp_lay_src prevents parallelization
Loop carried dependence of tmp_lay_src prevents parallelization
Loop carried backward dependence of tmp_lay_src prevents vectorization
The ‘offensive’ code is:
if (present(sfc_src)) then
ilay = merge(1,nlay,play(1,1) > play(nlay,1)) ! surface layer index
ALLOCATE( tmp_sfc_src( SIZE(sfc_src,1), SIZE(sfc_src,2) ) )
!$ACC DATA CREATE( tmp_sfc_src )
!$ACC PARALLEL PRESENT( pfrac )
!$ACC LOOP GANG
do icol = 1, ncol
! tmp_sfc_src(icol,:) = pfrac(:,ilay,icol) * this%expand(plnksfc(:,icol))
tmp_sfc_src(icol,:) = 0;
!$ACC LOOP VECTOR
do iband=1,nband
tmp_sfc_src(icol,this_band2gpt(1,iband):this_band2gpt(2,iband)) = &
tmp_sfc_src(icol,this_band2gpt(1,iband):this_band2gpt(2,iband)) + &
plnksfc(iband,icol)
end do
tmp_sfc_src(icol,:) = tmp_sfc_src(icol,:) * pfrac(:,ilay,icol)
end do ! icol
!$ACC END PARALLEL
!$ACC UPDATE HOST( tmp_sfc_src )
sfc_src = tmp_sfc_src
!$ACC END DATA
DEALLOCATE( tmp_sfc_src )
end if
Admittedly the array syntax might be confusing for the compiler, so I tried explicitly writing out the loops and evening making the inner loop SEQ, but all to no avail:
do iband=1,nband
- tmp_lay_src(icol,ilay,this_band2gpt(1,iband):this_band2gpt(2,iband)) = &
- tmp_lay_src(icol,ilay,this_band2gpt(1,iband):this_band2gpt(2,iband)) + &
- plnklay(ilay,iband,icol)
+!$ACC LOOP SEQ
+ do this_band=this_band2gpt(1,iband),this_band2gpt(2,iband)
+ tmp_lay_src(icol,ilay,this_band) = &
+ tmp_lay_src(icol,ilay,this_band) + &
+ plnklay(ilay,iband,icol)
+ end do
end do
- tmp_lay_src(icol,ilay,:) = tmp_lay_src(icol,ilay,:) * pfrac(:,ilay,icol)
+ do igpt=1,ngpt
+ tmp_lay_src(icol,ilay,igpt) = tmp_lay_src(icol,ilay,igpt) * pfrac(igpt,ilay,icol)
+ end do
end do ! ilay
Three questions: (1) do you see any real dependency here? Conceptually, at least, there should be none. (2) How can one convince the compiler there is no dependence? And (3) Is there any reason to believe that this perceived dependence could be the reason the OpenACC code is yielding different results than the CPU-only code?
Thanks, --Will