Complex loop carried dependence

We are struggling to port a Fortran 2003 radiation parameterization to OpenACC (see separate correspondence), and are now working through compiler warnings to determine why the GPU version yields different output. The compiler is flagging a “Complex loop carried dependence” which, as far as I can tell, should not exist:


  1941, Complex loop carried dependence of tmp_lay_src prevents parallelization
         Loop carried dependence of tmp_lay_src prevents parallelization
         Loop carried backward dependence of tmp_lay_src prevents vectorization

The offending code is:

  
   if (present(sfc_src)) then
      ilay = merge(1,nlay,play(1,1) > play(nlay,1)) ! surface layer index
      ALLOCATE( tmp_sfc_src( SIZE(sfc_src,1), SIZE(sfc_src,2) ) )
!$ACC DATA CREATE( tmp_sfc_src )
!$ACC PARALLEL PRESENT( pfrac )
!$ACC LOOP GANG
      do icol = 1, ncol
        ! tmp_sfc_src(icol,:) = pfrac(:,ilay,icol) * this%expand(plnksfc(:,icol))
        tmp_sfc_src(icol,:) = 0;
!$ACC LOOP VECTOR
        do iband=1,nband
          tmp_sfc_src(icol,this_band2gpt(1,iband):this_band2gpt(2,iband)) = &
            tmp_sfc_src(icol,this_band2gpt(1,iband):this_band2gpt(2,iband)) + &
            plnksfc(iband,icol)
        end do
        tmp_sfc_src(icol,:) = tmp_sfc_src(icol,:) * pfrac(:,ilay,icol)
      end do ! icol
!$ACC END PARALLEL
!$ACC UPDATE HOST( tmp_sfc_src )
      sfc_src = tmp_sfc_src
!$ACC END DATA
      DEALLOCATE( tmp_sfc_src )
    end if

Admittedly the array syntax might be confusing for the compiler, so I tried explicitly writing out the loops and even making the inner loop SEQ, but all to no avail:

           do iband=1,nband
-            tmp_lay_src(icol,ilay,this_band2gpt(1,iband):this_band2gpt(2,iband)) = &
-              tmp_lay_src(icol,ilay,this_band2gpt(1,iband):this_band2gpt(2,iband)) + &
-              plnklay(ilay,iband,icol)
+!$ACC LOOP SEQ
+            do this_band=this_band2gpt(1,iband),this_band2gpt(2,iband)
+              tmp_lay_src(icol,ilay,this_band) = &
+                tmp_lay_src(icol,ilay,this_band) + &
+                plnklay(ilay,iband,icol)
+            end do
           end do
-          tmp_lay_src(icol,ilay,:) = tmp_lay_src(icol,ilay,:) * pfrac(:,ilay,icol)
+          do igpt=1,ngpt
+            tmp_lay_src(icol,ilay,igpt) = tmp_lay_src(icol,ilay,igpt) * pfrac(igpt,ilay,icol)
+          end do
         end do ! ilay

Three questions: (1) do you see any real dependency here? Conceptually, at least, there should be none. (2) How can one convince the compiler there is no dependence? And (3) Is there any reason to believe that this perceived dependence could be the reason the OpenACC code is yielding different results than the CPU-only code?

Thanks, --Will


Hi Will,

(1) do you see any real dependency here? Conceptually, at least, there should be none.

Are the variables pointers? If so, this would cause the reported dependencies, since the compiler must assume that pointer variables may overlap (alias) in memory.
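As a sketch (assuming the declarations in your module look something like this; adjust to your actual kinds and ranks), the difference is purely in the attribute: Fortran guarantees that ALLOCATABLE arrays cannot alias, while POINTER arrays might.

```fortran
! Hypothetical declarations -- with POINTER, the compiler must assume
! tmp_lay_src could overlap another pointer target in memory:
real, pointer     :: tmp_lay_src(:,:,:)

! With ALLOCATABLE, the language guarantees no aliasing, so the
! compiler is free to parallelize accesses to it:
real, allocatable :: tmp_lay_src(:,:,:)
```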

(2) How can one convince the compiler there is no dependence?

In OpenACC, the workaround would be to add a loop directive, but this of course doesn’t work for array syntax, since there are no explicit loops. The other option is to expand the array syntax into explicit loops (as you’ve done) so a loop construct can be applied.
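For the expanded-loop version, the OpenACC “independent” clause is the standard way to assert that the iterations don’t depend on each other. A sketch for your band loop (safe only if the this_band2gpt ranges don’t overlap between bands, which conceptually they shouldn’t):

```fortran
!$acc loop vector independent
        do iband = 1, nband
          ! Each band writes a disjoint g-point range, so the
          ! iterations are independent despite the read-modify-write.
          do igpt = this_band2gpt(1,iband), this_band2gpt(2,iband)
            tmp_sfc_src(icol,igpt) = tmp_sfc_src(icol,igpt) + plnksfc(iband,icol)
          end do
        end do
```

Note that “independent” overrides the compiler’s dependence analysis entirely, so only use it when you’re certain no real dependence exists.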

Alternatively, if these arrays could be changed to allocatables, then the compiler can most likely auto-parallelize the implicit array-syntax loops.

I’ve put in a feature request so that OpenACC might allow loop constructs on implicit loops such as these. Though even if adopted into the standard, it won’t be available for some time.


(3) Is there any reason to believe that this perceived dependence could be the reason the OpenACC code is yielding different results than the CPU-only code?

Possible, but difficult for me to tell for sure without further analysis. My thought is that since the array assignment is inside a gang-level loop, all the vectors will execute the implicit array-syntax loop redundantly. This could lead to a read/write race condition on tmp_lay_src, since it appears on both the left- and right-hand sides of the assignment.


My suggestion for this code is to not expand the array syntax, but instead make the “icol” loop a gang vector loop:

!$ACC PARALLEL PRESENT( pfrac ) 
!$ACC LOOP GANG VECTOR
      do icol = 1, ncol

The compiler will still complain about the dependencies, since it will try to auto-parallelize the inner loops (unless you use the flag “-acc=noautopar”), but it will only parallelize the “icol” loop.

The caveat is that you’ll lose some performance, especially if ncol is small, but this should produce correct answers. You can then work on manually expanding the loops to see if you can exploit more parallelism.
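Putting that together, a sketch of the restructured block from your original post (the body is unchanged; only the loop directive differs, so each thread executes its own column’s array-syntax statements sequentially and the race on tmp_sfc_src goes away):

```fortran
!$ACC PARALLEL PRESENT( pfrac )
!$ACC LOOP GANG VECTOR
      do icol = 1, ncol
        ! Entire column handled by one thread: no sharing of
        ! tmp_sfc_src(icol,:) between threads, hence no race.
        tmp_sfc_src(icol,:) = 0
        do iband = 1, nband
          tmp_sfc_src(icol,this_band2gpt(1,iband):this_band2gpt(2,iband)) = &
            tmp_sfc_src(icol,this_band2gpt(1,iband):this_band2gpt(2,iband)) + &
            plnksfc(iband,icol)
        end do
        tmp_sfc_src(icol,:) = tmp_sfc_src(icol,:) * pfrac(:,ilay,icol)
      end do ! icol
!$ACC END PARALLEL
```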

Hope this helps,
Mat