Complex loop that was accelerated in 18.4 is not accelerated in 18.7

Hi,

Here is another example of code not being parallelized in 18.7 when it was in 18.4 and before.

The loop is:

!$acc parallel default(present) present(fj,b)
!$acc loop
      do k=2,npm1
!$acc loop
        do j=2,ntm1
!$acc loop
          do i=1,nrm1
            fj%r(i,j,k)=( ( st(j  )*b%p(i,j  ,k)                     -st
     &(j-1)*b%p(i,j-1,k))*dth_i(j)                   -(b%t(i,j,k)-b%t(i,
     &j,k-1))*dp_mult*dph_i(k)                  )*r_i(i)*sth_i(j)
          enddo
        enddo
      enddo
c
!$acc loop
      do k=2,npm1
!$acc loop
        do j=jm0,jm1
          do i=2,nrm1
        fj%t(i,j,k)=( (b%r(i,j,k)-b%r(i,j,k-1))*dp_mult*dph_i(k)*st_i(j)
     &                   -( r(i  )*b%p(i  ,j,k)                     -r(i
     &-1)*b%p(i-1,j,k))*drh_i(i)                  )*rh_i(i)
          enddo
          fj%t( 1,j,k)=(fj%t(2,j,k)+(fj%t(2,j,k)-fj%t(2+1,j,k))*dr(2-1)*
     &dr_i(2))
          fj%t(nr,j,k)=(fj%t(nrm1,j,k)+(fj%t(nrm1,j,k)-fj%t(nrm1-1,j,k))
     &*dr(nrm1)*dr_i(nrm1-1))
        enddo
      enddo
...

and the compiler says:

  18274, Generating present(fj)
         Generating implicit present(dr_i(:),st(1:ntm1),r_i(1:nrm1),dr(:),dph_i(2:npm1),sth_i(2:ntm1),st_i(jm0:jm1),r(1:nrm1),drh_i(2:nrm1),rh_i(2:nrm1))
         Generating present(b)
  18278, Loop is parallelizable
  18280, Loop is parallelizable
  18291, Loop is parallelizable
  18292, Complex loop carried dependence of b%r$p,b%p$p,r,fj%t$p prevents parallelization
  18307, Loop is parallelizable
  18308, Complex loop carried dependence of b%t$p,r,b%r$p,fj%p$p prevents parallelization
set_pole_bc_avec_acc:

I know having those two boundary lines in the second loop is strange but it seemed to work before. Is there a better way to do this?

-Ron

Hi Ron,

Can you post the complete compiler feedback messages for this loop? Also, what are the line numbers for this code? (So I can correlate them to the feedback).

My assumption is that the “Complex loop carried dependence” messages are for the loops which don’t have an “acc loop” directive on them. In that case the compiler is applying loop dependency analysis, but since the variables are pointers, it can’t auto-parallelize those loops.
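If the undecorated inner loops are the ones being analyzed, one possible restructuring is to give every loop an explicit directive by moving the two boundary lines into a loop nest of their own. The following is only a sketch under stated assumptions, not a confirmed fix for 18.7: it uses free-form Fortran, plain arrays (`ft`, `br`, `dphi`) in place of `fj%t`, `b%r` and `dph_i`, a simplified stencil, a uniform-grid extrapolation at the boundaries, and a hypothetical routine name `curl_t_split`. It also manages its own data instead of relying on `default(present)`.

```fortran
! Sketch only: all names here are hypothetical stand-ins for the posted code.
module split_boundary_sketch
  implicit none
contains
  subroutine curl_t_split(ft, br, dphi, nr, nt, np)
    integer, intent(in) :: nr, nt, np          ! assumes nr >= 3
    real(8), intent(in) :: br(nr, nt, np), dphi(np)
    real(8), intent(inout) :: ft(nr, nt, np)
    integer :: i, j, k

    !$acc data copyin(br, dphi) copy(ft)

    ! Interior points: the nest is tightly nested, so a single
    ! directive with collapse(3) covers all three loops.
    !$acc parallel loop collapse(3) default(present)
    do k = 2, np
      do j = 1, nt
        do i = 2, nr - 1
          ft(i, j, k) = (br(i, j, k) - br(i, j, k - 1)) * dphi(k)
        end do
      end do
    end do

    ! Boundary points: a separate compute region, so it starts only
    ! after the interior values above have been written.
    !$acc parallel loop collapse(2) default(present)
    do k = 2, np
      do j = 1, nt
        ft(1, j, k)  = 2.0d0 * ft(2, j, k) - ft(3, j, k)
        ft(nr, j, k) = 2.0d0 * ft(nr - 1, j, k) - ft(nr - 2, j, k)
      end do
    end do

    !$acc end data
  end subroutine curl_t_split
end module split_boundary_sketch
```

The boundary nest is placed in its own `parallel` construct rather than as a second `loop` inside the first one, because gang loops inside a single `parallel` region are not separated by a barrier, and the boundary lines read values written by the interior loop.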

What’s missing from the information you posted is whether the compiler still successfully offloaded and parallelized the loops decorated with “acc loop”.

-Mat

Hi,
There is no additional compiler feedback for the loops in the code. The code is:

 18274	!$acc parallel default(present) present(fj,b)
 18275	!$acc loop
 18276	      do k=2,npm1
 18277	!$acc loop
 18278	        do j=2,ntm1
 18279	!$acc loop
 18280	          do i=1,nrm1
 18281	            fj%r(i,j,k)=( ( st(j  )*b%p(i,j  ,k)                     -st
 18282	     &(j-1)*b%p(i,j-1,k))*dth_i(j)                   -(b%t(i,j,k)-b%t(i,
 18283	     &j,k-1))*dp_mult*dph_i(k)                  )*r_i(i)*sth_i(j)
 18284	          enddo
 18285	        enddo
 18286	      enddo
 18287	c
 18288	!$acc loop
 18289	      do k=2,npm1
 18290	!$acc loop
 18291	        do j=jm0,jm1
 18292	          do i=2,nrm1
 18293	        fj%t(i,j,k)=( (b%r(i,j,k)-b%r(i,j,k-1))*dp_mult*dph_i(k)*st_i(j)
 18294	     &                   -( r(i  )*b%p(i  ,j,k)                     -r(i
 18295	     &-1)*b%p(i-1,j,k))*drh_i(i)                  )*rh_i(i)
 18296	          enddo
 18297	          fj%t( 1,j,k)=(fj%t(2,j,k)+(fj%t(2,j,k)-fj%t(2+1,j,k))*dr(2-1)*
 18298	     &dr_i(2))
 18299	          fj%t(nr,j,k)=(fj%t(nrm1,j,k)+(fj%t(nrm1,j,k)-fj%t(nrm1-1,j,k))
 18300	     &*dr(nrm1)*dr_i(nrm1-1))
 18301	        enddo
 18302	      enddo
 18303	c
 18304	!$acc loop
 18305	      do k=1,npm1
 18306	!$acc loop
 18307	        do j=2,ntm1
 18308	          do i=2,nrm1
 18309	            fj%p(i,j,k)=( ( r(i  )*b%t(i  ,j,k)                     -r(i
 18310	     &-1)*b%t(i-1,j,k))*drh_i(i)                   -(b%r(i,j,k)-b%r(i,j-
 18311	     &1,k))*dth_i(j)                  )*rh_i(i)
 18312	          enddo
 18313	          fj%p( 1,j,k)=(fj%p(2,j,k)+(fj%p(2,j,k)-fj%p(2+1,j,k))*dr(2-1)*
 18314	     &dr_i(2))
 18315	          fj%p(nr,j,k)=(fj%p(nrm1,j,k)+(fj%p(nrm1,j,k)-fj%p(nrm1-1,j,k))
 18316	     &*dr(nrm1)*dr_i(nrm1-1))
 18317	        enddo
 18318	      enddo
 18319	!$acc end parallel

You are correct: I believe this is still running on the GPU, and it is the non-“acc” loops that it is complaining about (I just checked my 18.4 output and it is the same).
This reminds me of an issue I have been meaning to raise: if I do not put “acc loop” on a loop in a parallel region, shouldn’t the compiler NOT try to parallelize it? I understand that in “kernels” the compiler is expected to do what it can automatically, but since “parallel” is more descriptive of what I really want, I do not think it should be trying to parallelize loops that I have not marked with “acc loop”.
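For reference, OpenACC does provide a way to state this intent explicitly: the `seq` clause on the `loop` directive marks a loop to be run sequentially. The sketch below shows where it would go in a loop shaped like the ones above. It is an illustration only, written in free-form Fortran with plain arrays and hypothetical names (`scale_rows`, `a`, `b`), and it is an assumption, not something verified here, that adding the clause changes the feedback printed by 18.7.

```fortran
! Sketch only: hypothetical names, plain arrays instead of fj%t and b%r.
module seq_inner_sketch
  implicit none
contains
  subroutine scale_rows(a, b, n, m)
    integer, intent(in) :: n, m                ! assumes n >= 3
    real(8), intent(in) :: b(n, m)
    real(8), intent(inout) :: a(n, m)
    integer :: i, j

    ! Only the outer loop is parallelized across gangs and vector lanes.
    !$acc parallel loop gang vector copyin(b) copy(a)
    do j = 1, m
      ! seq: this loop is run sequentially inside each j iteration.
      !$acc loop seq
      do i = 2, n - 1
        a(i, j) = 2.0d0 * b(i, j)
      end do
      ! Boundary lines run after the inner loop, in the same j iteration.
      a(1, j) = a(2, j)
      a(n, j) = a(n - 1, j)
    end do
  end subroutine scale_rows
end module seq_inner_sketch
```

One trade-off of this form is that the fastest-varying index `i` is no longer spread across vector lanes, which may cost memory coalescing on the GPU compared with parallelizing the inner loop.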

-Ron