Simple assignment not parallelizing in 18.7 - worked in 18.4

Hello,

I just installed the new 18.7 compiler and am testing my OpenACC Fortran code on it.

The first problem is that the following code (which parallelized with every compiler version up to and including 18.4) no longer does:

      allocate (v_shear_t(ntm,np))
      allocate (v_shear_p(nt,npm))
c
!$acc enter data create(v_shear_t,v_shear_p)
!$acc kernels default(present)
      v_shear_t=0.
      v_shear_p=0.
!$acc end kernels

The compiler now spits out:

  10237, Generating enter data create(v_shear_t(:,:),v_shear_p(:,:))
  10238, Generating implicit present(v_shear_t(:,:),v_shear_p(:,:))
  10239, Loop carried dependence due to exposed use of v_shear_t(:,:) prevents parallelization
         Parallelization would require privatization of array v_shear_t(:,:)
         Accelerator serial kernel generated
         Accelerator kernel generated
         Generating Tesla code
      10239, !$acc loop seq

I REALLY do not want to have to expand all these statements into explicit loops, as there are MANY of them.
Can you please submit this to the engineers for the next update?
Thanks,
Ron

Hi Ron,

I tried to reproduce your issue here but it works fine for me. Can you please post a reproducing example or modify the one below to better show the error?

Thanks,
Mat

% cat test2.f90

function foo(nt,ntm,np,npm)

     integer :: nt,ntm,np,npm,i,j
     real(8) :: rc, foo
     double precision, dimension(:,:),allocatable :: v_shear_t, v_shear_p
     allocate (v_shear_t(ntm,np))
     allocate (v_shear_p(ntm,np))
      rc = 0.
!$acc enter data create(v_shear_t,v_shear_p)
!$acc kernels default(present)
      v_shear_t=0.
      v_shear_p=0.
!$acc end kernels

!$acc kernels loop
      do i=1,ntm
        do j=1,np
          rc = rc + v_shear_t(i,j)*v_shear_p(i,j)
        enddo
      enddo
      foo=rc
end function foo


sky4:/local/home/colgrove% pgf90 -c test2.f90 -ta=tesla:cc70 -Minfo=accel -V18.7
foo:
     10, Generating enter data create(v_shear_t(:,:),v_shear_p(:,:))
     11, Generating implicit present(v_shear_t(:,:),v_shear_p(:,:))
     12, Loop is parallelizable
         Accelerator kernel generated
         Generating Tesla code
         12, !$acc loop gang, vector(4) ! blockidx%y threadidx%y
             !$acc loop gang, vector(32) ! blockidx%x threadidx%x
     13, Loop is parallelizable
         Accelerator kernel generated
         Generating Tesla code
         13, !$acc loop gang, vector(4) ! blockidx%y threadidx%y
             !$acc loop gang, vector(32) ! blockidx%x threadidx%x
     16, Generating implicit copyin(v_shear_t(1:ntm,1:np),v_shear_p(1:ntm,1:np))
     17, Loop is parallelizable
     18, Loop is parallelizable
         Accelerator kernel generated
         Generating Tesla code
         17, !$acc loop gang, vector(128) ! blockidx%x threadidx%x
         18, !$acc loop gang ! blockidx%y
         19, Generating implicit reduction(+:rc)

Hi,
Using your test code I have figured out that the problem only happens when the arrays are from a module as follows:

PGI-2018: ~/Desktop/bugtest1 $ cat test3.f 
      module shear_profile
      real(8),dimension(:,:),allocatable :: v_shear_t, v_shear_p
      end module

      function foo(nt,ntm,np,npm)
      use shear_profile

      integer :: nt,ntm,np,npm,i,j
      real(8) :: rc, foo
      allocate (v_shear_t(ntm,np))
      allocate (v_shear_p(ntm,np))
      rc = 0.
!$acc enter data create(v_shear_t,v_shear_p)
!$acc kernels default(present)
      v_shear_t=0.
      v_shear_p=0.
!$acc end kernels

!$acc kernels loop default(present)
      do i=1,ntm
        do j=1,np
          rc = rc + v_shear_t(i,j)*v_shear_p(i,j)
        enddo
      enddo
      foo=rc

      end function foo 
PGI-2018: ~/Desktop/bugtest1 $ pgf90 -O3 -c test3.f -ta=tesla:cc60 -Minfo=accel
foo:
      0, Accelerator kernel generated
         Generating Tesla code
     15, Generating enter data create(v_shear_t(:,:),v_shear_p(:,:))
     16, Generating implicit present(v_shear_t(:,:),v_shear_p(:,:))
     17, Loop carried dependence due to exposed use of v_shear_t(:,:) prevents parallelization
         Parallelization would require privatization of array v_shear_t(:,:)
         Accelerator serial kernel generated
         Accelerator kernel generated
         Generating Tesla code
         17, !$acc loop seq
     18, Loop carried dependence due to exposed use of v_shear_p(:,:) prevents parallelization
         Parallelization would require privatization of array v_shear_p(:,:)
         Accelerator serial kernel generated
         Accelerator kernel generated
         Generating Tesla code
         18, !$acc loop seq
     21, Generating implicit present(v_shear_t(1:ntm,1:np),v_shear_p(1:ntm,1:np))
     22, Loop is parallelizable
     23, Loop is parallelizable
         Accelerator kernel generated
         Generating Tesla code
         22, !$acc loop gang, vector(128) ! blockidx%x threadidx%x
         23, !$acc loop gang ! blockidx%y
         24, Generating implicit reduction(+:rc)
PGI-2018: ~/Desktop/bugtest1 $

As a thought - have the rules for when it is necessary to use “declare” changed in OpenACC 3.0? As far as I understand, it is only needed if the array is referenced inside a routine called from within a compute region. Is this correct?
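For reference, this is the kind of case where I thought “declare” was required - a module array referenced directly inside a routine that gets called from a compute region (a sketch only; the subroutine and its arguments are made up for illustration):

```fortran
      module shear_profile
      real(8), dimension(:,:), allocatable :: v_shear_t
!$acc declare create(v_shear_t)
      contains
      subroutine scale_t(i,j,f)
!$acc routine seq
      integer :: i,j
      real(8) :: f
! The module array is referenced directly in device code here,
! so it needs to be in a declare clause rather than only in an
! enter data region.
      v_shear_t(i,j) = v_shear_t(i,j)*f
      end subroutine
      end module
```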

Hi,

Any progress on this issue?

The lack of parallelization in this case is killing my code’s performance. Running my code using 18.7 is 25% slower than running it with 18.4.

Thanks!

- Ron

Hi Ron,

Apologies that I missed your follow-up post. I was flying back from China that day.

I just added a problem report (TPR#26335) and sent it off to engineering.

Unfortunately I don’t have a good workaround for you other than to rewrite these as explicit loops instead of using array syntax. If there are only a couple of arrays, the switch is easy, but if there are many, it will be a pain.

-Mat

% cat ron.F90
      module shear_profile
      real(8),dimension(:,:),allocatable :: v_shear_t, v_shear_p
      end module

      function foo(nt,ntm,np,npm)
      use shear_profile

      integer :: nt,ntm,np,npm,i,j
      real(8) :: rc, foo

      allocate (v_shear_t(ntm,np))
      allocate (v_shear_p(ntm,np))
      rc = 0.
!$acc enter data create(v_shear_t,v_shear_p)
!$acc kernels default(present)
      do i=1,ntm
        do j=1,np
          v_shear_t(i,j)=0.
          v_shear_p(i,j)=0.
        enddo
      enddo
!$acc end kernels

!$acc kernels loop default(present)
      do i=1,ntm
        do j=1,np
          rc = rc + v_shear_t(i,j)*v_shear_p(i,j)
        enddo
      enddo
      foo=rc

      end function foo

% pgf90 -ta=tesla:cc70 -Minfo=accel -c ron.F90 -V18.7
foo:
     14, Generating enter data create(v_shear_t(:,:),v_shear_p(:,:))
     15, Generating implicit present(v_shear_p(1:ntm,1:np),v_shear_t(1:ntm,1:np))
     16, Loop is parallelizable
     17, Loop is parallelizable
         Accelerator kernel generated
         Generating Tesla code
         16, !$acc loop gang, vector(128) ! blockidx%x threadidx%x
         17, !$acc loop gang ! blockidx%y
     24, Generating implicit present(v_shear_t(1:ntm,1:np),v_shear_p(1:ntm,1:np))
     25, Loop is parallelizable
     26, Loop is parallelizable
         Accelerator kernel generated
         Generating Tesla code
         25, !$acc loop gang, vector(128) ! blockidx%x threadidx%x
         26, !$acc loop gang ! blockidx%y
         27, Generating implicit reduction(+:rc)

Hi Ron,

I got an update from engineering:

Although allocatable array assignment within an accelerator region mostly reverts to Fortran 95 behavior in device code, we are still seeing a performance impact (the compiler fails to parallelize the code) due to the compiler-created allocate, reallocate, deallocate, and conformability calls. We are actively working on a fix for a future release.
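To illustrate why those calls get generated (a sketch of my own, not from the engineering report): under Fortran 2003 semantics, whole-array assignment to an allocatable may (re)allocate the left-hand side, while an array-section assignment never does:

```fortran
      real(8), dimension(:,:), allocatable :: a, b
      allocate (b(10,20))
      b = 1.d0
! F2003: "a = b" must check conformability and (re)allocate
! "a" to match the shape of "b", so the compiler inserts
! allocate/reallocate/deallocate support calls around it.
      a = b
! An array-section assignment cannot reallocate; the shapes
! must already conform, so no such calls are needed.
      a(:,:) = b(:,:)
```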

For more information about Fortran 2003 allocatable arrays, please see:

-Mat

Thanks for the update.

Great blog post!

I have changed all my “a=b” statements to a(:,:,:)=b(:,:,:) and it looks like the code is parallelizing all statements.
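Applied to the 2-D example from earlier in this thread, that change looks like:

```fortran
!$acc kernels default(present)
      v_shear_t(:,:) = 0.
      v_shear_p(:,:) = 0.
!$acc end kernels
```

With explicit array sections the compiler no longer has to consider reallocation of the left-hand side, so both assignments generate parallel kernels.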

FYI for others coming to this thread, the issue should be fixed in versions 19.5 and above.