Cache directive with derived type problem

Hi,

I am trying to do this:

!$acc parallel default(present) present(ps) async(1)
!$acc loop
      do k=2,npm1
!$acc loop
        do j=2,ntm1
!$acc loop
          do i=2,nrm-1
!$acc cache(ps%r(i,j,k),ps%t(i,j,k),ps%p(i,j,k))
            ii=ntm2*(nrm-2)*(k-2)+(nrm-2)*(j-2)+(i-1)
            q(ii)=a_r( i,j,k,1)*ps%r(i  ,j  ,k-1)
     &           +a_r( i,j,k,2)*ps%r(i  ,j-1,k  )
     &           +a_r( i,j,k,3)*ps%r(i-1,j  ,k  )
     &           +a_r( i,j,k,4)*ps%r(i  ,j  ,k  )
     &           +a_r( i,j,k,5)*ps%r(i+1,j  ,k  )
     &           +a_r( i,j,k,6)*ps%r(i  ,j+1,k  )
     &           +a_r( i,j,k,7)*ps%r(i  ,j  ,k+1)
     &           +a_r( i,j,k,8)*ps%t(i  ,j-1,k  )
     &           +a_r( i,j,k,9)*ps%t(i+1,j-1,k  )
     &           +a_r(i,j,k,10)*ps%t(i  ,j  ,k  )
     &           +a_r(i,j,k,11)*ps%t(i+1,j  ,k  )
     &           +a_r(i,j,k,12)*ps%p(i  ,j  ,k-1)
     &           +a_r(i,j,k,13)*ps%p(i+1,j  ,k-1)
     &           +a_r(i,j,k,14)*ps%p(i  ,j  ,k  )
     &           +a_r(i,j,k,15)*ps%p(i+1,j  ,k  )
          enddo
        enddo
      enddo
!$acc end parallel

and am getting this error:

PGF90-S-0155-Compiler failed to translate accelerator region (see -Minfo messages): Could not find allocated-variable index for symbol (mas_sed_expmac.f: 23885)

but then I also see this:

  23885, Generating present(ps)
         Accelerator kernel generated
         Generating Tesla code
      23887, !$acc loop gang ! blockidx%x
      23889, !$acc loop seq
      23891, !$acc loop vector(128) ! threadidx%x
             Cached references to size [(x)] block of t,r,p
  23889, Loop is parallelizable
  23891, Loop is parallelizable

It seems the cache directive doesn’t like my derived type arrays…

I would ask for the whole file mas_sed_expmac.f, but it looks like it
is huge.

If you could send the function/subroutine containing line 23885, along
with the sources of any modules or headers it uses, there
is a chance we could get this on a path to correction.

We can’t compile what you sent; there is not enough there.

We would also like to know the output of

pgfortran -V

(to get the CPU type), and what your failing compile line looks like.

You should be able to successfully compile the file with and w/o
-acc in the compile line.


dave

Hi,

The full routine is:

      subroutine one_minus_div_grad_v (ps,q)
c
      use number_types
      use types
      use globals
      use matrix_storage_v_solve
c
      implicit none
c
      type(vvec) :: ps
      real(r_typ), dimension(N_cgvec) :: q
c
      integer :: i,j,k,ii
c
!$acc parallel default(present) present(ps) async(1)
!$acc loop
      do k=2,npm1
!$acc loop
        do j=2,ntm1
!$acc loop
          do i=2,nrm-1
!$acc cache(ps%r(i,j,k),ps%t(i,j,k),ps%p(i,j,k))
            ii=ntm2*(nrm-2)*(k-2)+(nrm-2)*(j-2)+(i-1)
            q(ii)=a_r( i,j,k,1)*ps%r(i  ,j  ,k-1)
     &           +a_r( i,j,k,2)*ps%r(i  ,j-1,k  )
     &           +a_r( i,j,k,3)*ps%r(i-1,j  ,k  )
     &           +a_r( i,j,k,4)*ps%r(i  ,j  ,k  )
     &           +a_r( i,j,k,5)*ps%r(i+1,j  ,k  )
     &           +a_r( i,j,k,6)*ps%r(i  ,j+1,k  )
     &           +a_r( i,j,k,7)*ps%r(i  ,j  ,k+1)
     &           +a_r( i,j,k,8)*ps%t(i  ,j-1,k  )
     &           +a_r( i,j,k,9)*ps%t(i+1,j-1,k  )
     &           +a_r(i,j,k,10)*ps%t(i  ,j  ,k  )
     &           +a_r(i,j,k,11)*ps%t(i+1,j  ,k  )
     &           +a_r(i,j,k,12)*ps%p(i  ,j  ,k-1)
     &           +a_r(i,j,k,13)*ps%p(i+1,j  ,k-1)
     &           +a_r(i,j,k,14)*ps%p(i  ,j  ,k  )
     &           +a_r(i,j,k,15)*ps%p(i+1,j  ,k  )
          enddo
        enddo
      enddo
!$acc end parallel
c
!$acc parallel default(present) present(ps) async(2)
!$acc loop
      do k=2,npm1
!$acc loop
        do j=2,ntm-1
!$acc loop
          do i=2,nrm1
!$acc cache(ps%r(i,j,k),ps%t(i,j,k),ps%p(i,j,k))
            ii=(npm2*ntm2*(nrm-2))
     &         +(ntm-2)*nrm2*(k-2)+nrm2*(j-2)+(i-1)
            q(ii)=
     &           a_t(i,j,k, 1)*ps%r(i-1,j  ,k  )
     &          +a_t(i,j,k, 2)*ps%r(i  ,j  ,k  )
     &          +a_t(i,j,k, 3)*ps%r(i-1,j+1,k  )
     &          +a_t(i,j,k, 4)*ps%r(i  ,j+1,k  )
     &          +a_t(i,j,k, 5)*ps%t(i  ,j  ,k-1)
     &          +a_t(i,j,k, 6)*ps%t(i  ,j-1,k  )
     &          +a_t(i,j,k, 7)*ps%t(i-1,j  ,k  )
     &          +a_t(i,j,k, 8)*ps%t(i  ,j  ,k  )
     &          +a_t(i,j,k, 9)*ps%t(i+1,j  ,k  )
     &          +a_t(i,j,k,10)*ps%t(i  ,j+1,k  )
     &          +a_t(i,j,k,11)*ps%t(i  ,j  ,k+1)
     &          +a_t(i,j,k,12)*ps%p(i  ,j  ,k-1)
     &          +a_t(i,j,k,13)*ps%p(i  ,j+1,k-1)
     &          +a_t(i,j,k,14)*ps%p(i  ,j  ,k  )
     &          +a_t(i,j,k,15)*ps%p(i  ,j+1,k  )
          enddo
        enddo
      enddo
!$acc end parallel
c
!$acc parallel default(present) present(ps) async(3)
!$acc loop
      do k=2,npm-1
!$acc loop
        do j=2,ntm1
!$acc loop
          do i=2,nrm1
!$acc cache(ps%r(i,j,k),ps%t(i,j,k),ps%p(i,j,k))
            ii=(npm2*ntm2*(nrm-2))+(npm2*(ntm-2)*nrm2)
     &         +ntm2*nrm2*(k-2)+nrm2*(j-2)+(i-1)
            q(ii)=
     &            a_p(i,j,k, 1)*ps%r(i-1,j  ,k  )
     &           +a_p(i,j,k, 2)*ps%r(i  ,j  ,k  )
     &           +a_p(i,j,k, 3)*ps%r(i-1,j  ,k+1)
     &           +a_p(i,j,k, 4)*ps%r(i  ,j  ,k+1)
     &           +a_p(i,j,k, 5)*ps%t(i  ,j-1,k  )
     &           +a_p(i,j,k, 6)*ps%t(i  ,j  ,k  )
     &           +a_p(i,j,k, 7)*ps%t(i  ,j-1,k+1)
     &           +a_p(i,j,k, 8)*ps%t(i  ,j  ,k+1)
     &           +a_p(i,j,k, 9)*ps%p(i  ,j  ,k-1)
     &           +a_p(i,j,k,10)*ps%p(i  ,j-1,k  )
     &           +a_p(i,j,k,11)*ps%p(i-1,j  ,k  )
     &           +a_p(i,j,k,12)*ps%p(i  ,j  ,k  )
     &           +a_p(i,j,k,13)*ps%p(i+1,j  ,k  )
     &           +a_p(i,j,k,14)*ps%p(i  ,j+1,k  )
     &           +a_p(i,j,k,15)*ps%p(i  ,j  ,k+1)
          enddo
        enddo
      enddo
!$acc end parallel
c
!$acc wait
c
      end subroutine

The relevant types are:

      module number_types
c
      use iso_fortran_env
c
      implicit none
c
      integer, parameter :: KIND_REAL_8=REAL64
c
      integer, private, parameter :: r8=KIND_REAL_8
c
      integer, parameter :: r_typ=r8
      end module



      module types
c
      use number_types
c
      implicit none
c
      type :: vvec
        real(r_typ), dimension(:,:,:), allocatable :: r !(nrm,nt,np)
        real(r_typ), dimension(:,:,:), allocatable :: t !(nr,ntm,np)
        real(r_typ), dimension(:,:,:), allocatable :: p !(nr,nt,npm)
      end type
      end module

The arrays a_r, a_t, and a_p are simple allocatable arrays in the matrix module. Their sizes are:

      allocate (a_r(2:nrm-1, 2:ntm1,  2:npm1  ,15))
      allocate (a_t(2:nrm1,  2:ntm-1, 2:npm1  ,15))
      allocate (a_p(2:nrm1,  2:ntm1,  2:npm-1 ,15))

The value of N_cgvec is:

      N_vr=(nrm-2)*ntm2*npm2
      N_vt=nrm2*(ntm-2)*npm2
      N_vp=nrm2*ntm2*(npm-2)
c
      N_cgvec=N_vr+N_vt+N_vp

I forgot the other info:

PREDSCI-GPU2: ~/Dropbox/PSI/MAS/MAS_SVN_LOCAL_BRANCHES/mas_openacc/axidx $ pgfortran -V

pgfortran 17.9-0 64-bit target on x86-64 Linux -tp haswell 
PGI Compilers and Tools
Copyright (c) 2017, NVIDIA CORPORATION.  All rights reserved.

The failing compile lines are as I showed in my original post.

Hi Ron,

The problem here is that we don’t support the use of derived type members in the cache clause. I’ve added an RFE (TPR#24957) and sent it to engineering to see what we can do.

-Mat
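
Until that RFE is addressed, one possible workaround is to hand the derived type members to a helper routine as plain arrays, so that the cache directive only names ordinary dummy arguments. The following is a minimal free-form sketch, not something confirmed in this thread: the module vvec_sketch, the helper stencil_r, and the 3-point stencil are hypothetical, and it assumes the compiler accepts cache on plain array arguments. The real code would presumably keep default(present) rather than the copyin/copy clauses used here to make the example standalone.

```fortran
! Sketch of a possible workaround (an assumption, not confirmed by PGI):
! pass the derived type member as a plain array so that the cache
! directive names an ordinary dummy argument instead of ps%r.
module vvec_sketch
  use iso_fortran_env, only: real64
  implicit none
  type :: vvec
    real(real64), dimension(:,:,:), allocatable :: r
  end type
contains
  ! Hypothetical helper: a 3-point stencil in i applied to a plain array.
  subroutine stencil_r(r, q, n1, n2, n3)
    integer, intent(in) :: n1, n2, n3
    real(real64), intent(in)  :: r(n1,n2,n3)
    real(real64), intent(out) :: q(n1,n2,n3)
    integer :: i, j, k
    q = 0.0_real64
!$acc parallel copyin(r) copy(q)
!$acc loop
    do k = 1, n3
!$acc loop
      do j = 1, n2
!$acc loop
        do i = 2, n1-1
          ! Cache the three elements the stencil reads along i.
!$acc cache(r(i-1:i+1,j,k))
          q(i,j,k) = r(i-1,j,k) + r(i,j,k) + r(i+1,j,k)
        enddo
      enddo
    enddo
!$acc end parallel
  end subroutine
end module

program cache_workaround
  use vvec_sketch
  implicit none
  integer, parameter :: n = 8
  type(vvec) :: ps
  real(real64) :: q(n,n,n)
  allocate (ps%r(n,n,n))
  ps%r = 1.0_real64
  ! The member is passed as an ordinary array actual argument.
  call stencil_r(ps%r, q, n, n, n)
  ! With r = 1 everywhere, each interior point sums three ones.
  if (abs(q(2,1,1) - 3.0_real64) > 1.0e-12_real64) error stop 'bad result'
  print *, 'stencil check passed'
end program
```

The directives are plain comments when the file is built without -acc, so the same source can be checked on the host first. Whether the cache directive actually yields a speedup for a given stencil would still need to be measured.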

Cool thanks!

I am trying to speed up my stencil code as much as possible.

Does the tile clause work with derived type members?

If the cache clause is not supported, does that mean the use of shared memory by the compiler is not happening with derived type members as well?

> Does the tile clause work with derived type members?

Tile is a loop clause, so it functions regardless of what data types are used in the loop.

> If the cache clause is not supported, does that mean the use of shared memory by the compiler is not happening with derived type members as well?

If the data is in global memory, then the compiler won’t put it into shared memory.

-Mat
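
To illustrate the point about tile, here is a small free-form sketch of the clause applied to a loop nest that reads a derived type member. The program tile_sketch, the 2D 4-point stencil, and the tile sizes 32 and 4 are illustrative assumptions and are not taken from the MAS code. The two enter data directives follow the usual manual deep copy pattern (parent first, then the member), which is assumed here to stand in for however the real code makes ps present.

```fortran
! Sketch: the tile clause on a loop nest that reads a derived type member.
! All names and sizes here are illustrative assumptions.
program tile_sketch
  use iso_fortran_env, only: real64
  implicit none
  type :: vvec
    real(real64), dimension(:,:), allocatable :: r
  end type
  integer, parameter :: n = 64
  type(vvec) :: ps
  real(real64) :: q(n,n)
  integer :: i, j

  allocate (ps%r(n,n))
  ps%r = 1.0_real64
  q = 0.0_real64

  ! Manual deep copy: create the parent on the device, then the member.
!$acc enter data copyin(ps)
!$acc enter data copyin(ps%r)

  ! tile applies to the two tightly nested loops that follow;
  ! the member ps%r is used inside the loop body as usual.
!$acc parallel loop tile(32,4) present(ps) copy(q)
  do j = 2, n-1
    do i = 2, n-1
      q(i,j) = ps%r(i-1,j) + ps%r(i+1,j) + ps%r(i,j-1) + ps%r(i,j+1)
    enddo
  enddo

!$acc exit data delete(ps%r)
!$acc exit data delete(ps)

  ! With r = 1 everywhere, each interior point sums four ones.
  if (abs(q(2,2) - 4.0_real64) > 1.0e-12_real64) error stop 'bad result'
  print *, 'tile check passed'
end program
```

Tiling changes how iterations are grouped, not where the data lives, so by itself it does not place ps%r in shared memory. Suitable tile sizes depend on the device and the stencil and would need tuning.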

Has the cache clause been made compatible with derived types yet (18.4)?

Hi Ron,

No, sorry, not yet. I went over and poked the compiler engineer to whom this is assigned. Hopefully he can get it in in the near future.

-Mat

Has the cache clause been made compatible with derived types yet (19.5)?