Cache directive with derived type problem

caplanr · November 21, 2017, 9:27pm

Hi,

I am trying to do this:

!$acc parallel default(present) present(ps) async(1)
!$acc loop
      do k=2,npm1
!$acc loop
        do j=2,ntm1
!$acc loop
          do i=2,nrm-1
!$acc cache(ps%r(i,j,k),ps%t(i,j,k),ps%p(i,j,k))
            ii=ntm2*(nrm-2)*(k-2)+(nrm-2)*(j-2)+(i-1)
            q(ii)=a_r( i,j,k,1)*ps%r(i  ,j  ,k-1)
     &           +a_r( i,j,k,2)*ps%r(i  ,j-1,k  )
     &           +a_r( i,j,k,3)*ps%r(i-1,j  ,k  )
     &           +a_r( i,j,k,4)*ps%r(i  ,j  ,k  )
     &           +a_r( i,j,k,5)*ps%r(i+1,j  ,k  )
     &           +a_r( i,j,k,6)*ps%r(i  ,j+1,k  )
     &           +a_r( i,j,k,7)*ps%r(i  ,j  ,k+1)
     &           +a_r( i,j,k,8)*ps%t(i  ,j-1,k  )
     &           +a_r( i,j,k,9)*ps%t(i+1,j-1,k  )
     &           +a_r(i,j,k,10)*ps%t(i  ,j  ,k  )
     &           +a_r(i,j,k,11)*ps%t(i+1,j  ,k  )
     &           +a_r(i,j,k,12)*ps%p(i  ,j  ,k-1)
     &           +a_r(i,j,k,13)*ps%p(i+1,j  ,k-1)
     &           +a_r(i,j,k,14)*ps%p(i  ,j  ,k  )
     &           +a_r(i,j,k,15)*ps%p(i+1,j  ,k  )
          enddo
        enddo
      enddo
!$acc end parallel

and am getting this error:

PGF90-S-0155-Compiler failed to translate accelerator region (see -Minfo messages): Could not find allocated-variable index for symbol (mas_sed_expmac.f: 23885)

but then I also see this:

  23885, Generating present(ps)
         Accelerator kernel generated
         Generating Tesla code
      23887, !$acc loop gang ! blockidx%x
      23889, !$acc loop seq
      23891, !$acc loop vector(128) ! threadidx%x
             Cached references to size [(x)] block of t,r,p
  23889, Loop is parallelizable
  23891, Loop is parallelizable

It seems the cache doesn’t like my derived type arrays…

tull · November 22, 2017, 7:11pm

I would ask for the whole file mas_sed_expmac.f, but it looks like it
is huge.

If you could send the function/subroutine containing line 23885, along
with the sources of any modules or headers that it uses, there
is a chance we could get this fixed.

We can’t compile what you sent. Not enough there.

Also would like to know the output of

pgfortran -V ! to get the cpu type.

and what your failing compile line looks like.

You should be able to successfully compile the file both with and without
-acc in the compile line.

dave

caplanr · November 22, 2017, 10:09pm

Hi,

The full routine is:

      subroutine one_minus_div_grad_v (ps,q)
c
      use number_types
      use types
      use globals
      use matrix_storage_v_solve
c
      implicit none
c
      type(vvec) :: ps
      real(r_typ), dimension(N_cgvec) :: q
c
      integer :: i,j,k,ii
c
!$acc parallel default(present) present(ps) async(1)
!$acc loop
      do k=2,npm1
!$acc loop
        do j=2,ntm1
!$acc loop
          do i=2,nrm-1
!$acc cache(ps%r(i,j,k),ps%t(i,j,k),ps%p(i,j,k))
            ii=ntm2*(nrm-2)*(k-2)+(nrm-2)*(j-2)+(i-1)
            q(ii)=a_r( i,j,k,1)*ps%r(i  ,j  ,k-1)
     &           +a_r( i,j,k,2)*ps%r(i  ,j-1,k  )
     &           +a_r( i,j,k,3)*ps%r(i-1,j  ,k  )
     &           +a_r( i,j,k,4)*ps%r(i  ,j  ,k  )
     &           +a_r( i,j,k,5)*ps%r(i+1,j  ,k  )
     &           +a_r( i,j,k,6)*ps%r(i  ,j+1,k  )
     &           +a_r( i,j,k,7)*ps%r(i  ,j  ,k+1)
     &           +a_r( i,j,k,8)*ps%t(i  ,j-1,k  )
     &           +a_r( i,j,k,9)*ps%t(i+1,j-1,k  )
     &           +a_r(i,j,k,10)*ps%t(i  ,j  ,k  )
     &           +a_r(i,j,k,11)*ps%t(i+1,j  ,k  )
     &           +a_r(i,j,k,12)*ps%p(i  ,j  ,k-1)
     &           +a_r(i,j,k,13)*ps%p(i+1,j  ,k-1)
     &           +a_r(i,j,k,14)*ps%p(i  ,j  ,k  )
     &           +a_r(i,j,k,15)*ps%p(i+1,j  ,k  )
          enddo
        enddo
      enddo
!$acc end parallel
c
!$acc parallel default(present) present(ps) async(2)
!$acc loop
      do k=2,npm1
!$acc loop
        do j=2,ntm-1
!$acc loop
          do i=2,nrm1
!$acc cache(ps%r(i,j,k),ps%t(i,j,k),ps%p(i,j,k))
            ii=(npm2*ntm2*(nrm-2))
     &         +(ntm-2)*nrm2*(k-2)+nrm2*(j-2)+(i-1)
            q(ii)=
     &           a_t(i,j,k, 1)*ps%r(i-1,j  ,k  )
     &          +a_t(i,j,k, 2)*ps%r(i  ,j  ,k  )
     &          +a_t(i,j,k, 3)*ps%r(i-1,j+1,k  )
     &          +a_t(i,j,k, 4)*ps%r(i  ,j+1,k  )
     &          +a_t(i,j,k, 5)*ps%t(i  ,j  ,k-1)
     &          +a_t(i,j,k, 6)*ps%t(i  ,j-1,k  )
     &          +a_t(i,j,k, 7)*ps%t(i-1,j  ,k  )
     &          +a_t(i,j,k, 8)*ps%t(i  ,j  ,k  )
     &          +a_t(i,j,k, 9)*ps%t(i+1,j  ,k  )
     &          +a_t(i,j,k,10)*ps%t(i  ,j+1,k  )
     &          +a_t(i,j,k,11)*ps%t(i  ,j  ,k+1)
     &          +a_t(i,j,k,12)*ps%p(i  ,j  ,k-1)
     &          +a_t(i,j,k,13)*ps%p(i  ,j+1,k-1)
     &          +a_t(i,j,k,14)*ps%p(i  ,j  ,k  )
     &          +a_t(i,j,k,15)*ps%p(i  ,j+1,k  )
          enddo
        enddo
      enddo
!$acc end parallel
c
!$acc parallel default(present) present(ps) async(3)
!$acc loop
      do k=2,npm-1
!$acc loop
        do j=2,ntm1
!$acc loop
          do i=2,nrm1
!$acc cache(ps%r(i,j,k),ps%t(i,j,k),ps%p(i,j,k))
            ii=(npm2*ntm2*(nrm-2))+(npm2*(ntm-2)*nrm2)
     &         +ntm2*nrm2*(k-2)+nrm2*(j-2)+(i-1)
            q(ii)=
     &            a_p(i,j,k, 1)*ps%r(i-1,j  ,k  )
     &           +a_p(i,j,k, 2)*ps%r(i  ,j  ,k  )
     &           +a_p(i,j,k, 3)*ps%r(i-1,j  ,k+1)
     &           +a_p(i,j,k, 4)*ps%r(i  ,j  ,k+1)
     &           +a_p(i,j,k, 5)*ps%t(i  ,j-1,k  )
     &           +a_p(i,j,k, 6)*ps%t(i  ,j  ,k  )
     &           +a_p(i,j,k, 7)*ps%t(i  ,j-1,k+1)
     &           +a_p(i,j,k, 8)*ps%t(i  ,j  ,k+1)
     &           +a_p(i,j,k, 9)*ps%p(i  ,j  ,k-1)
     &           +a_p(i,j,k,10)*ps%p(i  ,j-1,k  )
     &           +a_p(i,j,k,11)*ps%p(i-1,j  ,k  )
     &           +a_p(i,j,k,12)*ps%p(i  ,j  ,k  )
     &           +a_p(i,j,k,13)*ps%p(i+1,j  ,k  )
     &           +a_p(i,j,k,14)*ps%p(i  ,j+1,k  )
     &           +a_p(i,j,k,15)*ps%p(i  ,j  ,k+1)
          enddo
        enddo
      enddo
!$acc end parallel
c
!$acc wait
c
      end subroutine

The relevant types are:

      module number_types
c
      use iso_fortran_env
c
      implicit none
c
      integer, parameter :: KIND_REAL_8=REAL64
c
      integer, private, parameter :: r8=KIND_REAL_8
c
      integer, parameter :: r_typ=r8
      end module

      module types
c
      use number_types
c
      implicit none
c
      type :: vvec
        real(r_typ), dimension(:,:,:), allocatable :: r !(nrm,nt,np)
        real(r_typ), dimension(:,:,:), allocatable :: t !(nr,ntm,np)
        real(r_typ), dimension(:,:,:), allocatable :: p !(nr,nt,npm)
      end type
      end module

The a_r, a_t, and a_p are simple allocatable arrays in the matrix module. Their sizes are:

      allocate (a_r(2:nrm-1, 2:ntm1,  2:npm1  ,15))
      allocate (a_t(2:nrm1,  2:ntm-1, 2:npm1  ,15))
      allocate (a_p(2:nrm1,  2:ntm1,  2:npm-1 ,15))

The value of N_cgvec is:

      N_vr=(nrm-2)*ntm2*npm2
      N_vt=nrm2*(ntm-2)*npm2
      N_vp=nrm2*ntm2*(npm-2)
c
      N_cgvec=N_vr+N_vt+N_vp

caplanr · November 22, 2017, 10:11pm

I forgot the other info:

PREDSCI-GPU2: ~/Dropbox/PSI/MAS/MAS_SVN_LOCAL_BRANCHES/mas_openacc/axidx $ pgfortran -V

pgfortran 17.9-0 64-bit target on x86-64 Linux -tp haswell 
PGI Compilers and Tools
Copyright (c) 2017, NVIDIA CORPORATION.  All rights reserved.

The failing compile lines are as I showed in my original post.

MatColgrove · November 27, 2017, 8:40pm

Hi Ron,

The problem here is that we don’t support the use of derived type members in the cache clause. I’ve added an RFE (TPR#24957) and sent it to engineering to see what we can do.

-Mat
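
Until the RFE is resolved, one possible workaround is to pass the member array to a helper routine, so that the cache directive names a plain array rather than `ps%r`. The following is only a minimal sketch under that assumption: the module `vvec_sketch`, the routine `stencil_plain`, the three-point stencil, and the sizes are all hypothetical, and it has not been verified against PGI 17.9.

```fortran
module vvec_sketch
  implicit none
  integer, parameter :: r_typ = kind(1.0d0)

  ! Cut-down stand-in for the thread's vvec type (one member only).
  type :: vvec
    real(r_typ), dimension(:,:,:), allocatable :: r
  end type

contains

  ! Hypothetical helper: r is a plain dummy array here, so the cache
  ! directive refers to an ordinary array instead of a derived type member.
  subroutine stencil_plain(r, q, n1, n2, n3)
    integer, intent(in) :: n1, n2, n3
    real(r_typ), intent(in)    :: r(n1,n2,n3)
    real(r_typ), intent(inout) :: q(n1,n2,n3)
    integer :: i, j, k

    ! copyin/copy keep this sketch self-contained. Code whose data is
    ! already on the device would use default(present) instead.
    !$acc parallel copyin(r) copy(q)
    !$acc loop
    do k = 2, n3-1
      !$acc loop
      do j = 2, n2-1
        !$acc loop
        do i = 2, n1-1
          ! Ask for the three neighbours along i to be cached.
          !$acc cache(r(i-1:i+1,j,k))
          q(i,j,k) = r(i-1,j,k) + r(i,j,k) + r(i+1,j,k)
        end do
      end do
    end do
    !$acc end parallel
  end subroutine

end module

program cache_workaround_sketch
  use vvec_sketch
  implicit none
  integer, parameter :: n = 16
  type(vvec) :: ps
  real(r_typ), allocatable :: q(:,:,:)

  allocate (ps%r(n,n,n), q(n,n,n))
  ps%r = 1.0_r_typ
  q = 0.0_r_typ

  ! The member is passed as an actual argument; the helper never sees ps.
  call stencil_plain(ps%r, q, n, n, n)

  ! Every interior point sums three ones.
  if (abs(q(2,2,2) - 3.0_r_typ) > 1.0e-12_r_typ) error stop 'unexpected result'
  print *, 'interior value:', q(2,2,2)
end program
```

Because the directives are comments to a compiler run without -acc, the same file can be built either way, which makes it easy to compare results.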

caplanr · December 3, 2017, 9:40pm

Cool thanks!

I am trying to speed up my stencil code as much as possible.

Does the tile clause work with derived type members?

If the cache clause is not supported, does that mean the use of shared memory by the compiler is not happening with derived type members as well?

MatColgrove · December 4, 2017, 3:48pm

> Does the tile clause work with derived type members?

Tile is a loop clause so it doesn’t matter what data types are in the loop in order for it to function.

> If the cache clause is not supported, does that mean the use of shared memory by the compiler is not happening with derived type members as well?

If the data is in global memory, then the compiler won’t put it into shared memory.

Mat
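
To illustrate the point about tile being a loop clause, here is a minimal sketch of a tiled two-dimensional stencil that reads a derived type member. The program name, the cut-down `vvec` type, the array size, and the tile sizes 32 and 4 are all hypothetical choices for illustration, not tuned values, and the data directives assume the manual deep copy pattern (parent first, then member).

```fortran
program tile_sketch
  implicit none
  integer, parameter :: r_typ = kind(1.0d0)
  integer, parameter :: n = 64

  ! Cut-down, two-dimensional stand-in for the thread's vvec type.
  type :: vvec
    real(r_typ), dimension(:,:), allocatable :: r
  end type

  type(vvec) :: ps
  real(r_typ), allocatable :: q(:,:)
  integer :: i, j

  allocate (ps%r(n,n), q(n,n))
  ps%r = 1.0_r_typ
  q = 0.0_r_typ

  ! Manual deep copy: the parent object first, then its allocatable member.
  !$acc enter data copyin(ps)
  !$acc enter data copyin(ps%r)

  ! tile applies to the two tightly nested loops that follow. The first
  ! size (32) goes with the innermost loop (i), the second (4) with j.
  !$acc parallel loop tile(32,4) default(present) copy(q)
  do j = 2, n-1
    do i = 2, n-1
      q(i,j) = ps%r(i-1,j) + ps%r(i+1,j) + ps%r(i,j-1) + ps%r(i,j+1)
    end do
  end do

  ! Release the device copies in the reverse order.
  !$acc exit data delete(ps%r)
  !$acc exit data delete(ps)

  ! Every interior point sums four ones.
  if (abs(q(2,2) - 4.0_r_typ) > 1.0e-12_r_typ) error stop 'unexpected result'
  print *, 'interior value:', q(2,2)
end program
```

The tile clause only reshapes the iteration space; whether the tiled accesses end up in shared memory is still up to the compiler.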

caplanr · May 1, 2018, 7:35pm

Has the cache clause been made compatible with derived types yet (18.4)?

MatColgrove · May 3, 2018, 9:13pm

Hi Ron,

No, sorry, not yet. I went over and poked the compiler engineer to whom this is assigned. Hopefully he can get it in in the near future.

-Mat

caplanr · June 20, 2019, 1:06am

Has the cache clause been made compatible with derived types yet (19.5)?
