Private array in acc loop

Hi, I would like to ask for help compiling my OpenACC program. A simplified version is shown below:

module gulemath
contains

     SUBROUTINE pnmcc (A,N,Ifct,fct)
     !$acc routine  seq
  IMPLICIT NONE
 INTEGER*4          N,Ifct(:)
 COMPLEX*16         A(:)
 DOUBLE PRECISION   fct(:)

! INTEGER*4 N,Ifct(6*N+150)
! COMPLEX*16 A(N)
! DOUBLE PRECISION fct(6*N+150)
INTEGER*4 K,J

  DO K=1,N
	  A(K)=N
  ENDDO

  END

end module gulemath

program gule
use gulemath
implicit none
integer(kind=4)::i,j,n_east,n_north,num_grid,nmax
real(kind=8),allocatable::fct(:),gravobv(:)
integer,allocatable:: ifct(:)
complex*16,allocatable:: pnmdata_cpx(:)

write(*,*) "n_east,n_north,num_grid,nmax"
read(*,*) n_east,n_north,num_grid,nmax

allocate(gravobv(num_grid))
allocate(pnmdata_cpx(n_east))
allocate(ifct(6*n_east+150))
allocate(fct(6*n_east+150))

!$acc kernels create(pnmdata_cpx)
!$acc loop   private(pnmdata_cpx)
do i=1, n_north
	pnmdata_cpx=dcmplx(0.D0,0.D0)
	!$acc loop independent
	do j=1, n_east
	        pnmdata_cpx(j)=dcmplx(gravobv((i-1)*n_east+j))
	end do
	call pnmcc(pnmdata_cpx,n_east,ifct,fct)
end do   ! loop i
!$acc end kernels

end ! the main program

In the acc kernels region, I would like to declare the array pnmdata_cpx as private to each iteration of the loop. However, the program fails to compile. The compile command is:

mpif90 -acc -gpu=cc70 -gpu=cuda11.0 -Minfo ggt.f90 -o ggt

and the error messages are shown below:
pnmcc:
5, Generating acc routine seq
Generating Tesla code
13, Memory set idiom, loop replaced by call to __c_mset16
NVFORTRAN-S-0155-Compiler failed to translate accelerator region (see -Minfo messages): No device symbol for address reference (ggt.f90: 40)
gule:
40, Generating create(pnmdata_cpx(:)) [if not already present]
42, Generating Tesla code
42, !$acc loop seq
43, !$acc loop vector(128) ! threadidx%x
45, !$acc loop vector(128) ! threadidx%x
43, Loop is parallelizable
45, Loop is parallelizable
48, Reference argument passing prevents parallelization: n_east
NVFORTRAN-F-0704-Compilation aborted due to previous errors. (ggt.f90)
NVFORTRAN/x86-64 Linux 20.9-0: compilation aborted

However, the program compiles successfully if I remove the clause that declares pnmdata_cpx as private:
!$acc kernels create(pnmdata_cpx)
!$acc loop
do i=1, n_north
end do ! loop i
!$acc end kernels

or if I declare the arguments in the subroutine like this:
SUBROUTINE pnmcc (A,N,Ifct,fct)
!$acc routine seq
IMPLICIT NONE

INTEGER*4          N,Ifct(6*N+150)
COMPLEX*16         A(N)
DOUBLE PRECISION   fct(6*N+150)
INTEGER*4          K,J

  END

Could you please explain the reason behind this and suggest a solution?

Many thanks!

Hi wliang246,

This is actually a known limitation when passing private arrays as assumed-shape arguments. The workaround is to pass them as assumed-size arguments instead. Note that a variable can’t be both shared (i.e. in the “create” clause) and private.

-Mat

 % cat test.f90
module gulemath
contains

     SUBROUTINE pnmcc (A,N,Ifct,fct)
     !$acc routine vector
  IMPLICIT NONE
 INTEGER*4,value   :: N
 INTEGER*4    ::      Ifct(:)
 complex*16   ::      A(*)
 DOUBLE PRECISION   fct(:)
! INTEGER*4 N,Ifct(6*N+150)
! COMPLEX*16 A(N)
! DOUBLE PRECISION fct(6*N+150)
INTEGER*4 K,J
!$acc loop vector
  DO K=1,N
          A(K)=dcmplx(real(N),0.D0)
  ENDDO

  END
end module gulemath

program gule
use gulemath
implicit none
integer(kind=4)::i,j,n_east,n_north,num_grid,nmax
real(kind=8),allocatable::fct(:),gravobv(:)
integer,allocatable:: ifct(:)
complex*16,allocatable:: pnmdata_cpx(:)

write(*,*) "n_east,n_north,num_grid,nmax"
read(*,*) n_east,n_north,num_grid,nmax

allocate(gravobv(num_grid))
allocate(pnmdata_cpx(n_east))
allocate(ifct(6*n_east+150))
allocate(fct(6*n_east+150))

!$acc kernels copyin(ifct,fct)
!$acc loop private(pnmdata_cpx)
do i=1, n_north
        pnmdata_cpx=dcmplx(0.D0,0.D0)
        !$acc loop independent
        do j=1, n_east
                pnmdata_cpx(j)=dcmplx(gravobv((i-1)*n_east+j))
        end do
        call pnmcc(pnmdata_cpx,n_east,ifct,fct)
end do   ! loop i
!$acc end kernels
end ! the main program
% nvfortran -acc -Minfo=accel test.f90
pnmcc:
      4, Generating Tesla code
         16, !$acc loop vector ! threadidx%x
     16, Loop is parallelizable
gule:
     39, Generating implicit copyin(gravobv(:)) [if not already present]
         Generating copyin(ifct(:),fct(:)) [if not already present]
     41, Loop is parallelizable
         Generating Tesla code
         41, !$acc loop gang ! blockidx%x
         42, !$acc loop vector(32) ! threadidx%x
         44, !$acc loop vector(32) ! threadidx%x
     42, Loop is parallelizable
     44, Loop is parallelizable

Thanks, this method works. However, what if the array is two-dimensional, like this:
module gulemath

contains

     SUBROUTINE pnmcc (A,N,Ifct,fct)
     !$acc routine  seq
  IMPLICIT NONE
 INTEGER*4          N,Ifct(:)
 COMPLEX*16         A(:,:)
 DOUBLE PRECISION   fct(:)

! INTEGER*4 N,Ifct(6*N+150)
! COMPLEX*16 A(N)
! DOUBLE PRECISION fct(6*N+150)
INTEGER*4 K,J

  DO K=1,N
	  A(K,1)=N
  ENDDO

  END

end module gulemath

program gule
use gulemath

implicit none

integer(kind=4)::i,j,n_east,n_north,num_grid,nmax
real(kind=8),allocatable::fct(:),gravobv(:)
integer,allocatable:: ifct(:)
complex*16,allocatable:: pnmdata_cpx(:,:)

write(*,*) "n_east,n_north,num_grid,nmax"
read(*,*) n_east,n_north,num_grid,nmax

allocate(gravobv(num_grid))
allocate(pnmdata_cpx(n_east,2))
allocate(ifct(6*n_east+150))
allocate(fct(6*n_east+150))

!$acc kernels 
!$acc loop   private(pnmdata_cpx)
do i=1, n_north
	pnmdata_cpx=dcmplx(0.D0,0.D0)
	!$acc loop independent
	do j=1, n_east
	        pnmdata_cpx(j,1)=dcmplx(gravobv((i-1)*n_east+j))
	end do
	call pnmcc(pnmdata_cpx,n_east,ifct,fct)
end do   ! loop i
!$acc end kernels

end ! the main program

How do I declare it as assumed size, using A(*,*) here? That does not work.

In that case, declare using the bounds:

COMPLEX*16 A(N,2)

You just need to avoid using assumed shape array arguments when passing in private arrays.
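For illustration, here is a minimal sketch of the explicit-shape declaration in a complete program. The names fill2d, work and total and the sizes are hypothetical and are not taken from the code above. It assumes the same pattern as the thread: an allocatable array made private on the loop and passed to a device routine whose dummy argument carries its bounds.

```fortran
module explicit_shape_demo
  implicit none
contains
  ! Hypothetical device routine. The dummy array is explicit shape:
  ! its bounds come from the argument n, not from an array descriptor.
  subroutine fill2d(a, n)
    !$acc routine seq
    integer, intent(in) :: n
    complex(kind=8), intent(inout) :: a(n, 2)
    integer :: k
    do k = 1, n
      a(k, 1) = cmplx(real(n, kind=8), 0.0d0, kind=8)
      a(k, 2) = conjg(a(k, 1))
    end do
  end subroutine fill2d
end module explicit_shape_demo

program explicit_shape_main
  use explicit_shape_demo
  implicit none
  integer, parameter :: n = 4, nrows = 3   ! illustrative sizes only
  integer :: i
  complex(kind=8), allocatable :: work(:, :)
  real(kind=8) :: total(nrows)

  allocate(work(n, 2))

  ! work is private to each iteration; only total is shared.
  !$acc parallel loop private(work) copyout(total)
  do i = 1, nrows
    call fill2d(work, n)
    total(i) = real(work(1, 1), kind=8) + real(i, kind=8)
  end do

  print *, total
end program explicit_shape_main
```

Without OpenACC enabled the directives are plain comments, so the same source also builds and runs as ordinary host Fortran.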

OK. I just wanted to declare the arrays as assumed shape to reduce stack consumption.

Another option is to manually privatize the array by adding another dimension with “n_north” as the size and making the array global. Something like:

module gulemath

contains

   SUBROUTINE pnmcc (A,N,Ifct,fct,idx)
   !$acc routine  seq
  IMPLICIT NONE
 INTEGER*4          N,Ifct(:)
 INTEGER, value :: idx
 COMPLEX*16         A(:,:,:)
 DOUBLE PRECISION   fct(:)
! INTEGER*4 N,Ifct(6*N+150)
! COMPLEX*16 A(N)
! DOUBLE PRECISION fct(6*N+150)
INTEGER*4 K,J

  DO K=1,N
          A(K,1,idx)=N
  ENDDO

  END
end module gulemath

program gule
use gulemath

implicit none

integer(kind=4)::i,j,n_east,n_north,num_grid,nmax
real(kind=8),allocatable::fct(:),gravobv(:)
integer,allocatable:: ifct(:)
complex*16,allocatable:: pnmdata_cpx(:,:,:)

write(*,*) "n_east,n_north,num_grid,nmax"
read(*,*) n_east,n_north,num_grid,nmax

allocate(gravobv(num_grid))
allocate(pnmdata_cpx(n_east,2,n_north))
allocate(ifct(6*n_east+150))
allocate(fct(6*n_east+150))

!$acc kernels create(pnmdata_cpx)
!$acc loop
do i=1, n_north
        pnmdata_cpx(:,:,i)=dcmplx(0.D0,0.D0)   ! zero only this iteration's slice
        !$acc loop independent
        do j=1, n_east
                pnmdata_cpx(j,1,i)=dcmplx(gravobv((i-1)*n_east+j))
        end do
        call pnmcc(pnmdata_cpx,n_east,ifct,fct,i)
end do   ! loop i
!$acc end kernels
end ! the main program

Many thanks, Mat!

That is also a good way to make the array private. By the way, do private arrays in a parallel region use the stack or the heap?

Moreover, applications often call subroutines that use both the heap and the stack. Some of my programs are like this, and they always end with the error “Illegal address during kernel execution”, which is probably due to a lack of heap or stack. Are there any techniques for dealing with this problem? Or is there any literature on this issue?

Neither. Private arrays are stored either in shared memory (when private to a gang and small enough to fit) or as a set of arrays in global memory (when private to a vector, or when a gang-private array doesn’t fit in shared memory). The heap is only used when allocating from device code, including automatic arrays (which should be avoided). The stack holds the arguments passed to a device routine and its local variables (unless the routine gets inlined).

Private scalars are held in registers (when private to a vector) or in shared memory (when private to a gang).
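To make the two levels of privatization concrete, here is a small sketch with one array private to the gang loop and one private to the vector loop. The names gbuf, vbuf and out and all sizes are hypothetical. Where the compiler actually places each copy depends on the sizes and the target, as described above, so treat the comments as the expected case rather than a guarantee.

```fortran
program private_levels
  implicit none
  integer, parameter :: n_outer = 4, n_inner = 64, m = 8   ! illustrative sizes
  integer :: i, j, k
  real(kind=8) :: gbuf(m)   ! one copy per gang (a candidate for shared memory)
  real(kind=8) :: vbuf(m)   ! one copy per vector lane (global memory)
  real(kind=8) :: s         ! private scalar
  real(kind=8) :: out(n_inner, n_outer)

  !$acc parallel loop gang private(gbuf) copyout(out)
  do i = 1, n_outer
    ! Filled once per outer iteration, then only read by the inner loop.
    do k = 1, m
      gbuf(k) = real(i * k, kind=8)
    end do

    !$acc loop vector private(vbuf, s)
    do j = 1, n_inner
      s = 0.0d0
      do k = 1, m
        vbuf(k) = gbuf(k) + real(j, kind=8)
        s = s + vbuf(k)
      end do
      out(j, i) = s
    end do
  end do

  print *, out(1, 1), out(n_inner, n_outer)
end program private_levels
```

Compiling with -Minfo=accel should report how each loop was scheduled, which helps confirm that the private clauses landed on the intended level.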

Some of my programs are like this, and they always end with the error “Illegal address during kernel execution”, which is probably due to a lack of heap or stack.

While it’s possible to get this error from a heap or stack overflow, I’m guessing that in this case you may just need to add “-Mlarge_arrays”. If you have a private array on a vector loop, the compiler allocates one large chunk of memory sized to the number of array elements times the number of vectors. By default, 32-bit offsets are used, so if the total size of the private arrays is > 2GB, you can get the illegal address error. With “-Mlarge_arrays”, the compiler uses 64-bit offsets.
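As a rough way to check whether this limit could be in play, the footprint can be estimated on the host before launching anything. The sketch below uses made-up sizes, and it assumes the quantity that matters is the total byte size of all private copies. The actual number of copies is chosen by the compiler and runtime, so n_copies here is only a placeholder.

```fortran
program private_array_footprint
  use iso_fortran_env, only: int64
  implicit none
  ! Hypothetical values, chosen only to illustrate the arithmetic.
  integer(int64), parameter :: n_elements        = 20000_int64   ! elements per private copy
  integer(int64), parameter :: bytes_per_element = 16_int64      ! complex*16
  integer(int64), parameter :: n_copies          = 8192_int64    ! assumed number of private copies
  integer(int64), parameter :: limit_32bit       = 2_int64**31   ! 2 GB boundary
  integer(int64) :: total_bytes

  ! 64-bit arithmetic so the estimate itself cannot overflow.
  total_bytes = n_elements * bytes_per_element * n_copies

  print *, 'estimated private storage (bytes):', total_bytes
  if (total_bytes > limit_32bit) then
    print *, 'above 2 GB: 32-bit offsets may not be enough, try -Mlarge_arrays'
  else
    print *, 'below 2 GB: 32-bit offsets should be sufficient'
  end if
end program private_array_footprint
```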

Got it. I will try the “-Mlarge_arrays” switch.

Many thanks and best regards!

By the way, what is the reason behind the error message “FATAL ERROR: FORTRAN AUTO ALLOCATION FAILED” when running the program? Is it due to a lack of heap or stack memory?

Thanks!

The error occurs when an automatic array is allocated on the device and the device-side malloc fails. It’s best to avoid automatics in device routines. Besides the heap being small, device-side allocation is serialized, which hurts performance.
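One common way to remove an automatic array is to have the caller supply the scratch space. The sketch below contrasts the two forms. The routine names double_auto and double_work and the sizes are hypothetical. The automatic version is only called on the host here, and the device version receives a work array that is private on the calling loop.

```fortran
module scratch_demo
  implicit none
contains
  ! Automatic array: tmp is sized from n at run time. In a device routine
  ! this kind of local is what requires a device-side allocation.
  subroutine double_auto(a, n)
    integer, intent(in) :: n
    real(kind=8), intent(inout) :: a(n)
    real(kind=8) :: tmp(n)
    tmp = 2.0d0 * a
    a = tmp
  end subroutine double_auto

  ! Same work, but the scratch space is passed in by the caller,
  ! so nothing is allocated inside the routine.
  subroutine double_work(a, tmp, n)
    !$acc routine seq
    integer, intent(in) :: n
    real(kind=8), intent(inout) :: a(n)
    real(kind=8), intent(inout) :: tmp(n)
    integer :: k
    do k = 1, n
      tmp(k) = 2.0d0 * a(k)
    end do
    do k = 1, n
      a(k) = tmp(k)
    end do
  end subroutine double_work
end module scratch_demo

program scratch_main
  use scratch_demo
  implicit none
  integer, parameter :: n = 8, nrows = 4   ! illustrative sizes only
  integer :: i
  real(kind=8) :: a(n, nrows), tmp(n), host_check(n)

  a = 1.0d0
  host_check = 1.0d0
  call double_auto(host_check, n)          ! host-only reference call

  !$acc parallel loop private(tmp) copy(a)
  do i = 1, nrows
    ! a(1, i) passes column i by its first element (sequence association).
    call double_work(a(1, i), tmp, n)
  end do

  print *, a(1, 1), host_check(1)
end program scratch_main
```

The trade-off is a longer argument list in exchange for no heap use on the device.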

OK, many thanks!

I have now changed my code and removed all of the allocations.

I also have another problem with passing arguments, which I will post as a new topic later.
