Can I specify vector length in a kernels region?

I do have automatic arrays.

To get it to run I already had to set:

setenv PGI_ACC_CUDA_HEAPSIZE 67000000

Is that the same thing?

I tried setting NV_ACC_CUDA_HEAPSIZE from 67000000 up to 500000000, but that did not fix it. I may try removing the automatic arrays.

Thanks,

Jacques

Yes, although the older "PGI" prefix is deprecated. "NVCOMPILER" is the official prefix for environment variables, but I prefer the abbreviated "NV", which is also acceptable.

Hi Mat,

In subroutine mynn_tendencies I changed all 19 automatic arrays to arrays in the calling sequence and, in the calling routine, put them in a private clause. That sped up the entire main loop from 1.33 seconds to 0.90 seconds, which is 68% of 1.33. Now I'm going to look for other subroutines with automatic arrays.

Thanks for the great tip!

Jacques


Hi Mat,

I removed all the automatic arrays and it sped up by 4X. I don't know what's taking the remaining time, but I wonder if it is the private arrays. I time the main loop. On the main loop I have a kernels directive which specifies 180 private arrays, most dimensioned (128) and some dimensioned (128,10). Does that take a lot of start-up time?

Thanks,

Jacques

Well, the private arrays do need to get allocated. Normally the overhead time is not significant, but 180 arrays could take a while; I personally haven't used this many. Granted, the device memory should get re-used, so the allocation time only impacts the first time the kernel is called.

Have you profiled the code? If not, I suggest profiling with Nsight Systems with OpenACC tracing enabled (i.e. "nsys profile -o <name> -t cuda,openacc", optionally adding "--stats=true" to see the text output). This will show the device memory allocation time.

-Mat

Hi Mat,

I have a question about pointers. I'm trying to use pointers to replace automatic arrays for the GPU. Specifically, the following code works for mapping a 2D array to a 1D array (t1 to p1) but not for mapping a 3D array to a 2D array (t2 to p2). The compilation error I get is:

NVFORTRAN-S-0155-Illegal POINTER assignment - pointer target must be simply contiguous (module_common.F90: 14)

It appears to me that both cases are simply contiguous. Is there something I'm missing here? Is there a better way to do this?

module common
implicit none
save
real, pointer :: p1(:) , p2(:,:)
real, allocatable, target :: t1(:,:) , t2(:,:,:)
contains
subroutine setPointers(index)
integer, intent(in) :: index
p1( lbound(t1,1):ubound(t1,1) ) => t1( lbound(t1,1):ubound(t1,1) , index )
p2( lbound(t2,1):ubound(t2,1) , lbound(t2,2):ubound(t2,2) ) => t2( lbound(t2,1):ubound(t2,1) , lbound(t2,2):ubound(t2,2) , index )
end subroutine setPointers
end module common

Thanks,

Jacques

I also tried:

UPCHEM(:,:,:) => UPCHEMI(:,:,:,index)

and that gave:

NVFORTRAN-S-0155-Illegal POINTER assignment - illegal implied lowerbound in destination pointer section (/scratch2/BMC/gsd-hpcs/Jacques.Middlecoff/testNewPBL/ref/src/module_DMP_mf_pointers.F90: 85)

NVFORTRAN-S-0155-Illegal POINTER assignment - non-POINTER object (/scratch2/BMC/gsd-hpcs/Jacques.Middlecoff/testNewPBL/ref/src/module_DMP_mf_pointers.F90: 85)

So I tried:

UPCHEM(:,:,:) => UPCHEMI(:,:,:,index:index)

But that gave the same error message.

Jacques

Hi Jacques,

Try:

p2 => t2( lbound(t2,1) : ubound(t2,1) , lbound(t2,2) : ubound(t2,2) , index )

and

UPCHEM => UPCHEMI(:,:,:,index)

-Mat

Hi Mat,

UPCHEM => UPCHEMI(:,:,:,index)

Works!

I don't understand why but that lets me proceed.

Thanks!

Jacques

Hi Mat,

Another OpenACC question. I have the following routine:

MODULE DMP_mf_pointers
!This module replaces automatic arrays with pointers for the GPU
IMPLICIT NONE
REAL, DIMENSION(:,:,:,:), ALLOCATABLE, TARGET :: UPCHEMI
REAL, DIMENSION(:,:,:)  , POINTER :: UPCHEM
!$acc declare create( UPCHEM )
!$acc declare create( UPCHEMI )
CONTAINS
SUBROUTINE allocate_DMP_mf_pointers(KTS,KTE,NUP,nchem,ITS,ITE)
!$acc routine seq
INTEGER, INTENT(IN) :: KTS,KTE,NUP,nchem,ITS,ITE
ALLOCATE ( UPCHEMI (KTS:KTE+1,1:NUP,1:nchem,ITS:ITE) )
END SUBROUTINE allocate_DMP_mf_pointers
SUBROUTINE set_DMP_mf_pointers(index)
!$acc routine seq
INTEGER,INTENT(in) :: index
UPCHEM => UPCHEMI (:,:,:,index)
END SUBROUTINE set_DMP_mf_pointers
END MODULE DMP_mf_pointers

And when I compile it I get:

1110> pgf90 -acc -Minfo=all -c testptrs.F90
NVFORTRAN-W-1054-Module variables used in acc routine need to be in acc declare create() - upchemi$dev (testptrs.F90: 12)
  0 inform,   1 warnings,   0 severes,   0 fatal for allocate_dmp_mf_pointers
set_dmp_mf_pointers:
     22, Generating acc routine seq
         Generating NVIDIA GPU code

What am I doing wrong?

Thanks,

Jacques

Allocation and pointer assignment should be done on the host side.

Fortran pointer assignment is not supported on the device. If you do need device pointer assignment, you'll need to switch to Cray pointers (which are basically C pointers).

While allocation is supported on the device, it's not recommended: the device heap is small, so a program can easily overflow the heap, and it can hurt performance.

Also here, it would be illegal OpenACC. The device and host copies of an array in a data region need to match, and an allocate will create both a host and a device copy.

You can still use the pointer in device code, but you do need to "attach" it so it's pointing at the correct device address.
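
For reference, a Cray pointer might look like the following. This is only a sketch (Cray pointers are a non-standard extension supported by nvfortran, and every name here is illustrative, not from the code in this thread); the pointer itself is a scalar holding an address, while the pointee is declared with an array shape:

```fortran
! Sketch only: Cray pointers, an nvfortran/PGI extension.
program cray_ptr_sketch
  implicit none
  real, allocatable, target :: big(:,:)  ! illustrative backing storage
  real :: slice(128)                     ! pointee: has a shape but no storage of its own
  pointer (p, slice)                     ! p is a scalar integer holding an address

  allocate(big(128, 64))
  big = 0.0
  p = loc(big(1, 10))   ! aim slice at column 10 of big
  slice(1) = 1.0        ! actually writes big(1,10)
  print *, big(1, 10)
end program cray_ptr_sketch
```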

Something like the following:

% cat test2.f90
MODULE DMP_mf_pointers

!This module replaces automatic arrays with pointers for the GPU

IMPLICIT NONE

REAL, DIMENSION(:,:,:,:), ALLOCATABLE, TARGET :: UPCHEMI
REAL, DIMENSION(:,:,:) , POINTER :: UPCHEM

!$acc declare create( UPCHEMI )
!$acc declare create( UPCHEM )
CONTAINS

SUBROUTINE allocate_DMP_mf_pointers(KTS,KTE,NUP,nchem,ITS,ITE)
INTEGER, INTENT(IN) :: KTS,KTE,NUP,nchem,ITS,ITE
! OpenACC allocates both the host and device copies of the array
! when the array is in a declare create directive
ALLOCATE ( UPCHEMI (KTS:KTE+1,1:NUP,1:nchem,ITS:ITE) )
END SUBROUTINE allocate_DMP_mf_pointers

SUBROUTINE set_DMP_mf_pointers(index)
INTEGER,INTENT(in) :: index
UPCHEM => UPCHEMI (:,:,:,index)
! Update the device pointer to point at the correct UPCHEMI index
!$acc enter data attach(UPCHEM)
END SUBROUTINE set_DMP_mf_pointers

END MODULE DMP_mf_pointers

program foo

use DMP_mf_pointers

integer :: KTS,KTE,NUP,nchem,ITS,ITE,idx,i,j,k

KTS=1
KTE=64
NUP=64
nchem=64
ITS=1
ITE=64

call allocate_DMP_mf_pointers(KTS,KTE,NUP,nchem,ITS,ITE)
do idx = ITS,ITE
   call set_DMP_mf_pointers(idx)
!$acc parallel loop collapse(3) present(UPCHEM)
   do i=KTS,KTE
   do j=1,NUP
   do k=1,nchem
     UPCHEM(i,j,k) = 1
   enddo
   enddo
   enddo
enddo
!$acc update self(UPCHEMI)
print *, UPCHEMI(1,1,1,1)

end program

% nvfortran test2.f90 -acc -Minfo=accel; a.out
set_dmp_mf_pointers:
     24, Generating enter data attach(upchem)
foo:
     45, Generating NVIDIA GPU code
         46, !$acc loop gang, vector(128) collapse(3) ! blockidx%x threadidx%x
         47,   ! blockidx%x threadidx%x collapsed
         48,   ! blockidx%x threadidx%x collapsed
     54, Generating update self(upchemi(:,:,:,:))
    1.000000

Hi Mat,

I do need device pointer assignment because the code is of the form:

!$acc kernels ...
DO idx = its,ite
   CALL sub(idx,jmax,kmax)

Where subroutine sub is of the form:

SUBROUTINE sub(idx,jmax,kmax)
...
!!! REAL, DIMENSION(jmax,kmax) :: upchem
call set_pointers(idx) ! Where set_pointers does: upchem => upchemi(:,:,idx)
upchem(1:jmax,1:kmax) = 0.0

But it seems I can't use Cray pointers either, because Cray pointers must be scalars, not arrays.

So it looks like I can't use pointers to replace automatic arrays like

REAL, DIMENSION(jmax,kmax) :: upchem

Jacques

Even if you could do pointer assignment, given UPCHEM is a shared module variable, you'd have a race condition. Each thread needs its own private copy.

So it looks like I can't use pointers to replace automatic arrays like

Typically this is done by hoisting the declaration of the automatic array to the caller and then adding it to a "private" clause on the OpenACC compute region. Then pass the private array into the subroutine.
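
A sketch of that pattern (the names sub, work, and the bounds are illustrative, not from the actual code):

```fortran
! Sketch: the automatic array "work" is hoisted out of sub into the
! caller, privatized on the compute construct, and passed down.
subroutine driver(its, ite, kmax)
  integer, intent(in) :: its, ite, kmax
  real :: work(kmax)          ! hoisted: was an automatic array inside sub
  integer :: i
!$acc parallel loop private(work)
  do i = its, ite
     call sub(i, kmax, work)  ! each thread gets its own private copy
  end do
end subroutine driver

subroutine sub(i, kmax, work)
!$acc routine seq
  integer, intent(in) :: i, kmax
  real :: work(kmax)          ! now a dummy argument; no per-call allocation
  work(:) = 0.0
  ! ... compute with work ...
end subroutine sub
```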

Though it seems that you're manually privatizing it using UPCHEMI. Why not access UPCHEMI directly? Are you trying to minimize the amount of code changes?

I thought I could put UPCHEM in a private statement so each thread would have its own pointer location. But, from what you wrote, apparently not.

YES! I'm trying to minimize the amount of code changes. My first solution, which works, was to pass in UPCHEM(:,:,IDX) through the calling sequence, but there are seven subroutines with a lot of automatic variables and the developers do not like the extensive code changes.

The reason for the IDX index is to reduce the amount of space used by private variables.

Jacques