Can I specify vector length in a kernels region?

jacques.middlecoff · January 21, 2023, 2:15am

I do have automatic arrays.

To get it to run I already had to set:

setenv PGI_ACC_CUDA_HEAPSIZE 67000000

Is that the same thing?

I tried raising NV_ACC_CUDA_HEAPSIZE from 67000000 to 500000000, but that did not fix it. I may try removing the automatic arrays.

Thanks,

Jacques

MatColgrove · January 23, 2023, 4:30pm

Yes, although the older “PGI” prefix is deprecated. “NVCOMPILER” is the official prefix for environment variables, but I prefer the abbreviated “NV”, which is also accepted.

jacques.middlecoff · January 23, 2023, 8:14pm

Hi Mat,

In subroutine mynn_tendencies I changed all 19 automatic arrays to arrays in the calling sequence and, in the calling routine, put them in a private clause. That sped up the entire main loop from 1.33 seconds to 0.90 seconds, which is 68% of the original time. Now I’m going to look for other subroutines with automatic arrays.

Thanks for the great tip!

Jacques

jacques.middlecoff · January 31, 2023, 5:02am

Hi Mat,

I removed all the automatic arrays and it sped up by xx%. I don’t know what’s taking the remaining time but I wonder if it is private arrays. I have a kernels directive on the main loop that I time and that kernels directive specifies 180 private arrays, most dimensioned 128 and some dimensioned 128,10.

jacques.middlecoff · January 31, 2023, 5:12am

Hi Mat,

I removed all the automatic arrays and it sped up by 4X. I don’t know what’s taking the remaining time, but I wonder if it is the private arrays. I time the main loop. On the main loop I have a kernels directive which specifies 180 private arrays, most dimensioned (128) and some dimensioned (128,10). Does that take a lot of start-up time?

Thanks,

Jacques

MatColgrove · January 31, 2023, 4:44pm

Well, the private arrays do need to get allocated. Normally the overhead is not significant, but 180 arrays could take a while. I personally haven’t used this many. Granted, the device memory should get reused, so the allocation time should only impact the first time the kernel is called.

Have you profiled the code? If not, I suggest profiling using Nsight Systems with OpenACC tracing enabled (i.e. “nsys profile -o <output> -t cuda,openacc <executable>”, optionally adding “--stats=true” to see the text output). This will show the device memory allocation time.

-Mat

jacques.middlecoff · May 2, 2023, 3:01am

Hi Mat,

I have a question about pointers. I’m trying to use pointers to replace automatic arrays for the GPU. Specifically, the following code works for mapping a 2D array to a 1D array (t1 to p1) but not for mapping a 3D array to a 2D array (t2 to p2). The compilation error I get is:

NVFORTRAN-S-0155-Illegal POINTER assignment - pointer target must be simply contiguous (module_common.F90: 14)

It appears to me that both cases are simply contiguous. Is there something I’m missing here? Is there a better way to do this?

module common
  implicit none
  save
  real, pointer :: p1(:) , p2(:,:)
  real, allocatable, target :: t1(:,:) , t2(:,:,:)
contains
  subroutine setPointers(index)
    integer,intent(in) :: index
    p1( lbound(t1,1) : ubound(t1,1) ) => t1(lbound(t1,1) : ubound(t1,1) , index)
    p2( lbound(t2,1) : ubound(t2,1) , lbound(t2,2) : ubound(t2,2) ) => t2( lbound(t2,1) : ubound(t2,1) , lbound(t2,2) : ubound(t2,2) , index )
  end subroutine setPointers
end module common

Thanks,

Jacques

jacques.middlecoff · May 2, 2023, 6:49am

I also tried:

UPCHEM(:,:,:) => UPCHEMI(:,:,:,index)

and that gave:

NVFORTRAN-S-0155-Illegal POINTER assignment - illegal implied lowerbound in destination pointer section (/scratch2/BMC/gsd-hpcs/Jacques.Middlecoff/testNewPBL/ref/src/module_DMP_mf_pointers.F90: 85)

NVFORTRAN-S-0155-Illegal POINTER assignment - non-POINTER object (/scratch2/BMC/gsd-hpcs/Jacques.Middlecoff/testNewPBL/ref/src/module_DMP_mf_pointers.F90: 85)

So I tried:

UPCHEM(:,:,:) => UPCHEMI(:,:,:,index:index)

But that gave the same error message.

Jacques

MatColgrove · May 2, 2023, 2:53pm

Hi Jacques,

Try:

p2 => t2( lbound(t2,1) : ubound(t2,1) , lbound(t2,2) : ubound(t2,2) , index )

and

UPCHEM => UPCHEMI(:,:,:,index)

-Mat

jacques.middlecoff · May 2, 2023, 6:48pm

Hi Mat,

UPCHEM => UPCHEMI(:,:,:,index)

Works!

I don’t understand why but that lets me proceed.

Thanks!

Jacques
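For readers wondering why the plain form works: as far as I understand the Fortran 2008 rules, a parenthesized list on the left of `=>` must be either a bounds-spec list (`lb:` in every dimension) or a bounds-remapping list (`lb:ub` in every dimension). A bare `UPCHEM(:,:,:)` is neither, which would explain the "illegal implied lowerbound" message. The remapping form also requires a rank-1 or simply contiguous target, and a section only counts as simply contiguous if every subscript triplet except the last is a bare `:`. That rule is satisfied by `t1(lb:ub, index)` (one triplet) but not by `t2(lb:ub, lb:ub, index)`. The sketch below shows the three legal forms side by side; the program name and array extents are made up for illustration.

```fortran
program pointer_forms
  implicit none
  ! Hypothetical extents, chosen only so the lower bounds are not 1.
  real, allocatable, target :: t2(:,:,:)
  real, pointer :: p2(:,:)
  integer :: index

  allocate(t2(0:3, 5:8, 2))
  t2 = 0.0
  index = 2

  ! 1) Plain pointer assignment (the form suggested in the thread).
  !    The target is an array section, so p2's lower bounds become 1.
  p2 => t2(:, :, index)
  print *, 'plain:       lbound =', lbound(p2)

  ! 2) Bounds-spec list: only lower bounds are given ("lb:").
  !    This keeps the original lower bounds of t2.
  p2(lbound(t2,1):, lbound(t2,2):) => t2(:, :, index)
  print *, 'bounds-spec: lbound =', lbound(p2)

  ! 3) Bounds-remapping list: "lb:ub" in every dimension.
  !    The target must be rank 1 or simply contiguous, so the section
  !    uses bare ":" triplets instead of lbound:ubound ranges.
  p2(lbound(t2,1):ubound(t2,1), lbound(t2,2):ubound(t2,2)) => t2(:, :, index)
  print *, 'remapping:   lbound =', lbound(p2), ' ubound =', ubound(p2)
end program pointer_forms
```

If the code inside the subroutines depends on the original lower bounds, form 2 may be the smallest change, since the plain form resets them to 1.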

jacques.middlecoff · May 20, 2023, 2:06am

Hi Mat,

Another OpenACC question. I have the following routine:

MODULE DMP_mf_pointers
  !This module replaces automatic arrays with pointers for the GPU
  IMPLICIT NONE
  REAL, DIMENSION(:,:,:,:), ALLOCATABLE, TARGET :: UPCHEMI
  REAL, DIMENSION(:,:,:) , POINTER :: UPCHEM
  !$acc declare create( UPCHEM )
  !$acc declare create( UPCHEMI )
CONTAINS
  SUBROUTINE allocate_DMP_mf_pointers(KTS,KTE,NUP,nchem,ITS,ITE)
    !$acc routine seq
    INTEGER, INTENT(IN) :: KTS,KTE,NUP,nchem,ITS,ITE
    ALLOCATE ( UPCHEMI (KTS:KTE+1,1:NUP,1:nchem,ITS:ITE) )
  END SUBROUTINE allocate_DMP_mf_pointers
  SUBROUTINE set_DMP_mf_pointers(index)
    !$acc routine seq
    INTEGER,INTENT(in) :: index
    UPCHEM => UPCHEMI (:,:,:,index)
  END SUBROUTINE set_DMP_mf_pointers
END MODULE DMP_mf_pointers

And when I compile it I get:

1110> pgf90 -acc -Minfo=all -c testptrs.F90
NVFORTRAN-W-1054-Module variables used in acc routine need to be in acc declare create() - upchemi$dev (testptrs.F90: 12)
  0 inform, 1 warnings, 0 severes, 0 fatal for allocate_dmp_mf_pointers
set_dmp_mf_pointers:
     22, Generating acc routine seq
         Generating NVIDIA GPU code

What am I doing wrong?

Thanks,

Jacques

MatColgrove · May 22, 2023, 3:56pm

Allocation and pointer assignment should be done on the host side.

Fortran pointer assignment is not supported on the device. If you do need device pointer assignment, you’ll need to switch to use Cray pointers (which are basically C pointers).

While allocation is supported on the device, it’s not recommended given the device heap is small so a program can easily overflow the heap and it can hurt performance.

Also here, it would be illegal OpenACC. The device and host copies of an array in a data region need to match and an allocate will create both a host and device copy.

Now, you can use the pointer in device code, but you do need to “attach” it so it’s pointing at the correct device address.

Something like the following:

% cat test2.f90
MODULE DMP_mf_pointers

!This module replaces automatic arrays with pointers for the GPU

IMPLICIT NONE

REAL, DIMENSION(:,:,:,:), ALLOCATABLE, TARGET :: UPCHEMI
REAL, DIMENSION(:,:,:) , POINTER :: UPCHEM

!$acc declare create( UPCHEMI )
!$acc declare create( UPCHEM )
CONTAINS

SUBROUTINE allocate_DMP_mf_pointers(KTS,KTE,NUP,nchem,ITS,ITE)
INTEGER, INTENT(IN) :: KTS,KTE,NUP,nchem,ITS,ITE
! OpenACC allocates both the host and device copies of the array
! when the array is in a declare create directive
ALLOCATE ( UPCHEMI (KTS:KTE+1,1:NUP,1:nchem,ITS:ITE) )
END SUBROUTINE allocate_DMP_mf_pointers

SUBROUTINE set_DMP_mf_pointers(index)
INTEGER,INTENT(in) :: index
UPCHEM => UPCHEMI (:,:,:,index)
! Update the device pointer to point at the correct UPCHEMI index
!$acc enter data attach(UPCHEM)
END SUBROUTINE set_DMP_mf_pointers

END MODULE DMP_mf_pointers

program foo

use DMP_mf_pointers

integer :: KTS,KTE,NUP,nchem,ITS,ITE,idx,i,j,k

KTS=1
KTE=64
NUP=64
nchem=64
ITS=1
ITE=64

call allocate_DMP_mf_pointers(KTS,KTE,NUP,nchem,ITS,ITE)
do idx = ITS,ITE
   call set_DMP_mf_pointers(idx)
!$acc parallel loop collapse(3) present(UPCHEM)
   do i=KTS,KTE
   do j=1,NUP
   do k=1,nchem
     UPCHEM(i,j,k) = 1
   enddo
   enddo
   enddo
enddo
!$acc update self(UPCHEMI)
print *, UPCHEMI(1,1,1,1)

end program

% nvfortran test2.f90 -acc -Minfo=accel; a.out
set_dmp_mf_pointers:
     24, Generating enter data attach(upchem)
foo:
     45, Generating NVIDIA GPU code
         46, !$acc loop gang, vector(128) collapse(3) ! blockidx%x threadidx%x
         47,   ! blockidx%x threadidx%x collapsed
         48,   ! blockidx%x threadidx%x collapsed
     54, Generating update self(upchemi(:,:,:,:))
    1.000000

jacques.middlecoff · May 22, 2023, 9:08pm

Hi Mat,

I do need device pointer assignment because the code is of the form:

!$acc kernels(…
DO idx = its,ite
CALL sub(idx,jmax,kmax)

Where subroutine sub is of the form:
SUBROUTINE sub(idx,jmax,kmax)
.
.
.
!!! REAL, DIMENSION(jmax,kmax) :: upchem

call set_pointers(idx) ! Where set_pointers does: upchem => upchemi(:,:,idx)

upchem(1:jmax,1:kmax) = 0.0

But it seems I can’t use Cray pointers either because Cray pointers must be scalars, not arrays.

So it looks like I can’t use pointers to replace automatic arrays like

REAL, DIMENSION(jmax,kmax) :: upchem

Jacques

MatColgrove · May 22, 2023, 9:54pm

Even if you could do pointer assignment, given UPCHEM is a shared module variable, you’d have a race condition. Each thread needs its own private copy.

“So it looks like I can’t use pointers to replace automatic arrays like”

Typically this is done by hoisting the declaration of the automatic array to the caller and then adding it to a “private” clause on the OpenACC compute region. Then pass the private array into the subroutine.

Though it seems that you’re manually privatizing it using UPCHEMI. Why not access UPCHEMI directly? Are you trying to minimize the amount of code changes?
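A minimal sketch of the hoisting pattern described above, under a few assumptions: it uses a `parallel loop` with a `private` clause rather than the `kernels` region from the thread, and the module name `work_mod`, the loop bounds, and the `res` output array are hypothetical stand-ins, not the real model code.

```fortran
module work_mod
  implicit none
contains
  ! upchem used to be an automatic array declared inside this routine;
  ! it is now a dummy argument supplied by the caller.
  subroutine sub(jmax, kmax, upchem, res)
    !$acc routine seq
    integer, intent(in)    :: jmax, kmax
    real,    intent(inout) :: upchem(jmax, kmax)
    real,    intent(out)   :: res
    integer :: j, k
    res = 0.0
    do k = 1, kmax
      do j = 1, jmax
        upchem(j, k) = 1.0          ! scratch work on the private copy
        res = res + upchem(j, k)
      end do
    end do
  end subroutine sub
end module work_mod

program hoist_private
  use work_mod
  implicit none
  integer, parameter :: its = 1, ite = 64, jmax = 128, kmax = 10
  real    :: upchem(jmax, kmax)     ! hoisted declaration
  real    :: res(its:ite)
  integer :: idx

  ! Each gang gets its own copy of upchem, so iterations do not race.
  !$acc parallel loop gang private(upchem) copyout(res)
  do idx = its, ite
    call sub(jmax, kmax, upchem, res(idx))
  end do

  print *, res(its)
end program hoist_private
```

The directives are ordinary comments to a compiler without OpenACC enabled, so the same source also builds and runs serially.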

jacques.middlecoff · May 22, 2023, 10:53pm

I thought I could put UPCHEM in a private statement so each thread would have its own pointer location. But, from what you wrote, apparently not.

YES! I’m trying to minimize the amount of code changes. My first solution, which works, was to pass in UPCHEM(:,:,IDX) through the calling sequence, but there are seven subroutines with a lot of automatic variables and the developers do not like the extensive code changes.

The reason for the IDX index is to reduce the amount of space used by private variables.

Jacques
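The calling-sequence approach mentioned above can be sketched roughly as follows. This is an illustration only: the module name `slice_mod`, the extents, and the use of `parallel loop` with a `copy` clause are assumptions, and the real code keeps its arrays on the device differently.

```fortran
module slice_mod
  implicit none
contains
  ! The former automatic array arrives as an explicit-shape dummy.
  subroutine sub(jmax, kmax, upchem)
    !$acc routine seq
    integer, intent(in)    :: jmax, kmax
    real,    intent(inout) :: upchem(jmax, kmax)
    integer :: j, k
    do k = 1, kmax
      do j = 1, jmax
        upchem(j, k) = 0.0
      end do
    end do
  end subroutine sub
end module slice_mod

program pass_slices
  use slice_mod
  implicit none
  integer, parameter :: its = 1, ite = 64, jmax = 128, kmax = 10
  real, allocatable :: upchemi(:,:,:)
  integer :: idx

  ! One slice per idx, allocated once on the host.
  allocate(upchemi(jmax, kmax, its:ite))
  upchemi = -1.0

  !$acc parallel loop gang copy(upchemi)
  do idx = its, ite
    ! Leading bare ":" subscripts keep the section contiguous,
    ! and each iteration touches a different slice, so there is no race.
    call sub(jmax, kmax, upchemi(:, :, idx))
  end do

  print *, maxval(upchemi)
end program pass_slices
```

The trade-off is the one raised in the thread: every affected subroutine gains an extra argument, but the storage is allocated once rather than per thread.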
