Can I specify vector length in a kernels region?

jacques.middlecoff · January 16, 2023, 2:25am

In a kernels directive how do you specify the vector length?

I tried:
!$acc kernels loop gang(100) vector(128)
DO i=ITS,ITF !Line 4990
DO k=KTS,KTE !KTF

and it gave the error:
NVFORTRAN-S-0155-vector(x) not allowed in a kernels region having vector_length (module_bl_mynn.F90: 4990)

Then I found https://www.openacc.org/sites/default/files/inline-files/OpenACC_Programming_Guide_0.pdf

!$acc kernels
!$acc loop gang
do j=1,M
!$acc loop vector(128)
do i=1,N

!$acc end kernels

but when I try it:

!$acc kernels
!$acc loop gang
DO i=ITS,ITF
!$acc loop vector(128)
DO k=KTS,KTE !Line 4991

I get the same error:
NVFORTRAN-S-0155-vector(x) not allowed in a kernels region having vector_length (module_bl_mynn.F90: 4991)

I don’t know why the compiler thinks vector_length has been specified. What am I doing wrong?

Thanks,

Jacques

MatColgrove · January 17, 2023, 5:40pm

Hi Jacques,

Do you have a minimal reproducing example that shows the error, including the compilation line, as well as the compiler version that you’re using?

“vector(N)” should be fine within a kernels region so it’s unclear what’s wrong here. For example, it works correctly in this example:

% cat test.f90
  program main

     integer, allocatable, dimension(:,:) :: Arr
     integer :: N,M, i, j
     N = 64
     M = 64
     allocate(Arr(N,M))

!$acc data copyout(Arr)
!$acc kernels
!$acc loop gang
     do i=1,N
!$acc loop vector(128)
       do j=1,M
          Arr(i,j) = ((i-1)*N)+j
       enddo
    enddo
!$acc end kernels
!$acc end data
    print *, Arr(2,:)
    deallocate(Arr)
  end
% nvfortran test.f90 -Minfo=accel -acc
main:
      9, Generating copyout(arr(:,:)) [if not already present]
     12, Loop is parallelizable
     14, Loop is parallelizable
         Generating NVIDIA GPU code
         12, !$acc loop gang, vector(4) ! blockidx%x threadidx%x
         14, !$acc loop vector(128) ! threadidx%y
             Interchanging generated vector loop outwards

Mat

jacques.middlecoff · January 17, 2023, 6:18pm

Hi Mat,

Thanks for the timely response.

I don’t have a simple example but I could generate one.

The compiler I’m using is:

module load cuda/10.1 nvhpc/22.2

The question I’m most interested in is this: Can the vector length be specified in the kernels directive such as:

!$acc kernels loop gang(100) vector(128)

Thanks,

Jacques

MatColgrove · January 17, 2023, 6:46pm

Sure. Typically it’s better to let the compiler determine the schedule to use based on the loop trip count, especially the number of gangs, but it’s valid.

The error above indicates that the code was using both “vector(128)” and “vector_length(128)”, but your post didn’t show this. I’m guessing there’s some missing information, which is why I was asking for a full example.

jacques.middlecoff · January 17, 2023, 6:57pm

The reason I want to specify the vector length is this: The compiler chose a vector length of 32, but all my loops run from 1 to 128, so I want to try a vector length of 128. Is that a good idea?

I’ll construct a simple example.

Thanks,

Jacques

MatColgrove · January 17, 2023, 7:51pm

For single level vector loops, the compiler typically uses a vector length of 128. It only goes lower if the loop trip count is known at compile time and smaller than 128.

However, when it splits vector across multiple loops (via tiling), it will use a 32x4 block, which is still 128 in total. Could this be the case here? The compiler feedback messages (i.e. -Minfo=accel) will let you know the schedule in use.

For example, if I change “vector(128)” to just “vector” in the above example, the compiler uses a 32x4 vector block and parallelizes both loops across gangs and vectors. Here’s the compiler feedback:

% nvfortran -acc -Minfo=accel test.f90
main:
      9, Generating copyout(arr(:,:)) [if not already present]
     11, Loop is parallelizable
     13, Loop is parallelizable
         Generating NVIDIA GPU code
         11, !$acc loop gang, vector(32) ! blockidx%x threadidx%x
         13, !$acc loop gang, vector(4) ! blockidx%y threadidx%y

jacques.middlecoff · January 17, 2023, 9:19pm

Hi Mat,

I compiled with

nvfortran -acc -Minfo=accel machine.F90 physcons.F90 module_bl_mynn.F90 > & cop

Here is the main loop (I loop) in the code. The I loop is line 4986 and the first K loop is line 4987:

!$acc kernels loop gang private( qcn,thvn,qsq1,qnwfa1,qv1,sh,el,det_thl,qc1,qi1, &
.
!$acc Du1, Dv1, Dth1, Dqv1 )
DO i=ITS,ITF
DO k=KTS,KTE !KTF .
.

and the compilation output has

4986, Loop is parallelizable
    Generating NVIDIA GPU code
    4986, !$acc loop gang ! blockidx%x
    4987, !$acc loop vector(32) ! threadidx%x
    5294, !$acc loop vector(32) ! threadidx%x
    5393, !$acc loop vector(32) ! threadidx%x
    5474, !$acc loop vector(32) ! threadidx%x
    5543, !$acc loop vector(32) ! threadidx%x
    5568, !$acc loop vector(32) ! threadidx%x
4987, Loop is parallelizable
5294, Loop is parallelizable
5393, Loop is parallelizable
5474, Loop is parallelizable
5543, Loop is parallelizable
5568, Loop is parallelizable

So it’s not quite like your example. Is it splitting across all six K loops?

Thanks

Jacques

MatColgrove · January 17, 2023, 10:37pm

No, it looks like you have 6 inner loops. So each loop is getting parallelized across the vectors, but the compiler isn’t splitting them between gangs and vectors.

I’m not sure why it’s only using 32 vectors. Typically it would only do that if the loop trip count of the “k” loops is known to be small or if you’re calling a vector routine. It might be something else as well, but I’d need to see the code to determine that.

Though, back to your original question, this would be a case where you can add “vector_length” to the kernels directive so the length is applied to all the vector loops, as opposed to using “vector(128)” on each of the individual loops. You just can’t use both vector_length(128) and vector(128) together.
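
As a rough sketch of that approach (the program, array names and sizes below are hypothetical and not taken from module_bl_mynn.F90), the length is requested once on the kernels directive and the inner loops use a plain “vector” clause with no argument. The compiler may still choose a different length:

```fortran
program vlen_sketch
  implicit none
  integer, parameter :: n = 1024, m = 128   ! hypothetical sizes
  real :: a(m, n), b(m, n)
  integer :: i, k

  a = 1.0
  b = 0.0

  ! Request one vector length for every vector loop in the region.
  ! The inner loops use "vector" without an argument, because combining
  ! vector(128) with vector_length(128) is what triggers the error above.
  !$acc kernels loop gang vector_length(128) copyin(a) copyout(b)
  do i = 1, n
     !$acc loop vector
     do k = 1, m
        b(k, i) = 2.0 * a(k, i)
     end do
     !$acc loop vector
     do k = 1, m
        b(k, i) = b(k, i) + a(k, i)
     end do
  end do

  print *, b(1, 1), b(m, n)
end program vlen_sketch
```

Compiling with -acc -Minfo=accel shows which vector length was actually selected for each inner loop.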

jacques.middlecoff · January 17, 2023, 11:44pm

I compiled my code containing the kernels directive

!$acc kernels loop gang private( qcn,thvn,qsq1,qnwfa1,qv1,sh,el,det_thl,qc1,qi1, &

with

nvfortran -acc -Minfo=accel machine.F90 physcons.F90 module_bl_mynn.F90 > & cop1

Then I changed the kernels directive to

!$acc kernels loop gang vector_length(128) private( qcn,thvn,qsq1,qnwfa1,qv1,sh,el,det_thl,qc1,qi1, &

and compiled with

nvfortran -acc -Minfo=accel machine.F90 physcons.F90 module_bl_mynn.F90 > & cop2

but apparently nvfortran ignored vector_length(128) because cop1 and cop2 are identical, both having vector lengths of 32.

Jacques

MatColgrove · January 18, 2023, 4:15pm

“vector_length” and “vector(N)” are just suggestions to the compiler, which is free to ignore them if there’s a reason.

You can try using “parallel” instead of “kernels” so the compiler has less freedom, but it still may override.
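
A minimal sketch of the “parallel” form, assuming a hypothetical array and loop bounds rather than the real code. With “parallel” the schedule in the directives is what the programmer prescribes, although the compiler feedback should still be checked to see what was actually used:

```fortran
program parallel_sketch
  implicit none
  integer, parameter :: n = 1024, m = 128   ! hypothetical sizes
  real :: a(m, n)
  integer :: i, k

  ! Outer loop across gangs, inner loop across vector lanes,
  ! with a requested vector length of 128 for the whole region.
  !$acc parallel loop gang vector_length(128) copyout(a)
  do i = 1, n
     !$acc loop vector
     do k = 1, m
        a(k, i) = real(k + (i - 1) * m)
     end do
  end do

  print *, a(1, 1), a(m, n)
end program parallel_sketch
```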

Exactly why it’s using a vector length of 32 here, I’m not sure.

jacques.middlecoff · January 18, 2023, 4:44pm

The outer I loop has a trip count of 10240. Could it be using a vector length of 32 because it uses a large number of gangs? The compiler output did not show how many gangs were being used.

MatColgrove · January 18, 2023, 5:09pm

The number of gangs typically isn’t fixed but instead dynamically set at runtime depending on the loop trip count.

What’s the loop trip count of the inner loops? Is it known at compile time (i.e. set via parameters)?

jacques.middlecoff · January 18, 2023, 5:25pm

The trip count on the inner loops is 128. It’s set in a main routine that calls the working routine. It’s set by a regular Fortran expression (levs=128). I could try setting it as a parameter.

jacques.middlecoff · January 19, 2023, 5:13am

Hi Mat,

I made the inner loop trip count 128 by specifying it directly (DO K=1,128) and the vector length was still 32. So I began to think the cause must be the loop itself. The special thing about the outer I loop is that it is very large. It contains many large inner K loops and calls five large subroutines which themselves contain large K loops. So could it be the large size of the outer I loop that causes the compiler to choose a vector length of 32?

Thanks,

Jacques

MatColgrove · January 19, 2023, 6:24pm

So could it be the large size of the outer I loop that causes the compiler to choose a vector length of 32?

No, not unless it’s also applying vector to the outer loop, and given the output you shared, that doesn’t appear to be the case. If you have a “routine vector” function call, that would do it as well, but that doesn’t seem to be the case here either.

If you can get me a reproducing example, I might be able to determine why. Otherwise I’m not sure.

jacques.middlecoff · January 19, 2023, 6:33pm

I’ll try removing parts of the main loop to see how small I can get it and still have vector(32). Might be instructive to see when (if) vector(32) changes to vector(128).

Jacques

jacques.middlecoff · January 19, 2023, 8:14pm

Hi Mat,

First I made the I loop (starting at 4986) and the inner K loops explicit: DO I=1,10240 and DO K=1,128.
Then in the main I loop I removed the subroutine calls one by one and when I removed the last one the compiler output changed from:

4986, Loop is parallelizable
    Generating NVIDIA GPU code
    4986, !$acc loop gang ! blockidx%x
    4987, !$acc loop vector(32) ! threadidx%x
    5294, !$acc loop vector(32) ! threadidx%x
    5393, !$acc loop vector(32) ! threadidx%x
    5474, !$acc loop vector(32) ! threadidx%x
    5543, !$acc loop vector(32) ! threadidx%x
    5568, !$acc loop vector(32) ! threadidx%x

to

4986, Loop is parallelizable
    Generating NVIDIA GPU code
    4986, !$acc loop gang, vector(128) ! blockidx%x threadidx%x
    4987, !$acc loop seq
    5407, !$acc loop seq
    5492, !$acc loop seq
    5561, !$acc loop seq
    5586, !$acc loop seq

so it’s not as straightforward as I had imagined. Is it reasonable to vectorize the I loop and make the inner loops seq?

Thanks,

Jacques

MatColgrove · January 19, 2023, 9:08pm

Ok, so you do have subroutine calls in the loop, which I assume are decorated with “!$acc routine vector”. Use of vector routines forces the vector length to be 32, in order to support reductions in the routines and to reduce the need for thread synchronization.

You’ll need to choose between changing these to “routine seq” and removing “loop vector” from the parallel loops in the routines, or keeping the loops in the main body at vector length 32.
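
The two options can be sketched roughly as follows. The module, routine names and sizes are hypothetical stand-ins, not the actual MYNN routines. The first routine keeps “routine vector” and is called from a gang-only loop; the second is “routine seq” and is called from a loop that is itself spread across gangs and vector lanes:

```fortran
module column_ops
  implicit none
contains
  ! Option 1: vector routine; the k loop is shared by the vector lanes.
  subroutine scale_col_vec(col, m)
    !$acc routine vector
    integer, intent(in) :: m
    real, intent(inout) :: col(m)
    integer :: k
    !$acc loop vector
    do k = 1, m
       col(k) = 2.0 * col(k)
    end do
  end subroutine scale_col_vec

  ! Option 2: sequential routine; each calling thread runs the whole k loop.
  subroutine scale_col_seq(col, m)
    !$acc routine seq
    integer, intent(in) :: m
    real, intent(inout) :: col(m)
    integer :: k
    do k = 1, m
       col(k) = 2.0 * col(k)
    end do
  end subroutine scale_col_seq
end module column_ops

program routine_sketch
  use column_ops
  implicit none
  integer, parameter :: n = 1024, m = 128   ! hypothetical sizes
  real :: a(m, n)
  integer :: i

  a = 1.0
  !$acc data copy(a)

  ! Option 1: gangs over i, vector parallelism inside the routine.
  !$acc parallel loop gang
  do i = 1, n
     call scale_col_vec(a(:, i), m)
  end do

  ! Option 2: gangs and vector lanes over i, routine runs sequentially.
  !$acc parallel loop gang vector
  do i = 1, n
     call scale_col_seq(a(:, i), m)
  end do

  !$acc end data
  print *, a(1, 1), a(m, n)
end program routine_sketch
```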

-Mat

jacques.middlecoff · January 20, 2023, 1:50am

I thought I would try the other way, so I changed all “routine vector” to “routine seq”, removed all the “!$acc loop vector” directives, and left the kernels directive the same:

!$acc kernels loop gang private( qcn,thvn,qsq1,qnwfa1,qv1,sh,el,det_thl,qc1,qi1, &

The compilation looked good:

4984, !$acc loop gang, vector(128) ! blockidx%x threadidx%x
4985, !$acc loop seq
5292, !$acc loop seq
5391, !$acc loop seq
5472, !$acc loop seq
5541, !$acc loop seq
5566, !$acc loop seq

But when I ran it I got:

Calling init
Calling run
FATAL ERROR: FORTRAN AUTO ALLOCATION FAILED
Failing in Thread:1
call to cuStreamSynchronize returned error 719: Launch failed (often invalid pointer dereference)

Jacques

MatColgrove · January 20, 2023, 3:49pm

Given the error, I’m assuming that you have automatic arrays in your device subroutines?

While supported on the device, use of automatics on the device is discouraged. Automatics are implicitly allocated upon entry into the subroutine. Besides being slow due to serialization of the allocation, the default heap size on the device is quite small (~8MB) which can lead to a heap overflow. This is likely what’s happening here.

You can increase the heap size by setting the environment variable “NV_ACC_CUDA_HEAPSIZE”, or revert back to using routine vector (in which case only one array per gang is allocated as opposed to one per thread), however, the performance issue may still occur (though less so in the latter case).

You can try making them fixed size, though depending on the size, you may then start to encounter stack overflows. Another method would be to make the arrays private on the compute region in the main loop and then pass them into the subroutine.
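
A minimal sketch of the last suggestion, using hypothetical names (smooth_col, tmp) and sizes. The scratch array is declared by the caller, listed in a private clause on the compute region, and passed into the device routine as an argument, so the routine has no automatic array to allocate from the device heap:

```fortran
module work_ops
  implicit none
contains
  ! tmp is a dummy argument supplied by the caller. The discouraged
  ! alternative would be a local automatic array such as "real :: tmp(m)".
  subroutine smooth_col(col, tmp, m)
    !$acc routine seq
    integer, intent(in) :: m
    real, intent(inout) :: col(m)
    real, intent(inout) :: tmp(m)
    integer :: k
    do k = 1, m
       tmp(k) = col(k)
    end do
    do k = 2, m - 1
       col(k) = (tmp(k-1) + tmp(k) + tmp(k+1)) / 3.0
    end do
  end subroutine smooth_col
end module work_ops

program private_sketch
  use work_ops
  implicit none
  integer, parameter :: n = 1024, m = 128   ! hypothetical sizes
  real :: a(m, n)
  real :: tmp(m)                            ! scratch space, privatized below
  integer :: i

  a = 1.0

  ! Each loop iteration gets its own private copy of tmp.
  !$acc parallel loop gang vector private(tmp) copy(a)
  do i = 1, n
     call smooth_col(a(:, i), tmp, m)
  end do

  print *, a(1, 1), a(m/2, n)
end program private_sketch
```

The sketch uses “parallel loop” for brevity; the same private clause appears on the “kernels loop” directive used earlier in this thread.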

-Mat
