Can I specify vector length in a kernels region?

In a kernels directive how do you specify the vector length?

I tried:
!$acc kernels loop gang(100) vector(128)
DO i=ITS,ITF !Line 4990
DO k=KTS,KTE !KTF

and it gave the error:
NVFORTRAN-S-0155-vector(x) not allowed in a kernels region having vector_length (module_bl_mynn.F90: 4990)

Then I found https://www.openacc.org/sites/default/files/inline-files/OpenACC_Programming_Guide_0.pdf

!$acc kernels
!$acc loop gang
do j=1,M
!$acc loop vector(128)
do i=1,N

!$acc end kernels

but when I try it:

!$acc kernels
!$acc loop gang
DO i=ITS,ITF
!$acc loop vector(128)
DO k=KTS,KTE !Line 4991

I get the same error:
NVFORTRAN-S-0155-vector(x) not allowed in a kernels region having vector_length (module_bl_mynn.F90: 4991)

I don’t know why the compiler thinks vector_length has been specified. What am I doing wrong?

Thanks,

Jacques

Hi Jacques,

Do you have a minimal reproducing example that shows the error, including the compilation line, as well as the compiler version that you’re using?

“vector(N)” should be fine within a kernels region so it’s unclear what’s wrong here. For example, it works correctly in this example:

% cat test.f90
  program main

     integer, allocatable, dimension(:,:) :: Arr
     integer :: N,M, i, j
     N = 64
     M = 64
     allocate(Arr(N,M))

!$acc data copyout(Arr)
!$acc kernels
!$acc loop gang
     do i=1,N
!$acc loop vector(128)
       do j=1,M
          Arr(i,j) = ((i-1)*N)+j
       enddo
    enddo
!$acc end kernels
!$acc end data
    print *, Arr(2,:)
    deallocate(Arr)
  end
% nvfortran test.f90 -Minfo=accel -acc
main:
      9, Generating copyout(arr(:,:)) [if not already present]
     12, Loop is parallelizable
     14, Loop is parallelizable
         Generating NVIDIA GPU code
         12, !$acc loop gang, vector(4) ! blockidx%x threadidx%x
         14, !$acc loop vector(128) ! threadidx%y
             Interchanging generated vector loop outwards
-Mat

Hi Mat,

Thanks for the timely response.

I don’t have a simple example but I could generate one.

The compiler I’m using is:

module load cuda/10.1 nvhpc/22.2

The question I’m most interested in is this: Can the vector length be specified in the kernels directive such as:

!$acc kernels loop gang(100) vector(128)

Thanks,

Jacques

Jacques

Sure. Typically it’s better to let the compiler determine the schedule to use based on the loop trip count, especially the number of gangs, but it’s valid.

The error above indicates that the code was using both “vector(128)” and “vector_length(128)”, but your post didn’t show this. I’m guessing there’s some missing information, which is why I was asking for a full example.

The reason I want to specify the vector length is this: The compiler chose a vector length of 32 but all my loops are from 1,128 so I want to try a vector length of 128. Is that a good idea?

I’ll construct a simple example.

Thanks,

Jacques

For single level vector loops, the compiler typically uses a vector length of 128. It only goes lower if the loop trip count is known at compile time and smaller than 128.

However, when it splits vector across multiple loops (via tiling), it will use a 32x4 block, which is still 128 total. Could this be the case here? The compiler feedback messages (i.e. -Minfo=accel) will let you know the schedule in use.

For example, if I change “vector(128)” to just “vector” in the above example, the compiler uses a 32x4 vector block and parallelizes both loops across gangs and vectors. Here’s the compiler feedback:

% nvfortran -acc -Minfo=accel test.f90
main:
      9, Generating copyout(arr(:,:)) [if not already present]
     11, Loop is parallelizable
     13, Loop is parallelizable
         Generating NVIDIA GPU code
         11, !$acc loop gang, vector(32) ! blockidx%x threadidx%x
         13, !$acc loop gang, vector(4) ! blockidx%y threadidx%y

Hi Mat,

I compiled with

nvfortran -acc -Minfo=accel machine.F90 physcons.F90 module_bl_mynn.F90 >& cop

Here is the main loop (the I loop) in the code. The I loop is at line 4986 and the first K loop is at line 4987:

!$acc kernels loop gang private( qcn,thvn,qsq1,qnwfa1,qv1,sh,el,det_thl,qc1,qi1, &
.
.
.
!$acc Du1, Dv1, Dth1, Dqv1 )
DO i=ITS,ITF
DO k=KTS,KTE !KTF
.
.
.

and the compilation output has

   4986, Loop is parallelizable
         Generating NVIDIA GPU code
       4986, !$acc loop gang ! blockidx%x
       4987, !$acc loop vector(32) ! threadidx%x
       5294, !$acc loop vector(32) ! threadidx%x
       5393, !$acc loop vector(32) ! threadidx%x
       5474, !$acc loop vector(32) ! threadidx%x
       5543, !$acc loop vector(32) ! threadidx%x
       5568, !$acc loop vector(32) ! threadidx%x
   4987, Loop is parallelizable
   5294, Loop is parallelizable
   5393, Loop is parallelizable
   5474, Loop is parallelizable
   5543, Loop is parallelizable
   5568, Loop is parallelizable

So it’s not quite like your example. Is it splitting across all six K loops?

Thanks

Jacques

No, it looks like you have 6 inner loops. So each loop is getting parallelized across the vectors, but the compiler isn’t splitting them between gangs and vectors.

I’m not sure why it’s only using 32 vectors. Typically it would only do that if the loop trip count of the “k” loops is known to be small or if you’re calling a vector routine. It might be something else as well, but I’d need to see the code to determine that.

Though back to your original question, this would be a case where you can add “vector_length” to the kernels directive so the length is applied to all the vector loops, as opposed to using “vector(128)” on each of the individual loops. You just can’t use both vector_length(128) and vector(128) together.
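A minimal sketch of that suggestion, assuming a toy array and loop bounds that are not from the original code (the names a, b, n, and m are made up for illustration). The vector length is requested once with vector_length(128) on the kernels directive, and the inner loops use a plain “vector” clause with no argument, which avoids the conflict reported by the compiler. The compiler may still pick a different length, so it is worth checking the -Minfo=accel output.

```fortran
program vlen_demo
  implicit none
  ! Hypothetical sizes: 128 matches the inner trip count discussed in the thread
  integer, parameter :: n = 256, m = 128
  real :: a(m, n), b(m, n)
  integer :: i, k

  a = 1.0
  b = 0.0

  ! vector_length(128) is a request that applies to every vector loop in the region
  !$acc kernels loop gang vector_length(128) copyin(a) copyout(b)
  do i = 1, n
     ! plain "vector" here: no vector(128), since that cannot be combined with vector_length
     !$acc loop vector
     do k = 1, m
        b(k, i) = 2.0 * a(k, i)
     end do
     !$acc loop vector
     do k = 1, m
        b(k, i) = b(k, i) + 1.0
     end do
  end do

  ! every element of b should be 3.0
  print *, b(1, 1), b(m, n)
end program vlen_demo
```

Compiling with the same flags used elsewhere in this thread (nvfortran -acc -Minfo=accel) shows which schedule was actually chosen for each loop.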

I compiled my code containing the kernels directive

!$acc kernels loop gang private( qcn,thvn,qsq1,qnwfa1,qv1,sh,el,det_thl,qc1,qi1, &

with

nvfortran -acc -Minfo=accel machine.F90 physcons.F90 module_bl_mynn.F90 >& cop1

Then I changed the kernels directive to

!$acc kernels loop gang vector_length(128) private( qcn,thvn,qsq1,qnwfa1,qv1,sh,el,det_thl,qc1,qi1, &

and compiled with

nvfortran -acc -Minfo=accel machine.F90 physcons.F90 module_bl_mynn.F90 >& cop2

but apparently nvfortran ignored vector_length(128), because cop1 and cop2 are identical, both having vector lengths of 32.

Jacques

“vector_length” and “vector(N)” are just suggestions to the compiler, which is free to ignore them if there’s a reason.

You can try using “parallel” instead of “kernels” so the compiler has less freedom, but it still may override.

Exactly why it’s using a vector length of 32 here, I’m not sure.
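As a rough illustration of the “parallel” suggestion above, here is the same kind of loop nest written with “parallel loop” instead of “kernels loop”. The array and sizes are assumptions for the sketch, not taken from module_bl_mynn.F90, and the compiler may still override the requested length.

```fortran
program parallel_vlen
  implicit none
  ! Hypothetical sizes chosen only for illustration
  integer, parameter :: n = 256, m = 128
  real :: a(m, n)
  integer :: i, k

  a = 0.0

  ! With "parallel" the programmer, not the compiler, asserts the loops are parallel
  !$acc parallel loop gang vector_length(128) copy(a)
  do i = 1, n
     !$acc loop vector
     do k = 1, m
        a(k, i) = real(i + k)
     end do
  end do

  print *, a(1, 1), a(m, n)
end program parallel_vlen
```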

The outer I loop is 10240. Could it be using a vector length of 32 because it uses a large number of gangs? The compiler output did not show how many gangs were being used.

The number of gangs typically isn’t fixed but is instead set dynamically at runtime depending on the loop trip count.

What’s the loop trip count of the inner loops? Is it known at compile time (i.e. set via parameters)?

The trip count on the inner loops is 128. It’s set in a main routine that calls the working routine. It’s set by a regular Fortran expression (levs=128). I could try setting it as a parameter.

Hi Mat,

I made the inner loop trip count 128 by specifying it directly (DO K=1,128) and the vector length was still 32. So I began to think the cause must be the loop itself. The special thing about the outer I loop is that it is very large. It contains many large inner K loops and calls five large subroutines which themselves contain large K loops. So could it be the large size of the outer I loop that causes the compiler to choose a vector length of 32?

Thanks,

Jacques

So could it be the large size of the outer I loop that causes the compiler to choose a vector length of 32?

No, not unless it’s also applying vector to the outer loop, and given the output you shared, that doesn’t appear to be the case. If you have a “routine vector” function call, that would do it as well, but that doesn’t seem to be the case here either.

If you can get me a reproducing example, I might be able to determine why. Otherwise I’m not sure.

I’ll try removing parts of the main loop to see how small I can get it and still have vector(32). Might be instructive to see when (if) vector(32) changes to vector(128).

Jacques

Hi Mat,

First I made the I loop (starting at 4986) and the inner K loops explicit: DO I=1,10240 and DO K=1,128.
Then in the main I loop I removed the subroutine calls one by one and when I removed the last one the compiler output changed from:

   4986, Loop is parallelizable
         Generating NVIDIA GPU code
       4986, !$acc loop gang ! blockidx%x
       4987, !$acc loop vector(32) ! threadidx%x
       5294, !$acc loop vector(32) ! threadidx%x
       5393, !$acc loop vector(32) ! threadidx%x
       5474, !$acc loop vector(32) ! threadidx%x
       5543, !$acc loop vector(32) ! threadidx%x
       5568, !$acc loop vector(32) ! threadidx%x

to

   4986, Loop is parallelizable
         Generating NVIDIA GPU code
       4986, !$acc loop gang, vector(128) ! blockidx%x threadidx%x
       4987, !$acc loop seq
       5407, !$acc loop seq
       5492, !$acc loop seq
       5561, !$acc loop seq
       5586, !$acc loop seq

so it’s not as straightforward as I had imagined. Is it reasonable to vectorize the I loop and make the inner loops seq?

Thanks,

Jacques

Ok, so you do have subroutine calls in the loop, which I assume are decorated with “!$acc routine vector”. Use of vector routines forces the vector length to be 32, both to support reductions in the routines and to reduce the need for thread synchronization.

You’ll need to choose between changing these to “routine seq” and removing “loop vector” from the parallel loops in the routines, or keeping the loops in the main body at vector length 32.
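A minimal sketch of the “routine seq” option, assuming a made-up subroutine (scale_column) and toy sizes that are not from the original code. The routine runs sequentially inside each thread, so the outer loop is free to take a vector length other than 32.

```fortran
module col_mod
  implicit none
contains
  ! Hypothetical device routine: each calling thread runs the k loop sequentially
  subroutine scale_column(col, m)
    !$acc routine seq
    integer, intent(in) :: m
    real, intent(inout) :: col(m)
    integer :: k
    do k = 1, m              ! no "!$acc loop vector" inside a seq routine
       col(k) = 2.0 * col(k)
    end do
  end subroutine scale_column
end module col_mod

program routine_seq_demo
  use col_mod
  implicit none
  ! Hypothetical sizes chosen only for illustration
  integer, parameter :: n = 1024, m = 128
  real :: a(m, n)
  integer :: i

  a = 1.0

  ! The outer loop is spread across gangs and vectors; each thread handles one column
  !$acc kernels loop gang vector(128) copy(a)
  do i = 1, n
     call scale_column(a(:, i), m)
  end do

  ! every element of a should be 2.0
  print *, a(1, 1), a(m, n)
end program routine_seq_demo
```

With the other option, the routine would keep “!$acc routine vector” and its inner “!$acc loop vector”, and the calling loop would stay at gang with a vector length of 32.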

-Mat

I thought I would try the other way, so I changed all “routine vector” to “routine seq”, removed all the “!$acc loop vector”, and left the kernels directive the same:

!$acc kernels loop gang private( qcn,thvn,qsq1,qnwfa1,qv1,sh,el,det_thl,qc1,qi1, &

The compilation looked good:

4984, !$acc loop gang, vector(128) ! blockidx%x threadidx%x
4985, !$acc loop seq
5292, !$acc loop seq
5391, !$acc loop seq
5472, !$acc loop seq
5541, !$acc loop seq
5566, !$acc loop seq

But when I ran it I got:

Calling init
Calling run
FATAL ERROR: FORTRAN AUTO ALLOCATION FAILED
FATAL ERROR: FORTRAN AUTO ALLOCATION FAILED
Failing in Thread:1
call to cuStreamSynchronize returned error 719: Launch failed (often invalid pointer dereference)

Jacques

Given the error, I’m assuming that you have automatic arrays in your device subroutines?

While supported on the device, use of automatics on the device is discouraged. Automatics are implicitly allocated upon entry into the subroutine. Besides being slow due to serialization of the allocation, the default heap size on the device is quite small (~8MB) which can lead to a heap overflow. This is likely what’s happening here.

You can increase the heap size by setting the environment variable “NV_ACC_CUDA_HEAPSIZE”, or revert back to using routine vector (in which case only one array per gang is allocated as opposed to one per thread), however, the performance issue may still occur (though less so in the latter case).

You can try making them fixed size, though depending on the size, you may then start to encounter stack overflows. Another method would be to make the arrays private on the compute region in the main loop and then pass them into the subroutine.
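A minimal sketch of the last suggestion, assuming a made-up routine (smooth_column) and scratch array (tmp) that are not from the original code. The scratch array that would otherwise be an automatic inside the subroutine is declared by the caller, listed in a private clause on the compute region, and passed in as an argument, so nothing is allocated from the device heap on entry to the routine.

```fortran
module work_mod
  implicit none
contains
  ! Hypothetical device routine: tmp is passed in by the caller
  ! instead of being declared as an automatic array "real :: tmp(m)"
  subroutine smooth_column(col, tmp, m)
    !$acc routine seq
    integer, intent(in) :: m
    real, intent(inout) :: col(m)
    real, intent(inout) :: tmp(m)
    integer :: k
    do k = 1, m
       tmp(k) = col(k) + 1.0
    end do
    do k = 1, m
       col(k) = tmp(k)
    end do
  end subroutine smooth_column
end module work_mod

program private_scratch_demo
  use work_mod
  implicit none
  ! Hypothetical sizes chosen only for illustration
  integer, parameter :: n = 1024, m = 128
  real :: a(m, n)
  real :: tmp(m)             ! fixed-size scratch, privatized below
  integer :: i

  a = 1.0

  ! private(tmp) gives each thread its own copy of the scratch array
  !$acc kernels loop gang vector(128) private(tmp) copy(a)
  do i = 1, n
     call smooth_column(a(:, i), tmp, m)
  end do

  ! every element of a should be 2.0
  print *, a(1, 1), a(m, n)
end program private_scratch_demo
```

Whether the private copies fit comfortably depends on the array sizes and the number of scratch arrays, so large or numerous arrays may still need the heap size increase mentioned above.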

-Mat