"!$acc routine vector" leads to 32 threads per gang

Hi all,

I’m trying to offload some existing OpenMP code that uses threading for the outer loop and vectorization for the inner loop. The vector operations are contained in separate routines, not inlined within the outer loop body.

When I switch this to OpenACC, I use a gang loop for the outer loop and then I decorate the various functions with “!$acc routine vector” plus vector loop statements as needed.

My test codes are compiling, running, and producing correct results, but I notice that the vector length is always 32 regardless of what I specify in the parallel directive. This will make getting good occupancy very hard.

If I inline the function’s loop body, the vector length matches what I specify in the vector_length clause.

I noticed in this PGI brochure

https://www.pgroup.com/lit/brochures/openacc_sc14.pdf

that a vector_length(32) is implicitly added. My real code is very large, so inlining is not an option.

Is there a way to avoid the vector length being restricted to 32 when calling ‘vector’ routines?

I have experimented with adding several workers with vector_length(32) so that I get num_workers * vector_length threads active per gang. That is, instead of vector inner loops, I have ‘worker(4), vector(32)’ loops and I decorate the functions as ‘worker’. This produces correct results on my test codes and seems to work with reduction and normal loops.

This is the only way I’ve found of increasing the number of threads per gang when calling ‘!$acc routine’ subprograms. Is this the correct approach?
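A minimal sketch of what I mean, very much simplified (routine and variable names are made up, and the remaining arguments are elided):

! Caller: 4 workers x 32 vector lanes = 128 active threads per gang
!$acc parallel loop gang num_workers(4) vector_length(32)
do iblock=1,nblock
  call block_kernel(1, block_size, ...)
end do

! Callee: decorated as a worker routine instead of a vector routine
subroutine block_kernel(istart, iend, ...)
  !$acc routine worker
  !$acc loop worker vector
  do i=istart, iend
  ...
  end do
end subroutine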

Thanks,

Chris

Hi Chris,

Is there a way to avoid the vector length being restricted to 32 when calling ‘vector’ routines?

There is an undocumented flag, “-ta=tesla:gvmode” (Gang-Vector mode), that will switch back to our old method of allowing vector lengths greater than 32 for vector routines. However, we put in this limitation since we found performance to be better for most codes. Having a vector length greater than 32 requires significantly more thread synchronization calls, which can slow down codes. Also, for reductions in vector routines, we have to use a different implementation method, which is slower as well.
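For example, the flag is added to the target accelerator options on the compile line (assuming pgfortran here; adjust to your own build):

pgfortran -acc -ta=tesla:gvmode -Minfo=accel -o app main.f90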

Give it a try and please let us know if it helps. If so, I may ask if we can document the flag again.

-Mat

Found this topic by coincidence and just wanted to point out that we do have a use case where performance improves by up to 20% with larger vector lengths. There, the outer and inner loop iterations are independent and mapped to gang and vector parallelism, respectively, but can’t be collapsed. Therefore, thread synchronization and reduction operations don’t add too much overhead.

So, just a humble request to keep this option at the very least, or maybe even to document it in the future.

Hi Balthasar.reuter,

Just to clarify, “gvmode” only applies to vector loops within vector routines; it doesn’t affect vector loops in compute regions. In compute regions the vector length defaults to 128 and can be modified via the vector_length clause.

Are you using vector routines?

-Mat

Hi Mat,

Thanks for the clarification. Our main GPU adaptation recipe for a class of algorithms in our application does indeed rely on vector routines. Very much simplified, the control flow is as follows:

!$acc parallel loop gang vector_length(block_size)
do iblock=1,nblock
  call vector_routine(1, block_size, ...)
end do

with the vector routines exposing another level of parallelism as

subroutine vector_routine(istart, iend, ...)
  !$acc routine vector
  !$acc loop vector
  do i=istart, iend
  ...
  end do
end subroutine

with a fairly substantial amount of compute inside the vector loop but, notably, no data dependencies between loop iterations. Performance is limited by register file pressure. More complex variants of this control flow consist of multiple calls to vector routines from the same block loop, and nested calls inside the vector routine, which occasionally split the vector loop.

We recently found that the vector_length setting did not have any effect. However, we have a variation of this recipe where the vector loop is hoisted to the caller side, making the routine sequential; there, vector_length worked as expected. Without this, performance suffered for block sizes != 128. Using “gvmode”, we are less dependent on the block size and achieve comparable performance, in particular for smaller block sizes, which improves interoperability with CPU code paths, where smaller block sizes improve cache efficiency.

Thanks again,
Balthasar