"!$acc routine vector" leads to 32 threads per gan

Hi all,

I’m trying to offload some existing OpenMP code that uses threading for the outer loop and vectorization for the inner loop. The vector operations are contained in separate routines, not inlined within the outer loop body.

When I switch this to OpenACC, I use a gang loop for the outer loop and then I decorate the various functions with “!$acc routine vector” plus vector loop statements as needed.
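A minimal sketch of that pattern, not my real code: the names (kernels, scale_row, x) and the sizes are made up for illustration, and vector_length(128) is just an example value.

```fortran
module kernels
  implicit none
contains
  ! Hypothetical helper standing in for one of the vectorized routines.
  subroutine scale_row(n, a, row)
    !$acc routine vector
    integer, intent(in)    :: n
    real,    intent(in)    :: a
    real,    intent(inout) :: row(n)
    integer :: j
    ! Inner loop is spread across the vector lanes of the calling gang.
    !$acc loop vector
    do j = 1, n
      row(j) = a * row(j)
    end do
  end subroutine scale_row
end module kernels

program gang_vector_demo
  use kernels
  implicit none
  integer, parameter :: m = 64, n = 256
  real    :: x(n, m)
  integer :: i

  x = 1.0
  ! Outer loop over gangs; vector_length(128) is what I ask for,
  ! but with the routine call the compiler reports a length of 32.
  !$acc parallel loop gang vector_length(128) copy(x)
  do i = 1, m
    call scale_row(n, 2.0, x(:, i))
  end do

  ! Every element was 1.0 and is doubled once, so the total is 2*m*n.
  print *, 'sum = ', sum(x), ' expected = ', 2.0 * real(m * n)
end program gang_vector_demo
```

Compiling with accelerator feedback enabled (e.g. -Minfo=accel) shows the vector length the compiler actually chose.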

My test codes compile, run, and produce correct results, but I notice that the vector length is always 32 regardless of what I specify on the parallel construct. This will make getting good occupancy very hard.

If I inline the function’s loop body, the vector length matches what I specify in the vector_length clause.

I noticed that, according to this PGI brochure,

https://www.pgroup.com/lit/brochures/openacc_sc14.pdf

a vector_length(32) is implicitly added. My real code is very large, so inlining is not an option.

Is there a way to avoid the vector length being restricted to 32 when calling ‘vector’ routines?

I have experimented with adding several workers with vector_length(32) so that I get num_workers * vector_length threads active per gang. That is, instead of vector inner loops, I have ‘worker(4), vector(32)’ loops and I decorate the functions as ‘worker’. This produces correct results on my test codes and seems to work with reduction and normal loops.
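A sketch of that workaround, again with made-up names (kernels_w, scale_row_w, x) and sizes. One assumption to flag: here the sizes are set with num_workers(4) and vector_length(32) on the parallel construct and the routine uses a plain "loop worker vector", which may not be exactly how my test codes spell it.

```fortran
module kernels_w
  implicit none
contains
  ! Hypothetical helper, now declared as a worker routine so it may
  ! contain both worker-level and vector-level parallelism.
  subroutine scale_row_w(n, a, row)
    !$acc routine worker
    integer, intent(in)    :: n
    real,    intent(in)    :: a
    real,    intent(inout) :: row(n)
    integer :: j
    ! Iterations are shared across workers and the vector lanes of each.
    !$acc loop worker vector
    do j = 1, n
      row(j) = a * row(j)
    end do
  end subroutine scale_row_w
end module kernels_w

program gang_worker_demo
  use kernels_w
  implicit none
  integer, parameter :: m = 64, n = 256
  real    :: x(n, m)
  integer :: i

  x = 1.0
  ! 4 workers x 32 vector lanes = 128 threads requested per gang.
  !$acc parallel loop gang num_workers(4) vector_length(32) copy(x)
  do i = 1, m
    call scale_row_w(n, 2.0, x(:, i))
  end do

  ! Every element was 1.0 and is doubled once, so the total is 2*m*n.
  print *, 'sum = ', sum(x), ' expected = ', 2.0 * real(m * n)
end program gang_worker_demo
```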

This is the only way I’ve found of increasing the # of threads per gang when calling $acc routines. Is this the correct approach?

Thanks,

Chris

Hi Chris,

Is there a way to avoid the vector length being restricted to 32 when calling ‘vector’ routines?

There is an undocumented flag, “-ta=tesla:gvmode” (Gang-Vector mode), that switches back to our old method, which allowed vector lengths greater than 32 for vector routines. However, we put this limitation in because we found performance to be better for most codes. A vector length greater than 32 requires significantly more thread synchronization calls, which can slow codes down. Also, for reductions in vector routines, we have to use a different implementation method, which is slower as well.

Give it a try and please let us know if it helps. If so, I may ask if we can document the flag again.

-Mat