I’m trying to offload some existing OpenMP code that uses threading for the outer loop and vectorization for the inner loop. The vector operations are contained in separate routines, not inlined within the outer loop body.
When I port this to OpenACC, I use a gang loop for the outer loop, decorate the various routines with “!$acc routine vector”, and add vector loop directives inside them as needed.
My test codes compile, run, and produce correct results, but I notice that the vector length is always 32 regardless of what I specify on the parallel directive. This will make getting good occupancy very hard.
If I inline the routine’s loop body, the vector length matches what I specify in the vector_length clause.
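A minimal sketch of the pattern described above may help make the question concrete. The names (kernels, scale_row, nrows, ncols) and the requested vector_length(128) are hypothetical stand-ins, not taken from the real code:

```fortran
module kernels
  implicit none
contains
  ! Hypothetical vector routine: scales one row in place.
  subroutine scale_row(row, n, factor)
    !$acc routine vector
    integer, intent(in)    :: n
    real,    intent(inout) :: row(n)
    real,    intent(in)    :: factor
    integer :: j
    !$acc loop vector
    do j = 1, n
      row(j) = row(j) * factor
    end do
  end subroutine scale_row
end module kernels

program gang_vector_demo
  use kernels
  implicit none
  integer, parameter :: nrows = 64, ncols = 256
  real    :: a(ncols, nrows)
  integer :: i

  a = 1.0
  ! Outer loop over gangs; vector_length(128) is the requested value
  ! that reportedly ends up as 32 once a vector routine is called.
  !$acc parallel loop gang vector_length(128) copy(a)
  do i = 1, nrows
    call scale_row(a(:, i), ncols, 2.0)
  end do

  ! Every element was 1.0 and is scaled by 2.0, so the sum is known.
  if (abs(sum(a) - 2.0 * real(nrows * ncols)) > 0.5) stop 1
  print *, 'sum = ', sum(a)
end program gang_vector_demo
```

Without OpenACC enabled the directives are treated as comments, so the same source also runs serially on the host, which is handy for checking results.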
I noticed in this PGI brochure that a vector_length(32) is implicitly added. My real code is very large, so inlining is not an option.
Is there a way to avoid the vector length being restricted to 32 when calling ‘vector’ routines?
I have experimented with adding several workers with vector_length(32) so that I get num_workers * vector_length threads active per gang. That is, instead of vector inner loops, I have ‘worker(4), vector(32)’ loops and I decorate the functions as ‘worker’. This produces correct results on my test codes and seems to work with reduction and normal loops.
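The workaround above could be sketched roughly as follows. This assumes the worker count and vector length are set with num_workers(4) and vector_length(32) on the parallel directive and that the routine's loop is marked worker vector; the names (kernels_w, scale_row_w) are hypothetical:

```fortran
module kernels_w
  implicit none
contains
  ! Hypothetical worker routine: the inner loop is spread over
  ! workers and vector lanes instead of vector lanes only.
  subroutine scale_row_w(row, n, factor)
    !$acc routine worker
    integer, intent(in)    :: n
    real,    intent(inout) :: row(n)
    real,    intent(in)    :: factor
    integer :: j
    !$acc loop worker vector
    do j = 1, n
      row(j) = row(j) * factor
    end do
  end subroutine scale_row_w
end module kernels_w

program gang_worker_demo
  use kernels_w
  implicit none
  integer, parameter :: nrows = 64, ncols = 256
  real    :: a(ncols, nrows)
  integer :: i

  a = 1.0
  ! Requests 4 workers x 32 vector lanes = 128 threads per gang.
  !$acc parallel loop gang num_workers(4) vector_length(32) copy(a)
  do i = 1, nrows
    call scale_row_w(a(:, i), ncols, 2.0)
  end do

  ! Same known result as a serial run: every element becomes 2.0.
  if (abs(sum(a) - 2.0 * real(nrows * ncols)) > 0.5) stop 1
  print *, 'sum = ', sum(a)
end program gang_worker_demo
```

The only changes relative to a vector-only version are the routine's level of parallelism (worker instead of vector), the loop directive inside it, and the clauses on the parallel directive.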
This is the only way I’ve found to increase the number of threads per gang when calling “!$acc routine” procedures. Is this the correct approach?