I'm not sure where this is coming from, but you can work around it by also setting vector_length (see below). Note, though, that workers and vectors won't be used here since you have no parallel loops. Instead, “gang-redundant mode” applies (i.e. each gang executes the print redundantly), so using “num_workers” here isn’t really valid.
% cat test.f90
program foo
!$acc parallel num_gangs(4) num_workers(3) vector_length(1)
print*,"abc"
!$acc end parallel
end program foo
% pgfortran -ta=tesla test.f90 -Minfo=accel; a.out
foo:
2, Generating Tesla code
abc
abc
abc
abc
Thanks. Actually, I think “num_gangs” together with “num_workers” should be valid, unless I am missing something. I made up this example based on a similar one (Figure 15.5) in “Programming Massively Parallel Processors: A Hands-on Approach” by D.B. Kirk and W.W. Hwu, which is as follows:
So, just to clarify for myself: a “parallel” region without a loop executes only “num_gangs” copies, independent of the “num_workers” and “vector_length” sizes, whereas “parallel loop num_gangs(a) num_workers(b) vector_length(c)” would divide the total number of loop iterations into a × b × c independent chunks. Am I right?
Correct. A parallel region without a work-shared loop (i.e. without a loop directive) runs in “gang-redundant mode”: each gang executes the same executable statements. So it’s not that using “num_workers” here is invalid; it’s just not applicable in this case.
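To make gang-redundant execution visible without relying on print output, here is a minimal sketch that counts how many times the region body runs. The program name and the variable n_exec are made up for illustration. With four gangs one would expect the counter to end at 4, though an implementation may use fewer gangs than requested, and a build without OpenACC enabled would simply report 1.

```fortran
program gang_redundant_count
  implicit none
  integer :: n_exec   ! hypothetical counter, not from the original example

  n_exec = 0

  ! No loop directive inside the region, so every gang runs the body once.
  !$acc parallel num_gangs(4) copy(n_exec)
  ! The atomic update keeps concurrent increments from different gangs safe.
  !$acc atomic update
  n_exec = n_exec + 1
  !$acc end parallel

  print *, "region body executed", n_exec, "time(s)"
end program gang_redundant_count
```

It can be built the same way as the test above, e.g. with pgfortran -ta=tesla -Minfo=accel.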
If you did have one or more loops with a gang, worker, and/or vector clause, then yes, the loop iterations would be divided across the total number of gangs, workers, and vector lanes. However, best practice is not to set the number of gangs, workers, or the vector length explicitly except when tuning a specific algorithm, and instead to let the compiler and runtime choose these values based on the target device. Different target devices may need different values, so hard-coding these sizes reduces performance portability.
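As a rough illustration of the work-shared case, the sketch below puts gang and vector clauses on a loop and deliberately leaves out num_gangs, num_workers, and vector_length so the compiler and runtime pick the sizes. The program name, the array a, and the size n are assumptions chosen for the example only.

```fortran
program loop_work_shared
  implicit none
  integer, parameter :: n = 1024   ! arbitrary size for illustration
  integer :: i
  real :: a(n)

  ! The loop directive splits the n iterations across gangs and vector lanes.
  ! No explicit sizes are given, so they are chosen for the target device.
  !$acc parallel loop gang vector copyout(a)
  do i = 1, n
     a(i) = 2.0 * real(i)
  end do
  !$acc end parallel loop

  ! Each iteration ran exactly once, unlike the gang-redundant print.
  print *, a(1), a(n)
end program loop_work_shared
```

Compiling with -Minfo=accel, as in the test above, should report the loop schedule the compiler selected, which is a reasonable starting point before tuning any sizes by hand.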