OpenACC Loop Organization

Assume there is a loop nest inside a parallel region:

!$ACC PARALLEL LOOP GANG
DO k = …
   ! Something
   DO i = 1, 2
      DO j = 1, 4
         ! Something
      ENDDO
   ENDDO
   ! Something
END DO
!$ACC END PARALLEL LOOP

If I assign 8 workers to the inner loop nest, like this:

!$ACC PARALLEL LOOP GANG
DO k = …
   ! Something
   !$ACC LOOP WORKER(8)
   DO i = 1, 2
      DO j = 1, 4
         ! Something
      ENDDO
   ENDDO
   ! Something
END DO
!$ACC END PARALLEL LOOP

How does it behave?

Does it behave like
Worker 1 : i = 1, j = 1
Worker 2 : i = 1, j = 2
Worker 3 : i = 1, j = 3
Worker 4 : i = 1, j = 4
Worker 5 : i = 2, j = 1
Worker 6 : i = 2, j = 2
Worker 7 : i = 2, j = 3
Worker 8 : i = 2, j = 4

or
Worker 1 : i = 1, j = 1, 2, 3, 4
Worker 2 : i = 2, j = 1, 2, 3, 4
Worker 3 to 8 : Not Generated

What if I write the loops like this:

!$ACC PARALLEL LOOP GANG
DO k = …
   ! Something
   !$ACC LOOP WORKER(2)
   DO i = 1, 2
      !$ACC LOOP WORKER(4)
      DO j = 1, 4
         ! Something
      ENDDO
   ENDDO
   ! Something
END DO
!$ACC END PARALLEL LOOP

Is this grammatically correct? If so, how does it behave?

Hi CNJ,

It would behave like the second case, where only two workers are used, unless the compiler automatically adds “vector” to the “j” loop (PGI might do this if it can detect that the “j” loop is independent, but it’s not mandatory).

Nesting “worker” clauses is technically illegal. Instead, either collapse the loops, or use a “vector” clause.

!$ACC LOOP WORKER COLLAPSE(2)
DO i = 1, 2
   DO j = 1, 4

or

!$ACC LOOP WORKER
DO i = 1, 2
   !$ACC LOOP VECTOR
   DO j = 1, 4
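
As a fuller sketch of the collapse variant, here is a small self-contained program. The outer extent nk, the array a, and the values stored in it are hypothetical and only there so the example has something to check. With collapse(2) the i and j loops are treated as a single 8-iteration loop, and those iterations are shared among however many workers the compiler decides to use.

```fortran
program collapse_worker
  implicit none
  integer, parameter :: nk = 3   ! hypothetical outer loop extent
  integer :: a(2, 4, nk)         ! hypothetical work array
  integer :: i, j, k

  a = 0

  !$acc parallel loop gang copy(a)
  do k = 1, nk
     ! collapse(2) merges the i and j loops into one loop of 2*4 = 8
     ! iterations, which is then distributed across the gang's workers.
     !$acc loop worker collapse(2)
     do i = 1, 2
        do j = 1, 4
           a(i, j, k) = 100*k + 10*i + j
        end do
     end do
  end do
  !$acc end parallel loop

  ! Last element written: 100*3 + 10*2 + 4 = 324
  print *, a(2, 4, nk)
end program collapse_worker
```

Because the directives are Fortran comments, the same source also builds and runs serially when OpenACC is not enabled, which is a convenient way to check the results.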
  • Mat

!$ACC LOOP WORKER COLLAPSE(2)
DO i = 1, 2
   DO j = 1, 4

Does this always guarantee that the compiler launches 8 workers?

If not, by which way can I achieve it?

Does this always guarantee that the compiler launches 8 workers?

No. The compiler is free to choose the number of workers based on the target device.

If not, by which way can I achieve it?

Add the “num_workers(8)” clause to the “parallel” directive.

Or use “worker(8)” on the loop directive if you’re using a “kernels” directive.
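
A minimal sketch of the “num_workers(8)” approach with “parallel”, assuming a hypothetical outer extent nk and work array a that are not part of your code. The clause requests 8 workers per gang; the worker loop itself carries no size argument.

```fortran
program fixed_workers
  implicit none
  integer, parameter :: nk = 3   ! hypothetical outer loop extent
  integer :: a(2, 4, nk)         ! hypothetical work array
  integer :: i, j, k
  logical :: ok

  a = 0

  ! num_workers(8) on the parallel directive requests 8 workers per gang.
  !$acc parallel loop gang num_workers(8) copy(a)
  do k = 1, nk
     ! The 8 collapsed (i, j) iterations are shared among those workers.
     !$acc loop worker collapse(2)
     do i = 1, 2
        do j = 1, 4
           a(i, j, k) = 1
        end do
     end do
  end do
  !$acc end parallel loop

  ! Every element should have been written exactly once.
  ok = all(a == 1)
  print *, 'all elements written: ', ok
end program fixed_workers
```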


I would recommend not setting the number of gangs, the number of workers, or the vector length explicitly. That way the compiler can pick the best values for the target device.

Note that NVIDIA devices schedule threads in warps of 32, so the minimum number of threads used is 32. If you set workers to 8 (and have no vector loop), 24 of those threads will be idle.
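
If you do want to fix the sizes anyway, one way to avoid the idle threads is to pair the worker loop with a vector loop whose length is a multiple of 32. The sketch below is only an illustration: the extents ni and nj and the array b are hypothetical (your inner loop only has 4 iterations, so it would not fill a warp), and it assumes the compiler maps workers and vector lanes onto the two dimensions of a CUDA thread block, which is an implementation choice rather than something OpenACC requires. Under that assumption, 8 workers x 32 vector lanes gives 256 threads per gang, i.e. 8 full warps.

```fortran
program worker_vector
  implicit none
  ! Hypothetical extents, chosen so that 8 workers x 32 vector lanes are all busy.
  integer, parameter :: nk = 3, ni = 8, nj = 32
  integer :: b(nj, ni, nk)       ! hypothetical output array
  integer :: i, j, k

  !$acc parallel loop gang num_workers(8) vector_length(32) copyout(b)
  do k = 1, nk
     ! One worker per i iteration ...
     !$acc loop worker
     do i = 1, ni
        ! ... and one vector lane per j iteration.
        !$acc loop vector
        do j = 1, nj
           b(j, i, k) = j + nj*(i - 1)
        end do
     end do
  end do
  !$acc end parallel loop

  ! Largest value stored: 32 + 32*(8 - 1) = 256
  print *, maxval(b)
end program worker_vector
```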

  • Mat