OpenACC Loop Organization

chlskawo12 · February 2, 2016, 7:53am

Assume there is a nested loop inside a parallel region:

!$ACC PARALLEL LOOP GANG
DO k = …
  ! Something
  DO i = 1, 2
    DO j = 1, 4
      ! Something
    ENDDO
  ENDDO
  ! Something
END DO
!$ACC END PARALLEL LOOP

If I assign 8 workers to the inner loop nest, like this:

!$ACC PARALLEL LOOP GANG
DO k = …
  ! Something
  !$ACC LOOP WORKER(8)
  DO i = 1, 2
    DO j = 1, 4
      ! Something
    ENDDO
  ENDDO
  ! Something
END DO
!$ACC END PARALLEL LOOP

How does it behave?

Does it behave like
Worker 1 : i = 1, j = 1
Worker 2 : i = 1, j = 2
Worker 3 : i = 1, j = 3
Worker 4 : i = 1, j = 4
Worker 5 : i = 2, j = 1
Worker 6 : i = 2, j = 2
Worker 7 : i = 2, j = 3
Worker 8 : i = 2, j = 4

or
Worker 1 : i = 1, j = 1, 2, 3, 4
Worker 2 : i = 2, j = 1, 2, 3, 4
Worker 3 to 8 : Not Generated

What if I write the loops like this:

!$ACC PARALLEL LOOP GANG
DO k = …
  ! Something
  !$ACC LOOP WORKER(2)
  DO i = 1, 2
    !$ACC LOOP WORKER(4)
    DO j = 1, 4
      ! Something
    ENDDO
  ENDDO
  ! Something
END DO
!$ACC END PARALLEL LOOP

Is this syntactically valid? If so, how does it behave?

MatColgrove · February 2, 2016, 6:07pm

Hi CNJ,

It would behave like the second case, where only two workers are used, unless the compiler automatically adds “vector” to the “j” loop (PGI might do this if it can detect that “j” is independent, but it’s not mandatory).

Nesting “worker” clauses is technically illegal. Instead, either collapse the loops, or use a “vector” clause.

!$ACC LOOP WORKER collapse(2)
 DO i = 1, 2 
 DO j = 1, 4

!$ACC LOOP WORKER
 DO i = 1, 2 
!$ACC LOOP VECTOR
 DO j = 1, 4
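
For reference, the two alternatives above could be filled out roughly as follows. This is only a sketch: the module name, the subroutine names, the array `a`, the trip count `nk`, and the loop body are all made up for illustration, since the original loops elide them. Without an OpenACC flag (for example `-acc` with the PGI/NVIDIA compilers or `-fopenacc` with gfortran) the directives are treated as comments and the loops simply run serially.

```fortran
module worker_mapping_demo
  implicit none
contains

  ! Variant 1: collapse the i and j loops into a single 8-iteration loop
  ! that is shared among the workers of each gang.
  subroutine fill_collapsed(a, nk)
    integer, intent(in)  :: nk              ! hypothetical outer trip count
    integer, intent(out) :: a(2, 4, nk)     ! hypothetical output array
    integer :: i, j, k

    !$acc parallel loop gang copyout(a)
    do k = 1, nk
      !$acc loop worker collapse(2)
      do i = 1, 2
        do j = 1, 4
          a(i, j, k) = 100*k + 10*i + j     ! stand-in for the real loop body
        end do
      end do
    end do
    !$acc end parallel loop
  end subroutine fill_collapsed

  ! Variant 2: workers across the i loop, vector lanes across the j loop.
  subroutine fill_worker_vector(a, nk)
    integer, intent(in)  :: nk
    integer, intent(out) :: a(2, 4, nk)
    integer :: i, j, k

    !$acc parallel loop gang copyout(a)
    do k = 1, nk
      !$acc loop worker
      do i = 1, 2
        !$acc loop vector
        do j = 1, 4
          a(i, j, k) = 100*k + 10*i + j     ! same stand-in body
        end do
      end do
    end do
    !$acc end parallel loop
  end subroutine fill_worker_vector

end module worker_mapping_demo
```

Both variants should produce the same array contents; they differ only in how the 8 inner iterations are spread over workers and vector lanes.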

Mat

chlskawo12 · February 5, 2016, 12:16am

!$ACC LOOP WORKER collapse(2)
DO i = 1, 2
DO j = 1, 4

Does this always guarantee that the compiler launches 8 workers?

If not, by which way can I achieve it?

MatColgrove · February 5, 2016, 12:36am

Does this always guarantee that the compiler launches 8 workers?

No. The compiler can choose based upon the target device.

If not, by which way can I achieve it?

Add the “num_workers(8)” clause to the “parallel” directive.

Or “worker(8)” if you’re using a “kernels” directive.
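
As a rough sketch of where those clauses would go (the module and subroutine names, the array `a`, the trip count `nk`, and the loop body are placeholders, not from the original code):

```fortran
module num_workers_demo
  implicit none
contains

  ! "parallel" construct: request 8 workers per gang with num_workers(8)
  ! on the parallel directive; the worker loop itself takes no argument.
  subroutine scale_parallel(a, nk)
    integer, intent(in)    :: nk            ! hypothetical outer trip count
    real,    intent(inout) :: a(2, 4, nk)   ! hypothetical data array
    integer :: i, j, k

    !$acc parallel loop gang num_workers(8) copy(a)
    do k = 1, nk
      !$acc loop worker collapse(2)
      do i = 1, 2
        do j = 1, 4
          a(i, j, k) = 2.0 * a(i, j, k)     ! stand-in for the real loop body
        end do
      end do
    end do
    !$acc end parallel loop
  end subroutine scale_parallel

  ! "kernels" construct: the worker count goes on the loop as worker(8).
  ! "independent" is added here as an assumption so the compiler does not
  ! have to prove the loops are free of dependences.
  subroutine scale_kernels(a, nk)
    integer, intent(in)    :: nk
    real,    intent(inout) :: a(2, 4, nk)
    integer :: i, j, k

    !$acc kernels copy(a)
    !$acc loop gang independent
    do k = 1, nk
      !$acc loop worker(8) collapse(2) independent
      do i = 1, 2
        do j = 1, 4
          a(i, j, k) = 2.0 * a(i, j, k)
        end do
      end do
    end do
    !$acc end kernels
  end subroutine scale_kernels

end module num_workers_demo
```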

I would recommend not explicitly setting the number of gangs or workers, or the vector length. That way the compiler can choose the best configuration for the target device.

Note that on NVIDIA devices, threads are scheduled in warps of 32, so the minimum number of threads used is 32. If you set workers to 8 (and have no vector loop), you’ll have 24 idle threads.
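
The arithmetic behind that figure can be sketched with a small helper. It assumes a simplified model in which one gang occupies num_workers × vector_length threads, rounded up to a whole number of 32-thread warps; the module and function names are made up for illustration, and the actual mapping is up to the compiler.

```fortran
module warp_padding_demo
  implicit none
  integer, parameter :: warp_size = 32     ! threads per warp on NVIDIA GPUs
contains

  ! Simplified model (an assumption, not a compiler guarantee): threads in
  ! use per gang = num_workers * vector_length, padded up to whole warps.
  pure integer function idle_threads(num_workers, vector_length) result(idle)
    integer, intent(in) :: num_workers, vector_length
    integer :: active, padded

    active = num_workers * vector_length
    padded = ((active + warp_size - 1) / warp_size) * warp_size
    idle   = padded - active
  end function idle_threads

end module warp_padding_demo
```

Under this model, `idle_threads(8, 1)` gives 24, matching the case above, while 8 workers with a vector length of 4 would fill the warp exactly and leave no idle threads.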

Mat
