About gang and worker

The code is as follows:
!$acc kernels
DO j=jtf,jtg
  DO k=kts,kte
    DO i=its,ite
      work(i,k,j)=dc05*(u0(i,k,j)+u0(i-1,k,j))
    ENDDO
  ENDDO
ENDDO
!$acc end kernels


The compiler feedback is as follows:
3420, Loop is parallelizable
3421, Loop is parallelizable
3422, Loop is parallelizable
Accelerator kernel generated
3420, Cached references to size [(x+1)x(y)] block of ‘u0’
3421, !$acc loop gang, vector(4) ! blockidx%y threadidx%y
3422, !$acc loop gang, vector(64) ! blockidx%x threadidx%x
CC 1.0 : 26 registers; 112 shared, 8 constant, 0 local memory bytes
CC 2.0 : 22 registers; 0 shared, 124 constant, 0 local memory bytes


My questions are:
1. Is the parallelization of the J loop at the gang level? Why did the compiler not allocate worker-level parallelism?

2. From line 3420, what does “Cached references to size [(x+1)x(y)] block of ‘u0’” mean? Is this the result of the !$acc cache(u0) directive?

Hi Telsalady,

From the compiler feedback messages, it appears to me that the compiler has built the schedule out of the k and i loops and left the j loop to run sequentially inside the kernel. The reasoning is that this loop nest doesn’t have much work, so by serializing j you increase the amount of work each kernel instance does. Especially since the code utilizes cached memory, this helps increase the computational intensity.

While not guaranteed, the compiler usually does a good job of finding an optimal schedule. However, you can use loop directives to override the compiler’s default schedule if you want to try other ones.
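For example, here is a minimal sketch (reusing the loop nest above) of specifying a schedule by hand inside the kernels region; the vector lengths here are illustrative choices, not recommendations:

!$acc kernels
DO j=jtf,jtg
  !$acc loop gang vector(4)
  DO k=kts,kte
    !$acc loop gang vector(128)
    DO i=its,ite
      work(i,k,j)=dc05*(u0(i,k,j)+u0(i-1,k,j))
    ENDDO
  ENDDO
ENDDO
!$acc end kernels

The compiler feedback will report whichever schedule it actually generates, so it is worth re-checking the messages after changing the directives.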

Why did the compiler not allocate worker-level parallelism?

The “worker” construct on an NVIDIA device corresponds to the warp. The warp size is fixed at 32, so it can’t be changed.
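If you do want worker-level parallelism to appear in the schedule, you can request it explicitly with loop clauses. A minimal sketch, where the worker and vector sizes are only illustrative and the final mapping to CUDA threads is still up to the compiler:

!$acc kernels
!$acc loop gang
DO j=jtf,jtg
  !$acc loop worker(4)
  DO k=kts,kte
    !$acc loop vector(32)
    DO i=its,ite
      work(i,k,j)=dc05*(u0(i,k,j)+u0(i-1,k,j))
    ENDDO
  ENDDO
ENDDO
!$acc end kernels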

2. From line 3420, what does “Cached references to size [(x+1)x(y)] block of ‘u0’” mean? Is this the result of the !$acc cache(u0) directive?

This is the compiler auto-detecting where to apply caching. You could use the cache directive, but the PGI compiler will automatically find these opportunities.
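If you want to place it yourself, the cache directive goes at the top of the loop body whose references you want cached. A hedged sketch for the loop above, where the subarray bounds are chosen to cover both the i and i-1 references:

!$acc kernels
DO j=jtf,jtg
  DO k=kts,kte
    DO i=its,ite
      !$acc cache(u0(its-1:ite,k,j))
      work(i,k,j)=dc05*(u0(i,k,j)+u0(i-1,k,j))
    ENDDO
  ENDDO
ENDDO
!$acc end kernels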

  • Mat

Thanks Mat!

I am a bit confused about the copy clause. Does the copy clause copy data from the host to the GPU’s global memory?

Thanks again

The “copy” clause allocates memory on the device and then copies the data to the device’s global memory at the start of the region (data or compute). At the end of the region, the data is copied back to the host and the device memory is deallocated.

The “copyin” clause copies the data to the device but does not copy it back to the host, while the “copyout” clause only copies the data back to the host and does not copy it to the device.

The “create” clause only allocates and deallocates the device memory; it does not perform any copies.

The “update” directive can be used within a data region to copy data to or from the device at specific points in your program.
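Here is a minimal, self-contained sketch pulling these clauses together; the array names and sizes are made up for illustration, not taken from the code above:

PROGRAM data_clauses
  IMPLICIT NONE
  INTEGER, PARAMETER :: n = 1024
  REAL :: a(n), b(n), c(n), tmp(n)
  INTEGER :: i

  a = 1.0
  b = 0.0

  ! copyin: host -> device only; copy: both directions; copyout: device -> host
  ! only; create: device allocation with no copies in either direction.
  !$acc data copyin(a) copy(b) copyout(c) create(tmp)
    !$acc kernels
    DO i = 1, n
      tmp(i) = 2.0*a(i)        ! tmp exists only on the device (create)
      b(i)   = b(i) + tmp(i)   ! b was copied in and is copied back (copy)
      c(i)   = tmp(i)          ! c is only copied back at region exit (copyout)
    ENDDO
    !$acc end kernels
    ! update host(b) refreshes the host copy mid-region without ending it.
    !$acc update host(b)
    PRINT *, 'b(1) after update = ', b(1)
  !$acc end data
END PROGRAM data_clauses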

  • Mat