parallel + independent and kernels + vector_length()

PaulPa · August 2, 2012, 12:19pm

Hello,

I am currently working with the new PGI compiler 12.6 (Linux) and I am still running into problems with some OpenACC directives.

The compiler does not seem to be capable of handling the independent clause within a parallel region. Is that supposed to be that way, or is this feature still missing?

I encounter a similar problem if I try to compile my code with a vector_length clause within a kernels region (the vector_length clause is part of an acc loop directive within the kernels region). What am I doing wrong here?

Furthermore, I noticed that the 12.6 compiler decides to schedule the workload among gangs and vectors even though I explicitly tell it to schedule the work among gangs only. Is this behaviour expected or is it a bug?

If you guys want, I can provide you with the source code and/or file a bug report.

Best,
Paul

MatColgrove · August 2, 2012, 4:47pm

Hi Paul,

The compiler does not seem to be capable of handling the independent clause within a parallel region.

Per section 2.8.6 of the OpenACC spec, the independent clause only applies to loop directives within kernels regions.

I encounter a similar problem if I try to compile my code with a vector_length clause within a kernels region (the vector_length clause is part of an acc loop directive within the kernels region). What am I doing wrong here?

Per section 2.4.7, the vector_length clause is only allowed on the parallel construct.
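
To make the two placement rules concrete, here is a minimal sketch in C. The function names and the scaling loop are made up for illustration and are not from Paul's code. It shows independent on a loop directive inside a kernels region, and vector_length on the parallel construct itself. Without an OpenACC compiler the pragmas are ignored and the loops run sequentially.

```c
#include <assert.h>

/* kernels region: "independent" is placed on the loop directive. */
void scale_kernels(float *y, const float *x, int n, float a)
{
    assert(n >= 0);
    #pragma acc kernels copyin(x[0:n]) copyout(y[0:n])
    {
        #pragma acc loop independent
        for (int i = 0; i < n; ++i)
            y[i] = a * x[i];
    }
}

/* parallel region: "vector_length" is placed on the parallel construct,
 * not on the loop directive. The value 128 is an arbitrary example. */
void scale_parallel(float *y, const float *x, int n, float a)
{
    assert(n >= 0);
    #pragma acc parallel vector_length(128) copyin(x[0:n]) copyout(y[0:n])
    {
        #pragma acc loop gang vector
        for (int i = 0; i < n; ++i)
            y[i] = a * x[i];
    }
}
```

Both functions compute the same result. Only the construct, and therefore where each clause may appear, differs.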

Furthermore, I noticed that the 12.6 compiler decides to schedule the workload among gangs and vectors even though I explicitly tell it to schedule the work among gangs only. Is this behaviour expected or is it a bug?

Can you post a basic example of the code (not necessarily the specific code, but more its structure)? My guess is that you have non-tightly nested loops, or a parallel region with tightly nested loops but no collapse clause.

For example, in the non-tightly nested case, a gang schedule on the inner loop is illegal because of the code between the two loops. The inner loop would need to be executed by the vectors in order to create a valid kernel.

!$acc parallel
!$acc loop gang
do i=1,N
   ! ... do something
!$acc loop gang ! << this will be ignored
   do j=1,M
      ! ... do more
   end do
end do
!$acc end parallel
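
The non-tightly nested case can also be sketched in C with the inner loop scheduled across the vector lanes, which is what the explanation above says is needed for a valid kernel. The row-sum computation and the name row_sums are assumptions chosen only to have some work between the two loops.

```c
#include <assert.h>

/* Outer loop across gangs, inner loop across vector lanes.
 * The statements between the two loops are what make the nest
 * "non-tightly nested". */
void row_sums(double *sums, const double *a, int n, int m)
{
    assert(n >= 0 && m >= 0);
    #pragma acc parallel copyin(a[0:n*m]) copyout(sums[0:n])
    {
        #pragma acc loop gang
        for (int i = 0; i < n; ++i) {
            double s = 0.0;                      /* work between the loops */
            #pragma acc loop vector reduction(+:s)
            for (int j = 0; j < m; ++j)
                s += a[i*m + j];
            sums[i] = s;                         /* work after the inner loop */
        }
    }
}
```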

In the tightly nested loop case, the default is to schedule the outer loop as the gang. To schedule multiple loops across gangs, the collapse clause is needed:

!$acc parallel
!$acc loop gang collapse(2)
do i=1,N
   do j=1,M
      ! ... do more
   end do
end do
!$acc end parallel
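
As a rough illustration of what collapse(2) means, the C sketch below pairs a collapsed loop nest with a hand-flattened loop over the same iteration space. The function names and the values written are hypothetical. The flattened version only shows the index arithmetic and is not a claim about the code the compiler actually generates.

```c
#include <assert.h>

/* Tightly nested loops: collapse(2) lets the gang schedule cover
 * the combined n*m iteration space instead of only the i-loop. */
void fill_collapsed(int *out, int n, int m)
{
    assert(n >= 0 && m >= 0);
    #pragma acc parallel copyout(out[0:n*m])
    {
        #pragma acc loop gang collapse(2)
        for (int i = 0; i < n; ++i)
            for (int j = 0; j < m; ++j)
                out[i*m + j] = 10*i + j;
    }
}

/* Hand-flattened equivalent: one loop over n*m iterations, with
 * (i, j) recovered from the single index. */
void fill_flattened(int *out, int n, int m)
{
    assert(n >= 0 && m >= 0);
    for (int idx = 0; idx < n*m; ++idx) {
        int i = idx / m;
        int j = idx % m;
        out[idx] = 10*i + j;
    }
}
```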

Hope this helps,
Mat

PaulPa · August 15, 2012, 9:03am

Hi Mat,

Thank you for your help, and once again sorry for the late response.

Here is a code snippet:

#pragma acc kernels present(Ahat[0:n*k],x[0:k],tmpArray[0:n*numBlocksK])   
{	
#pragma acc loop independent gang collapse(2) 
	for (int i=0; i<numBlocksN; i++) {
	    for(int j=0; j<numBlocksK; j++) {
#pragma acc loop independent vector 
			for(int l = 0 ; l < BLOCK_SIZE ; ++l){
				precision tmp;
				tmp = 0.0;
#pragma unroll(UNROLL_SIZE)
				for(int m = 0 ; m < BLOCK_SIZE ; ++m){
					tmp += Ahat[(i*BLOCK_SIZE +l)* k + j*BLOCK_SIZE + m] * x[j*BLOCK_SIZE + m];
				}
				tmpArray[(i*BLOCK_SIZE + l ) * numBlocksK + j] += tmp;
			}
	    } // for j
	} // for i
}

The compiler output looks like this:

 95, Loop is parallelizable
         Accelerator kernel generated
         92, #pragma acc loop gang /* blockIdx.y */
         93, #pragma acc loop gang, vector(64) /* blockIdx.x threadIdx.x */
         95, #pragma acc loop vector(4) /* threadIdx.y */
             CC 2.0 : 28 registers; 0 shared, 84 constant, 0 local memory bytes

Line 92 corresponds to the i-loop.

Section 2.4.7, the vector_length is only allowed on the parallel construct.

Is there any other way to influence the vector length within the kernels region?
The vector length is a performance-critical parameter, so I don’t understand why the programmer would not be able to change this value.

Thank you.

Best,
Paul

MatColgrove · August 15, 2012, 6:31pm

Hi Paul,

For this code, it might be better to use the “parallel” construct rather than “kernels”. It will give you the finer-grained control you’re looking for. Something along the lines of:

#pragma acc parallel present(Ahat[0:n*k],x[0:k],tmpArray[0:n*numBlocksK]) vector_length(256)   
{    
#pragma acc loop gang collapse(2)
   for (int i=0; i<numBlocksN; i++) {
       for(int j=0; j<numBlocksK; j++) {
#pragma acc loop vector
         for(int l = 0 ; l < BLOCK_SIZE ; ++l){ 
            precision tmp; 
            tmp = 0.0; 
#pragma unroll(UNROLL_SIZE) 
            for(int m = 0 ; m < BLOCK_SIZE ; ++m){ 
               tmp += Ahat[(i*BLOCK_SIZE +l)* k + j*BLOCK_SIZE + m] * x[j*BLOCK_SIZE + m]; 
            } 
            tmpArray[(i*BLOCK_SIZE + l ) * numBlocksK + j] += tmp; 
         } 
       } // for j 
   } // for i 
}

Hope this helps,
Mat

PaulPa · August 19, 2012, 9:43am

Hi Mat,

Thanks for your reply.

I actually had such an implementation before, and it worked fine, but I would like to achieve the same thing with the kernels region as well.

Hence, will there be the possibility to change the vector_length within a kernels region?

Furthermore, do you know why the compiler schedules the workload among gangs and vectors?

Best,
Paul

MatColgrove · August 20, 2012, 4:31pm

Hence, will there be the possibility to change the vector_length within a kernels region?

In a kernels region, the vector clause on a loop directive can accept a width, for example “vector(128)”.

Furthermore, do you know why the compiler schedules the workload among gangs and vectors?

I guess I’m not understanding the question. A “gang” corresponds to a CUDA block, while a “vector” corresponds to the threads within a block. The compiler needs to schedule both, since this is how the thread execution model is organized on an NVIDIA device.
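
The gang/vector correspondence can be pictured with a purely sequential C emulation. It is only a mental model under simplifying assumptions (a 1-D loop, one gang per chunk of vector_length iterations), and the function name is made up. The outer loop plays the role of the CUDA block index and the inner loop the thread index within a block.

```c
#include <assert.h>

/* Sequential emulation of a gang/vector schedule for y += a*x.
 * "gang" stands in for blockIdx.x, "lane" for threadIdx.x. */
void saxpy_gang_vector(float *y, const float *x, float a, int n,
                       int vector_length)
{
    assert(n >= 0 && vector_length > 0);
    /* Enough gangs to cover all n iterations. */
    int num_gangs = (n + vector_length - 1) / vector_length;
    for (int gang = 0; gang < num_gangs; ++gang) {
        for (int lane = 0; lane < vector_length; ++lane) {
            int i = gang * vector_length + lane;  /* global iteration */
            if (i < n)                            /* guard the last gang */
                y[i] += a * x[i];
        }
    }
}
```

Even a loop that is nominally scheduled across gangs only still runs inside blocks that contain threads, which is why both levels appear in the compiler feedback.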

For kernels, let’s remove the “collapse” and explicitly schedule the second loop. You could also explicitly set the gang width.

#pragma acc kernels present(Ahat[0:n*k],x[0:k],tmpArray[0:n*numBlocksK])   
{   
#pragma acc loop independent gang
   for (int i=0; i<numBlocksN; i++) {
#pragma acc loop independent gang  // You can set the width here as well
       for(int j=0; j<numBlocksK; j++) {
#pragma acc loop independent vector(256)  // Vector length should be the same as BLOCK_SIZE
         for(int l = 0 ; l < BLOCK_SIZE ; ++l){
            precision tmp;
            tmp = 0.0;
#pragma unroll(UNROLL_SIZE)
            for(int m = 0 ; m < BLOCK_SIZE ; ++m){
               tmp += Ahat[(i*BLOCK_SIZE +l)* k + j*BLOCK_SIZE + m] * x[j*BLOCK_SIZE + m];
            }
            tmpArray[(i*BLOCK_SIZE + l ) * numBlocksK + j] += tmp;
         }
       } // for j
   } // for i
}
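
For anyone who wants to try the kernels version outside the original program, here is a self-contained sketch. Several things in it are assumptions, not part of the thread: BLOCK_SIZE is set to 4, precision is typedef'd to double, the present clause is replaced by copyin/copy so that no enclosing data region is needed, the unroll pragma is dropped, and reduce_partials is a hypothetical host-side helper that sums the per-block partial results.

```c
#include <assert.h>

#define BLOCK_SIZE 4        /* assumed value, for illustration only */
typedef double precision;   /* assumed; the thread does not show this typedef */

/* Per-(row, column-block) partial dot products, same loop structure
 * as the kernels version above. tmpArray must be zero-initialized. */
void blocked_partials(precision *tmpArray, const precision *Ahat,
                      const precision *x, int numBlocksN, int numBlocksK)
{
    int n = numBlocksN * BLOCK_SIZE;   /* rows of Ahat */
    int k = numBlocksK * BLOCK_SIZE;   /* columns of Ahat */
    assert(n > 0 && k > 0);

    #pragma acc kernels copyin(Ahat[0:n*k], x[0:k]) copy(tmpArray[0:n*numBlocksK])
    {
        #pragma acc loop independent gang
        for (int i = 0; i < numBlocksN; i++) {
            #pragma acc loop independent gang
            for (int j = 0; j < numBlocksK; j++) {
                #pragma acc loop independent vector(BLOCK_SIZE)
                for (int l = 0; l < BLOCK_SIZE; ++l) {
                    precision tmp = 0.0;
                    for (int m = 0; m < BLOCK_SIZE; ++m)
                        tmp += Ahat[(i*BLOCK_SIZE + l)*k + j*BLOCK_SIZE + m]
                             * x[j*BLOCK_SIZE + m];
                    tmpArray[(i*BLOCK_SIZE + l)*numBlocksK + j] += tmp;
                }
            }
        }
    }
}

/* Hypothetical helper: sum the partials of each row into y[row]. */
void reduce_partials(precision *y, const precision *tmpArray,
                     int n, int numBlocksK)
{
    for (int row = 0; row < n; ++row) {
        precision s = 0.0;
        for (int j = 0; j < numBlocksK; ++j)
            s += tmpArray[row*numBlocksK + j];
        y[row] = s;
    }
}
```

Calling blocked_partials and then reduce_partials should reproduce an ordinary matrix-vector product, which gives a simple way to check the result against a naive loop.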

Mat
