parallel + independent and kernels + vector_length()

Hello,

I am currently working with the new PGI compiler 12.6 (Linux), and I am still running into problems with some OpenACC directives.

The compiler does not seem to be capable of handling the independent clause within a parallel region. Is it supposed to be that way, or is this feature still missing?

I encounter a similar problem if I try to compile my code with a vector_length clause within a kernels region (the vector_length clause is part of an acc loop directive within the kernels region). What am I doing wrong here?

Furthermore, I noticed that the 12.6 compiler decides to schedule the workload among gangs and vectors even though I explicitly tell the compiler to schedule the work among gangs only. Is this behaviour expected, or is it a bug?

If you want, I can provide the source code and/or file a bug report.

Best,
Paul

Hi Paul,

The compiler does not seem to be capable of handling the independent clause within a parallel region.

Per section 2.8.6 of the OpenACC spec, the independent clause only applies to loop directives within kernels regions.
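
As a rough sketch of where the clause goes (the function name scale and the copyin/copyout data clauses are made up for illustration and are not taken from your code): in a kernels region the compiler has to prove that the loop has no loop-carried dependences before it parallelizes it, and with plain C pointers it often cannot rule out aliasing. The independent clause on the loop directive is the programmer's promise that the iterations do not depend on each other.

```c
/* Hypothetical helper: y[i] = a * x[i].
   The caller guarantees that x and y do not overlap; `independent`
   passes that guarantee on to the compiler. */
void scale(int n, float a, const float *x, float *y)
{
    #pragma acc kernels copyin(x[0:n]) copyout(y[0:n])
    {
        #pragma acc loop independent
        for (int i = 0; i < n; ++i) {
            y[i] = a * x[i];
        }
    }
}
```

A compiler without OpenACC support ignores the pragmas and runs the loop serially, so the result is the same either way.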

I encounter a similar problem if I try to compile my code with a vector_length clause within a kernels region (the vector_length clause is part of an acc loop directive within the kernels region). What am I doing wrong here?

Per section 2.4.7, the vector_length clause is only allowed on the parallel construct.

Furthermore, I noticed that the 12.6 compiler decides to schedule the workload among gangs and vectors even though I explicitly tell the compiler to schedule the work among gangs only. Is this behaviour expected, or is it a bug?

Can you post a basic example of the code (not necessarily the specific code, but more the structure of it)? My guess is that you have either non-tightly nested loops, or a parallel region with tightly nested loops but no collapse clause.

For example, in the non-tightly nested case, a gang schedule on the inner loop is illegal because of the code between the two loops. The inner loop needs to be executed by the vectors in order to create a valid kernel.

!$acc parallel
!$acc loop gang
do i=1,N
   ! ... do something
!$acc loop gang   ! << this will be ignored
   do j=1,M
      ! ... do more
   enddo
enddo
!$acc end parallel

In the tightly nested loop case, the default is to schedule only the outer loop across the gangs. In order to schedule multiple loops across the gangs, the collapse clause is needed:

!$acc parallel
!$acc loop gang collapse(2)
do i=1,N
   do j=1,M
      ! ... do more
   enddo
enddo
!$acc end parallel
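
To make the effect of collapse(2) more concrete, here is a small C sketch (the function names, the array out, and the values written into it are invented for illustration). Conceptually, collapsing two tightly nested loops gives one loop over N*M iterations, with i and j recovered from the flat index. The second function writes that out by hand; how a particular compiler actually implements the collapse may differ.

```c
/* Fill a row-major N x M array using a collapsed loop nest. */
void fill_collapsed(int N, int M, int *out)
{
    #pragma acc parallel copyout(out[0:N*M])
    {
        #pragma acc loop gang collapse(2)
        for (int i = 0; i < N; ++i) {
            for (int j = 0; j < M; ++j) {
                out[i * M + j] = 10 * i + j;
            }
        }
    }
}

/* The same iteration space flattened by hand: one loop of N*M
   iterations, with i and j recomputed from the flat index. */
void fill_flat(int N, int M, int *out)
{
    #pragma acc parallel copyout(out[0:N*M])
    {
        #pragma acc loop gang
        for (int idx = 0; idx < N * M; ++idx) {
            int i = idx / M;
            int j = idx % M;
            out[i * M + j] = 10 * i + j;
        }
    }
}
```

Both functions produce the same array. The point is only that the gangs are handed N*M work items instead of N.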

Hope this helps,
Mat

Hi Mat,

Thank you for your help, and once again, sorry for the late response.

Here is a code snippet:

#pragma acc kernels present(Ahat[0:n*k],x[0:k],tmpArray[0:n*numBlocksK])
{
#pragma acc loop independent gang collapse(2)
    for (int i = 0; i < numBlocksN; i++) {
        for (int j = 0; j < numBlocksK; j++) {
#pragma acc loop independent vector
            for (int l = 0; l < BLOCK_SIZE; ++l) {
                precision tmp;
                tmp = 0.0;
#pragma unroll(UNROLL_SIZE)
                for (int m = 0; m < BLOCK_SIZE; ++m) {
                    tmp += Ahat[(i*BLOCK_SIZE + l) * k + j*BLOCK_SIZE + m] * x[j*BLOCK_SIZE + m];
                }
                tmpArray[(i*BLOCK_SIZE + l) * numBlocksK + j] += tmp;
            }
        } // for j
    } // for i
}

The compiler output looks like this:

 95, Loop is parallelizable
         Accelerator kernel generated
         92, #pragma acc loop gang /* blockIdx.y */
         93, #pragma acc loop gang, vector(64) /* blockIdx.x threadIdx.x */
         95, #pragma acc loop vector(4) /* threadIdx.y */
             CC 2.0 : 28 registers; 0 shared, 84 constant, 0 local memory bytes

Line 92 corresponds to the i-loop.

Per section 2.4.7, the vector_length clause is only allowed on the parallel construct.

Is there any other way to influence the vector length within a kernels region?
The vector length is a performance-critical parameter, so I don’t understand why the programmer would not be able to change this value.

Thank you.

Best,
Paul

Hi Paul,

For this code, it might be better to use the “parallel” construct rather than “kernels”. It will give you the finer-grained control you’re looking for. Something along the lines of:

#pragma acc parallel present(Ahat[0:n*k],x[0:k],tmpArray[0:n*numBlocksK]) vector_length(256)   
{    
#pragma acc loop independent gang collapse(2) 
   for (int i=0; i<numBlocksN; i++) { 
       for(int j=0; j<numBlocksK; j++) { 
#pragma acc loop independent vector 
         for(int l = 0 ; l < BLOCK_SIZE ; ++l){ 
            precision tmp; 
            tmp = 0.0; 
#pragma unroll(UNROLL_SIZE) 
            for(int m = 0 ; m < BLOCK_SIZE ; ++m){ 
               tmp += Ahat[(i*BLOCK_SIZE +l)* k + j*BLOCK_SIZE + m] * x[j*BLOCK_SIZE + m]; 
            } 
            tmpArray[(i*BLOCK_SIZE + l ) * numBlocksK + j] += tmp; 
         } 
       } // for j 
   } // for i 
}

Hope this helps,
Mat

Hi Mat,

Thanks for your reply.

I actually had such an implementation before, and it worked fine, but I would like to achieve the same thing with a kernels region as well.

Hence, will there be the possibility to change the vector_length within a kernels region?

Furthermore, do you know why the compiler schedules the workload among gangs and vectors?

Best,
Paul

Hence, will there be the possibility to change the vector_length within a kernels region?

In a kernels region, the vector clause on a loop directive accepts a width, for example “vector(128)”.
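
Side by side, the two ways of setting the width look roughly like this. This is only a sketch: the saxpy functions and their data clauses are made up for illustration, and 128 is an arbitrary width, not a recommendation.

```c
/* parallel construct: the vector width is a clause on the construct. */
void saxpy_parallel(int n, float a, const float *x, float *y)
{
    #pragma acc parallel copyin(x[0:n]) copy(y[0:n]) vector_length(128)
    {
        #pragma acc loop gang vector
        for (int i = 0; i < n; ++i) {
            y[i] += a * x[i];
        }
    }
}

/* kernels construct: the width is an argument of the vector clause
   on the loop directive. */
void saxpy_kernels(int n, float a, const float *x, float *y)
{
    #pragma acc kernels copyin(x[0:n]) copy(y[0:n])
    {
        #pragma acc loop independent gang vector(128)
        for (int i = 0; i < n; ++i) {
            y[i] += a * x[i];
        }
    }
}
```

Both versions compute the same result. Only the place where the width is specified differs.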

Furthermore, do you know why the compiler schedules the workload among gangs and vectors?

I guess I’m not understanding the question. A “gang” corresponds to a CUDA thread block, while a “vector” corresponds to the threads within a block. The compiler needs to schedule both, since this is how the thread execution model is organized on an NVIDIA device (see: http://www.pgroup.com/lit/articles/insider/v2n1a5.htm).

For kernels, let’s remove the “collapse” and explicitly schedule the second loop. You can also explicitly set the gang width.
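
If it helps to picture it, here is a host-only C sketch (no OpenACC involved; the function and array names are invented) that emulates one possible way of splitting a loop of n iterations over gangs of vector_length lanes each. The blocked mapping i = gang * vector_length + lane is an assumption chosen for illustration, and a real compiler is free to distribute the iterations differently.

```c
/* Emulate a gang/vector decomposition on the host.
   owner_gang[i] and owner_lane[i] record which (gang, lane) pair
   would execute iteration i. On an NVIDIA target a gang plays the
   role of a thread block and a lane the role of a thread in it. */
void emulate_gang_vector(int n, int vector_length,
                         int *owner_gang, int *owner_lane)
{
    /* Enough gangs to cover all n iterations (rounded up). */
    int num_gangs = (n + vector_length - 1) / vector_length;

    for (int gang = 0; gang < num_gangs; ++gang) {
        for (int lane = 0; lane < vector_length; ++lane) {
            int i = gang * vector_length + lane;
            if (i < n) {            /* the last gang may be partly idle */
                owner_gang[i] = gang;
                owner_lane[i] = lane;
            }
        }
    }
}
```

Even a loop that is nominally scheduled as "gang only" still runs inside thread blocks, which is why some vector width always shows up in the compiler feedback.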

#pragma acc kernels present(Ahat[0:n*k],x[0:k],tmpArray[0:n*numBlocksK])   
{   
#pragma acc loop independent gang
   for (int i=0; i<numBlocksN; i++) {
#pragma acc loop independent gang  // You can set the width here as well
       for(int j=0; j<numBlocksK; j++) {
#pragma acc loop independent vector(256)  // Vector length should be the same as BLOCK_SIZE
         for(int l = 0 ; l < BLOCK_SIZE ; ++l){
            precision tmp;
            tmp = 0.0;
#pragma unroll(UNROLL_SIZE)
            for(int m = 0 ; m < BLOCK_SIZE ; ++m){
               tmp += Ahat[(i*BLOCK_SIZE +l)* k + j*BLOCK_SIZE + m] * x[j*BLOCK_SIZE + m];
            }
            tmpArray[(i*BLOCK_SIZE + l ) * numBlocksK + j] += tmp;
         }
       } // for j
   } // for i
}
Mat
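
For anyone who wants to try the kernels version outside the original application, below is a self-contained sketch with a host-side check. Several things in it are assumptions that do not come from the thread: BLOCK_SIZE is set to 32, precision is taken to be float, n and k are assumed to be multiples of BLOCK_SIZE, the present clause is replaced by copyin/copy so that no enclosing data region is needed, and the unroll pragma is dropped. The names blocked_partial and check_blocked are invented.

```c
#include <stdlib.h>

#define BLOCK_SIZE 32          /* assumed value */
typedef float precision;       /* assumed type */

/* tmpArray[row * numBlocksK + j] accumulates the contribution of
   column block j to output row `row`. */
void blocked_partial(int n, int k, const precision *Ahat,
                     const precision *x, precision *tmpArray)
{
    int numBlocksN = n / BLOCK_SIZE;   /* n, k: multiples of BLOCK_SIZE */
    int numBlocksK = k / BLOCK_SIZE;

    #pragma acc kernels copyin(Ahat[0:n*k], x[0:k]) copy(tmpArray[0:n*numBlocksK])
    {
        #pragma acc loop independent gang
        for (int i = 0; i < numBlocksN; i++) {
            #pragma acc loop independent gang
            for (int j = 0; j < numBlocksK; j++) {
                #pragma acc loop independent vector(32) /* same as BLOCK_SIZE */
                for (int l = 0; l < BLOCK_SIZE; ++l) {
                    precision tmp = 0;
                    for (int m = 0; m < BLOCK_SIZE; ++m) {
                        tmp += Ahat[(i*BLOCK_SIZE + l) * k + j*BLOCK_SIZE + m]
                             * x[j*BLOCK_SIZE + m];
                    }
                    tmpArray[(i*BLOCK_SIZE + l) * numBlocksK + j] += tmp;
                }
            }
        }
    }
}

/* Sum the partial results per row, compare against a plain
   matrix-vector product, and return the largest absolute difference
   (or -1.0 if allocation fails). */
double check_blocked(int n, int k)
{
    int numBlocksK = k / BLOCK_SIZE;
    precision *A   = malloc(sizeof *A * (size_t)n * k);
    precision *x   = malloc(sizeof *x * (size_t)k);
    precision *tmp = calloc((size_t)n * numBlocksK, sizeof *tmp);
    double max_err = 0.0;

    if (!A || !x || !tmp) { free(A); free(x); free(tmp); return -1.0; }

    /* Small integer test data, so the float sums are exact. */
    for (int c = 0; c < k; ++c) x[c] = (precision)(c % 5 - 2);
    for (int r = 0; r < n; ++r)
        for (int c = 0; c < k; ++c) A[r * k + c] = (precision)((r + c) % 7);

    blocked_partial(n, k, A, x, tmp);

    for (int r = 0; r < n; ++r) {
        double y = 0.0, ref = 0.0;
        for (int j = 0; j < numBlocksK; ++j) y += tmp[r * numBlocksK + j];
        for (int c = 0; c < k; ++c) ref += (double)A[r * k + c] * x[c];
        double d = y - ref;
        if (d < 0) d = -d;
        if (d > max_err) max_err = d;
    }
    free(A); free(x); free(tmp);
    return max_err;
}
```

Calling check_blocked(64, 96) should return a value at or very close to zero. Each (i, l, j) combination writes to a different element of tmpArray, which is what justifies the independent clauses on all three loops.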