I am currently working with the new PGI 12.6 compiler (Linux) and am still running into problems with some OpenACC directives.
The compiler does not seem to be capable of handling the independent clause within a parallel region. Is it supposed to be that way, or is this feature still missing?
I encounter a similar problem if I try to compile my code with a vector_length clause within a kernels region (the vector_length clause is part of an acc loop directive within the kernels region). What am I doing wrong here?
Furthermore, I noticed that the 12.6 compiler decides to schedule the workload among gangs and vectors even though I explicitly tell it to schedule the work among gangs only. Is this behaviour expected or is it a bug?
If you want, I can provide the source code and/or file a bug report.
The compiler does not seem to be capable of handling the independent clause within a parallel region.
Per section 2.8.6 of the OpenACC spec, the independent clause only applies to loop directives within kernels regions.
I encounter a similar problem if I try to compile my code with a vector_length clause within a kernels region (the vector_length clause is part of an acc loop directive within the kernels region). What am I doing wrong here?
Per section 2.4.7, the vector_length clause is only allowed on the parallel construct.
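To make the two placement rules concrete, here is a minimal sketch in C. The function names, arrays, and the vector length of 128 are made up for illustration, and the data clauses are just one plausible choice. Without an OpenACC compiler the pragmas are ignored and the loops simply run serially.

```c
#include <assert.h>

/* Hypothetical example: scale a vector inside a kernels region.
   The independent clause sits on the loop directive (section 2.8.6). */
void scale_kernels(int n, float a, const float *restrict x, float *restrict y)
{
    assert(n >= 0);
    #pragma acc kernels copyin(x[0:n]) copyout(y[0:n])
    {
        #pragma acc loop independent
        for (int i = 0; i < n; ++i) {
            y[i] = a * x[i];
        }
    }
}

/* Same loop inside a parallel region: vector_length is given on the
   parallel construct itself (section 2.4.7), not on the loop directive. */
void scale_parallel(int n, float a, const float *restrict x, float *restrict y)
{
    assert(n >= 0);
    #pragma acc parallel vector_length(128) copyin(x[0:n]) copyout(y[0:n])
    {
        #pragma acc loop gang vector
        for (int i = 0; i < n; ++i) {
            y[i] = a * x[i];
        }
    }
}
```

Both functions compute the same result; only the place where the scheduling hints are attached differs.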
Furthermore, I noticed that the 12.6 compiler decides to schedule the workload among gangs and vectors even though I explicitly tell it to schedule the work among gangs only. Is this behaviour expected or is it a bug?
Can you post a basic example of the code (not necessarily the specific code, but rather its structure)? My guess is that you have non-tightly nested loops, or a parallel region with tightly nested loops but no collapse clause.
For example, in the non-tightly nested case, a gang schedule on the inner loop is illegal because of the code between the two loops. The inner loop would need to be executed by the vectors in order to create a valid kernel.
!$acc parallel
!$acc loop gang
do i=1,N
   ! ... do something
   !$acc loop gang   ! << this will be ignored
   do j=1,M
      ! ... do more
   enddo
enddo
!$acc end parallel
In the tightly nested loop case, the default is to schedule the outer loop as the gang. In order to schedule multiple loops in a gang, the collapse clause is needed:
!$acc parallel
!$acc loop gang collapse(2)
do i=1,N
   do j=1,M
      ! ... do more
   enddo
enddo
!$acc end parallel
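For C users, the two loop shapes above could look roughly like this. It is only a sketch: the function names, array layout, and data clauses are made up for illustration, and the inner loop of the first function is scheduled on the vectors with a reduction, which is one way to keep the kernel valid. Without an OpenACC compiler the pragmas are ignored and the code runs serially.

```c
#include <assert.h>

/* Non-tightly nested: the statements between the two loops mean the inner
   loop cannot share the gang schedule, so it is mapped to the vectors. */
void row_sums(int n, int m, const float *restrict a, float *restrict out)
{
    assert(n >= 0 && m >= 0);
    #pragma acc parallel copyin(a[0:n*m]) copyout(out[0:n])
    {
        #pragma acc loop gang
        for (int i = 0; i < n; ++i) {
            float sum = 0.0f;               /* code between the two loops */
            #pragma acc loop vector reduction(+:sum)
            for (int j = 0; j < m; ++j) {
                sum += a[i * m + j];
            }
            out[i] = sum;
        }
    }
}

/* Tightly nested: collapse(2) lets both loops share the gang schedule. */
void fill_index(int n, int m, float *restrict a)
{
    assert(n >= 0 && m >= 0);
    #pragma acc parallel copyout(a[0:n*m])
    {
        #pragma acc loop gang collapse(2)
        for (int i = 0; i < n; ++i) {
            for (int j = 0; j < m; ++j) {
                a[i * m + j] = (float)(i * m + j);
            }
        }
    }
}
```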
Section 2.4.7, the vector_length is only allowed on the parallel construct.
Is there any other way to influence the vector length within the kernels region?
The vector_length is a performance-critical parameter, hence I don’t understand why the programmer would not be able to change this value.
For this code, it might be better to use the “parallel” construct rather than “kernels”. It will give you the finer-grained control you’re looking for. Something along the lines of:
#pragma acc parallel present(Ahat[0:n*k],x[0:k],tmpArray[0:n*numBlocksK]) vector_length(256)
{
    #pragma acc loop independent gang collapse(2)
    for (int i = 0; i < numBlocksN; i++) {
        for (int j = 0; j < numBlocksK; j++) {
            #pragma acc loop independent vector
            for (int l = 0; l < BLOCK_SIZE; ++l) {
                precision tmp;
                tmp = 0.0;
                #pragma unroll(UNROLL_SIZE)
                for (int m = 0; m < BLOCK_SIZE; ++m) {
                    tmp += Ahat[(i*BLOCK_SIZE + l) * k + j*BLOCK_SIZE + m] * x[j*BLOCK_SIZE + m];
                }
                tmpArray[(i*BLOCK_SIZE + l) * numBlocksK + j] += tmp;
            }
        } // for j
    } // for i
}
Hence, will there be the possibility to change the vector_length within a kernels region?
In a kernels region, the “vector” clause on a loop directive can accept a width, for example “vector(128)”.
Furthermore, do you know why the compiler schedules the workload among gangs and vectors?
I guess I’m not understanding the question. A “gang” corresponds to a CUDA block, while a “vector” corresponds to the threads within a block. The compiler needs to schedule both, since this is how the thread execution model is organized on an NVIDIA device.
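As a rough illustration of that correspondence, the sketch below shows how a flat iteration index could be split into a gang number and a vector lane, in the same way a CUDA global thread index splits into a block index and a thread index. The type and function names are hypothetical, and the schedule the compiler actually generates may differ from this simple blocked mapping.

```c
#include <assert.h>

/* Hypothetical helper type, for illustration only. */
typedef struct {
    int gang;   /* plays the role of the CUDA block index          */
    int lane;   /* plays the role of the thread index in the block */
} acc_coord;

/* Split a flat iteration index into (gang, vector lane). */
acc_coord coord_of(int i, int vector_length)
{
    assert(i >= 0 && vector_length > 0);
    acc_coord c = { i / vector_length, i % vector_length };
    return c;
}

/* Number of gangs needed to cover n iterations with the given vector length. */
int gangs_needed(int n, int vector_length)
{
    assert(n >= 0 && vector_length > 0);
    return (n + vector_length - 1) / vector_length;
}
```

With a vector length of 128, iteration 300 would land in gang 2, lane 44, and 300 iterations would need 3 gangs. Even a loop scheduled as "gang only" still runs inside blocks that contain threads, which is why both levels show up in the compiler output.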
For kernels, let’s remove the “collapse” clause and explicitly schedule the second loop. You could also explicitly set the gang width.
#pragma acc kernels present(Ahat[0:n*k],x[0:k],tmpArray[0:n*numBlocksK])
{
    #pragma acc loop independent gang
    for (int i = 0; i < numBlocksN; i++) {
        #pragma acc loop independent gang // You can set the width here as well
        for (int j = 0; j < numBlocksK; j++) {
            #pragma acc loop independent vector(256) // Vector length should be the same as BLOCK_SIZE
            for (int l = 0; l < BLOCK_SIZE; ++l) {
                precision tmp;
                tmp = 0.0;
                #pragma unroll(UNROLL_SIZE)
                for (int m = 0; m < BLOCK_SIZE; ++m) {
                    tmp += Ahat[(i*BLOCK_SIZE + l) * k + j*BLOCK_SIZE + m] * x[j*BLOCK_SIZE + m];
                }
                tmpArray[(i*BLOCK_SIZE + l) * numBlocksK + j] += tmp;
            }
        } // for j
    } // for i
}