 # Evaluating and understanding OpenACC and PGI

I am trying to understand how well OpenACC currently scales with the PGI tools. To that end, I am comparing the following completely parallelizable loop against a vectorized version of the same loop on the CPU:

```c
#pragma acc parallel loop copyin(d0[0:mz0],coeff[0:5],l1[0:2],l2[0:2]) copyout(r0[0:cnt])
for(i=0;i<cnt;i++){
    double f0=d0[i*m0];
    double s0=fabs(f0);
    aplint32 s1=f0>=0;               /* 1 if f0 is non-negative, else 0 */
    double s2=0.2316419*s0;
    double s3=1+s2;
    double s4=1.0/s3;
    /* degree-5 polynomial in s4, one coefficient per term */
    double s5=coeff[0]*pow(s4,1);
    s5+=coeff[1]*pow(s4,2);
    s5+=coeff[2]*pow(s4,3);
    s5+=coeff[3]*pow(s4,4);
    s5+=coeff[4]*pow(s4,5);
    double s6=s0*s0;
    double s7=((double)s6)/((double)-2);
    double s8=exp((double)s7);
    double s9=s8*s5;
    double s10=0.3989422804*s9;
    aplint32 s11=l1[s1];             /* s1 selects the lookup-table entry */
    double s12=s11+s10;
    aplint32 s13=l2[s1];
    double s14=s13*s12;
    r0[i]=s14;
}
```

I am trying to understand whether the main thing to do to make this code go fast is just the pragma above, or if there are additional things that need to be done to tune this code to make it go as fast as possible. At the moment I am able to see some acceleration, but it does not seem to be as much as one might expect from highly parallel purely scalar code.

I cannot seem to find a good “tuning guide” explaining how one might ensure that the code such as above is correct or optimal, or what compiler options PGI does best with.

> I cannot seem to find a good “tuning guide”

There’s a best practice guide in the works. The information is all out there, just a bit scattered. In the meantime, you can find some good training videos and papers at: http://www.openacc.org/content/education

For this loop I’d look at the trip count and how often data is moved. If the loop gets executed multiple times, then you should investigate using a data region higher up in the code to minimize data movement.
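A minimal sketch of the data region idea, assuming a loop that is run several times over the same array. The function name `scale_repeatedly` and its parameters are made up for illustration and are not from the original code:

```c
/* Apply the same element-wise update `steps` times.
   The data region keeps x on the device across all iterations, so it is
   copied in once before the first kernel and copied out once after the last. */
void scale_repeatedly(double *x, int n, int steps, double factor)
{
    #pragma acc data copy(x[0:n])
    {
        for (int s = 0; s < steps; s++) {
            /* x is already on the device, so present() requests no new transfer */
            #pragma acc parallel loop present(x[0:n])
            for (int i = 0; i < n; i++) {
                x[i] *= factor;
            }
        }
    }
}
```

Without the enclosing data region, each pass through the inner loop would need its own data clauses and would move `x` in both directions every time.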

The next question to investigate is whether the loop’s trip count is large enough. There’s overhead in launching device routines as well as in the data movement, so you need enough computation on the device to offset it. Hence if “cnt” is relatively small, you won’t see much speed-up.
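One way to act on this, sketched here as an assumption rather than something the original code does, is the OpenACC `if` clause, which keeps small loops on the host. The cut-off `OFFLOAD_MIN_TRIP` and the function `axpy` are hypothetical, and a useful threshold has to be measured on the target machine:

```c
/* Hypothetical cut-off: below this trip count the loop runs on the host.
   The value is a placeholder, not a measured recommendation. */
#define OFFLOAD_MIN_TRIP 100000

/* y = a*x + y, offloaded only when n is large enough to amortize
   the kernel launch and the data transfers. */
void axpy(int n, double a, const double *x, double *y)
{
    #pragma acc parallel loop if(n >= OFFLOAD_MIN_TRIP) copyin(x[0:n]) copy(y[0:n])
    for (int i = 0; i < n; i++) {
        y[i] = a * x[i] + y[i];
    }
}
```

Timing the loop for a range of `n` values with and without offload is the simplest way to find where the crossover lies.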

Since you use divides, “pow” and “exp”, you can try adding the compiler option “-ta=tesla:cc35,fastmath” to use less precise but faster versions.

Next, you can look at the register usage by adding the flag “-ta=tesla:ptxinfo”. Using too many registers can lower your occupancy and reduce performance. You can lower the number of registers used via the flag “-ta=tesla:maxregcount:” followed by the register limit. The caveat is that a lower number of registers can lead to spilling. Spilling isn’t bad so long as the spills go just to the cache. It’s when the code spills to global memory that performance is adversely affected. To determine where this tipping point is, you’ll need to use profiling tools such as NVVP.

Register usage is determined by the local variables. While lifetime analysis can find cases where registers can be reused, you may want to reduce and reuse some of the local variables.
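As a rough sketch of cutting down temporaries, the five-term sum in the loop can be folded into a single accumulator using Horner's rule. This is one possible rewrite, not something from the original code: the helper `poly5` is a made-up name, it assumes the coefficients are used in the order `coeff[0]` to `coeff[4]`, and it replaces the `pow` calls with multiplies, so results may differ from the `pow` version in the last bits:

```c
/* Evaluates c[0]*t + c[1]*t^2 + c[2]*t^3 + c[3]*t^4 + c[4]*t^5
   with one accumulator and no pow() calls (Horner's rule). */
#pragma acc routine seq
double poly5(const double c[5], double t)
{
    double acc = c[4];
    for (int k = 3; k >= 0; k--) {
        acc = c[k] + t * acc;   /* fold in the next lower coefficient */
    }
    return t * acc;             /* the polynomial has no constant term */
}
```

Inside the loop this would stand in for the `s5` computation, for example `double s5 = poly5(coeff, s4);`. Whether it actually lowers the register count is something to check with the ptxinfo output.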

You may want to spend some time using the OpenACC loop schedule clauses to see if you can affect performance. However, given this is just a single loop level, it’s unlikely to make much of a difference.
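For a quick experiment with the schedule, a sketch along these lines could be used. The function `square_all` is a stand-in loop made up for illustration, and 128 is just a value to vary, not a recommendation:

```c
/* Stand-in single-level loop with an explicit schedule.
   Try other vector_length values (for example 64, 256, 512) and time each. */
void square_all(int n, const double *in, double *out)
{
    #pragma acc parallel loop gang vector vector_length(128) copyin(in[0:n]) copyout(out[0:n])
    for (int i = 0; i < n; i++) {
        out[i] = in[i] * in[i];
    }
}
```

The same `gang vector vector_length(...)` clauses can be appended to the pragma on the original loop without touching the loop body.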

If you don’t really need double precision, single will be faster. If you do switch to float, also add the flag “-Mfcon” so your floating point constants are treated as float as well (by default they are double), and use “powf” and “expf”.
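To illustrate why the constants matter, here is a small sketch of the first few steps of the loop body in single precision. The function names are made up for illustration. It writes the `f` suffix on each constant, which has the same effect in the source as the compiler flag has globally. An unsuffixed constant silently promotes the whole expression to double:

```c
#include <stddef.h>

/* First steps of the loop body in float: 1 / (1 + 0.2316419 * |f0|).
   Every constant carries the f suffix, so no step is promoted to double. */
float reciprocal_term_f(float f0)
{
    float s0 = f0 < 0.0f ? -f0 : f0;   /* |f0| without a library call */
    float s2 = 0.2316419f * s0;
    return 1.0f / (1.0f + s2);
}

/* Size in bytes of the product's type without and with the suffix. */
size_t product_size_unsuffixed(void)
{
    float s0 = 1.0f;
    return sizeof(0.2316419 * s0);     /* double constant: product is double */
}

size_t product_size_suffixed(void)
{
    float s0 = 1.0f;
    return sizeof(0.2316419f * s0);    /* float constant: product stays float */
}
```

The later steps would follow the same pattern, with `expf` in place of `exp` and `powf` in place of `pow`.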

Hope this helps,
Mat

Thank you very much for the feedback. This is indeed very useful. Do I understand, correctly, that the majority of the optimizations that might be taken on this code have to do with adjusting options through compiler flags, rather than editing the source code?

It seems that there is the obvious consideration of making sure that I can avoid data transfer (not an issue in this case, but well taken in general), and then most of the other options deal with compiler flags, except perhaps for reducing the number of local variables and using loop scheduling, though that would seem to be more for something more complicated than this single loop, yes?

> Do I understand, correctly, that the majority of the optimizations that might be taken on this code have to do with adjusting options through compiler flags, rather than editing the source code?

I believe so.

> using loop scheduling, though that would seem to be more for something more complicated than this single loop, yes?

I tend to agree with this, but don’t want to discourage you from trying a few different vector lengths if it’s a quick experiment.

• Mat