I am trying to understand how well OpenACC code scales with the current PGI tools. To that end, I am benchmarking the following fully parallelizable loop against a vectorized CPU version of the same computation:

```c
#pragma acc parallel loop copyin(d0[0:mz0], coeff[0:5], l1[0:2], l2[0:2]) copyout(r0[0:cnt])
for (i = 0; i < cnt; i++) {
    double f0 = d0[i * m0];
    double s0 = fabs(f0);
    aplint32 s1 = f0 >= 0;      /* aplint32: 32-bit int typedef defined elsewhere */
    double s2 = 0.2316419 * s0;
    double s3 = 1.0 + s2;
    double s4 = 1.0 / s3;
    double s5 = coeff[0] * pow(s4, 1);
    s5 += coeff[1] * pow(s4, 2);
    s5 += coeff[2] * pow(s4, 3);
    s5 += coeff[3] * pow(s4, 4);
    s5 += coeff[4] * pow(s4, 5);
    double s6 = s0 * s0;
    double s7 = s6 / -2.0;
    double s8 = exp(s7);
    double s9 = s8 * s5;
    double s10 = 0.3989422804 * s9;
    aplint32 s11 = l1[s1];
    double s12 = s11 + s10;
    aplint32 s13 = l2[s1];
    double s14 = s13 * s12;
    r0[i] = s14;
}
```

I am trying to understand whether the pragma above is essentially all that is needed to make this code fast, or whether additional tuning is required to get the best possible performance. At the moment I do see some speedup, but not as much as I would expect for highly parallel, purely scalar code.

I also cannot find a good “tuning guide” that explains how to verify that code such as the above is correct and optimal, or which compiler options PGI works best with.
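For reference, this is the kind of invocation I believe is relevant (flag spellings per my reading of the PGI documentation; `acc_test.c` is a placeholder filename):

```shell
# Build with OpenACC code generation enabled; -Minfo=accel asks the
# compiler to report, per loop, the schedule it chose and the data it moves.
pgcc -fast -acc -ta=tesla -Minfo=accel acc_test.c -o acc_test
```

In particular, I am unsure whether the `-Minfo=accel` output alone is enough to tell whether the data clauses and loop schedule above are already optimal.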