Hi,
I have a loop that is successfully vectorized by gcc5, but not by pgcc 17.5. Consequently, the gcc version runs about twice as fast. This is on a 2013 Mac Pro running OS X 10.10.5.
The context in which this occurs is a line-by-line radiative transfer code. There is an outer loop over a catalog of spectral lines having different strengths, and an inner loop over frequency channels that builds up a spectral absorption coefficient array one line profile at a time. There are two versions of this inner loop. The first version just adds the contribution from the current spectral line. The second version also monitors the fractional contribution of the line to every frequency channel of the accumulating spectral absorption coefficient, and increments an integer flag variable whenever that contribution is above a user-defined threshold. After the loop this flag is tested, and if it is zero, the current line strength becomes a threshold below which subsequent catalog lines are skipped.
Here is the code for both versions of the inner loop, selected by an if-else. The type gridsize_t of the loop variable is just a signed int:
    case LINESHAPE_GROSS:
        {
            double gamma = line_data[i].gamma;
            double r0, r1, r2;
            r0 = S * FOUR_ON_PI * gamma;
            r1 = 4. * gamma * gamma;
            r2 = f0 * f0;
            if (zero_tol || pass == 0) {
                gridsize_t j;
                for (j = 0; j < ngrid; ++j) {
                    double r3, r4;
                    r3 = r1 * f2[j];
                    r4 = f2[j] - r2;
                    r4 *= r4;
                    k[j] += r0 / (r4 + r3);
                }
            } else {
                gridsize_t j;
                unsigned int dkflag = 0;
                for (j = 0; j < ngrid; ++j) {
                    double r3, r4;
                    double dk;
                    r3 = r1 * f2[j];
                    r4 = f2[j] - r2;
                    r4 *= r4;
                    dk = r0 / (r4 + r3);
                    k[j] += dk;
                    dkflag += dk > dktol * k[j];
                }
                if (!dkflag)
                    Smin = S * (1. + DBL_EPSILON);
            }
        }
        break;
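In case it helps anyone reproduce this, here is a minimal standalone sketch of the two inner loops pulled out of the switch statement. The function names add_line_simple and add_line_flagged, the argument lists, and the restrict qualifiers are assumptions made for illustration and are not the actual code; restrict is meant to stand in, roughly, for the no-aliasing assertion that -Msafeptr makes on the command line.

```c
#include <assert.h>
#include <stddef.h>

typedef int gridsize_t; /* signed int, as stated above */

/* Version 1 (hypothetical wrapper): plain accumulation of one
 * line profile into the absorption coefficient array k[]. */
void add_line_simple(double *restrict k, const double *restrict f2,
                     gridsize_t ngrid, double r0, double r1, double r2)
{
    gridsize_t j;
    assert(k != NULL && f2 != NULL);
    for (j = 0; j < ngrid; ++j) {
        double r3 = r1 * f2[j];
        double r4 = f2[j] - r2;
        r4 *= r4;
        k[j] += r0 / (r4 + r3);
    }
}

/* Version 2 (hypothetical wrapper): same accumulation, plus a count
 * of the channels where dk exceeds dktol times the updated k[j].
 * The caller only needs to know whether the count is zero. */
unsigned int add_line_flagged(double *restrict k, const double *restrict f2,
                              gridsize_t ngrid, double r0, double r1,
                              double r2, double dktol)
{
    gridsize_t j;
    unsigned int dkflag = 0;
    assert(k != NULL && f2 != NULL);
    for (j = 0; j < ngrid; ++j) {
        double r3 = r1 * f2[j];
        double r4 = f2[j] - r2;
        double dk;
        r4 *= r4;
        dk = r0 / (r4 + r3);
        k[j] += dk;
        dkflag += dk > dktol * k[j]; /* comparison yields 0 or 1 */
    }
    return dkflag;
}
```

Compiling just these two functions with -Minfo -Mneginfo should be enough to see whether the vectorization difference shows up outside the full program.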
Both pgcc and gcc vectorize the “zero_tol” version of the loop (original source line 818), and the code generated by both compilers runs at the same speed. However, pgcc doesn’t vectorize the second version (original source line 828), and unrolls it instead:
    818, Loop not fused: no successor loop
         Generated 3 alternate versions of the loop
         Generated vector simd code for the loop
         Generated 2 prefetch instructions for the loop
         Generated vector simd code for the loop
         Generated 2 prefetch instructions for the loop
         Generated vector simd code for the loop
         Generated 2 prefetch instructions for the loop
         Generated vector simd code for the loop
         Generated 2 prefetch instructions for the loop
    828, Loop not fused: no successor loop
         Unrolled inner loop 4 times
         Generated 2 prefetch instructions for the loop
The resulting code runs nearly 2x slower than gcc’s vectorized version.
Somehow the conditional or the reduction on the scalar dkflag seems to be convincing pgcc that the other code in the loop isn’t worth vectorizing. Right now, I’m using
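One variant that might be worth trying is sketched below. It is only a guess at what could be getting in the way: the loop mixes double arithmetic with an unsigned int reduction, so this version keeps the reduction in a double accumulator so that every value in the loop has the same width. The function name add_line_flagged_dbl, the argument list, and the restrict qualifiers are hypothetical, and there is no claim here that pgcc will actually vectorize it.

```c
typedef int gridsize_t; /* signed int, as stated above */

/* Hypothetical variant of the flagged loop: the count of channels
 * above threshold is accumulated as a double rather than an
 * unsigned int. Returns 1 if any channel had dk > dktol * k[j]
 * (with k[j] already updated), and 0 otherwise. */
unsigned int add_line_flagged_dbl(double *restrict k,
                                  const double *restrict f2,
                                  gridsize_t ngrid, double r0, double r1,
                                  double r2, double dktol)
{
    gridsize_t j;
    double nflag = 0.0;
    for (j = 0; j < ngrid; ++j) {
        double r3 = r1 * f2[j];
        double r4 = f2[j] - r2;
        double dk, kj;
        r4 *= r4;
        dk = r0 / (r4 + r3);
        kj = k[j] + dk;      /* updated coefficient, kept in a local */
        k[j] = kj;
        nflag += (dk > dktol * kj) ? 1.0 : 0.0;
    }
    return nflag > 0.0;
}
```

Since only the zero or nonzero state of the flag is tested after the loop, returning a 0 or 1 result should leave the Smin logic unchanged.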
-fast -Msafeptr -Minfo -Mneginfo
I’ve played with various -Mvect switches to no avail, but I’m new to PGI and really shooting in the dark. Any advice would be greatly appreciated.
Thanks,
Scott Paine