Vectorizing a loop with a reduction on a conditional

spaine · July 7, 2017, 3:13pm

Hi,

I have a loop which is successfully vectorized by gcc5, but not by pgcc 17.5. Consequently, the gcc version runs about twice as fast. This is on a Mac Pro 2103 running OS X 10.10.5.

The context in which this occurs is a line-by-line radiative transfer code. There is an outer loop over a catalog of spectral lines having different strengths, and an inner loop over frequency channels that builds up a spectral absorption coefficient array one line profile at a time. There are two versions of this inner loop. The first version simply adds the contribution from a spectral line as simply as possible. The second version monitors the fractional contribution of a given line to every frequency channel of the accumulating spectral absorption coefficient, and increments an integer flag variable if the contribution is above a user-defined threshold. Outside the loop this flag is tested, and if it is zero, the current line strength becomes a threshold below which subsequent catalog lines are skipped.

Here is the code for both versions of the inner loop, selected by an if-else. The type gridsize_t of the loop variable is just a signed int:

case LINESHAPE_GROSS:
                {
                    double gamma = line_data[i].gamma;
                    double r0, r1, r2;
                    r0 = S * FOUR_ON_PI * gamma;
                    r1 = 4. * gamma * gamma;
                    r2 = f0 * f0;
                    if (zero_tol || pass == 0) {
                        gridsize_t j;
                        for (j = 0; j < ngrid; ++j) {
                            double r3, r4;
                            r3 = r1 * f2[j];
                            r4 = f2[j] - r2;
                            r4 *= r4;
                            k[j] += r0 / (r4 + r3);
                        }
                    } else {
                        gridsize_t j;
                        unsigned int dkflag = 0;
                        for (j = 0; j < ngrid; ++j) {
                            double r3, r4;
                            double dk;
                            r3 = r1 * f2[j];
                            r4 = f2[j] - r2;
                            r4 *= r4;
                            dk = r0 / (r4 + r3);
                            k[j] += dk;
                            dkflag += dk > dktol * k[j];
                        }
                        if (!dkflag)
                            Smin = S * (1. + DBL_EPSILON);
                    }
                }
                break;

Both pgcc and gcc vectorize the “zero_tol” version of the loop (original source line 818), and the code generated by both compilers runs at the same speed. However, pgcc doesn’t vectorize the second version (original source line 828), and unrolls it instead:

    818, Loop not fused: no successor loop
         Generated 3 alternate versions of the loop
         Generated vector simd code for the loop
         Generated 2 prefetch instructions for the loop
         Generated vector simd code for the loop
         Generated 2 prefetch instructions for the loop
         Generated vector simd code for the loop
         Generated 2 prefetch instructions for the loop
         Generated vector simd code for the loop
         Generated 2 prefetch instructions for the loop
    828, Loop not fused: no successor loop
         Unrolled inner loop 4 times
         Generated 2 prefetch instructions for the loop

The resulting code runs nearly 2x slower than gcc’s vectorized version.

Somehow the conditional or the reduction on the scalar dktol seems to be convincing pgcc that the other code in the loop isn’t worth vectorizing. Right now, I’m using

-fast -Msafeptr -Minfo -Mneginfo

I’ve played with various -Mvect switches to no avail, but I’m new to PGI and really shooting in the dark. Any advice would be greatly appreciated.

Thanks,

Scott Paine

tull · July 7, 2017, 4:27pm

Scott,

Thanks for your interest. We can’t determine performance
issues without a complete program. So much of the optimization
is dependent on how data is declared, size of arrays, and what is known
at compile time vs. runtime. Issue could come down to a data type
we had not dealt with yet.

If you need to send programs without public view, you can send them
to trs@pgroup.com.

dave

spaine · July 7, 2017, 5:36pm

Hi Dave,

Many thanks for the quick reply. If you’re able to take a look, the most recent public release (v. 9.2) of the code in question is available at

https://www.cfa.harvard.edu/~spaine/am/

The code snippet I posted was from the source file linesum.c.

I’ve also put a .tgz archive of a snapshot of my current development version (9.3) on our anonymous ftp server at

ftp://cfa-ftp.harvard.edu/pub/spaine/am-9.3.tgz

the Makefile for the dev version includes the targets pgi and pgi-omp which I’ve been using for testing pgcc.

Thanks,

Scott

spaine · July 12, 2017, 8:59pm

Hi all,

Taking a hint from Dave’s comment about possible data type dependence, I tried changing the type of dkflag from unsigned int to double. With this change pgcc then vectorized the loop but gcc no longer did so. The workaround was a compiler-dependent typedef to capture several similar loops in the same source file:

/*
 * In the tolerance-monitored versions of the loops corresponding to
 * each of the lineshapes in linesum_block(), a local flag variable
 * called dkflag is incremented whenever the fractional contribution
 * from the current spectral line to an element k[i] of the accumulating
 * spectral absorption coefficient k[] exceeds a tolerance threshold.
 * This is a scalar sum reduction on a conditional expression within
 * these loops, the presence of which can affect whether the compiler
 * vectorizes the loop.
 *
 * Surprisingly, the type of dkflag has opposite effects on whether
 * these loops are vectorized by GNU gcc and by PGI pgcc, which
 * affects performance by a factor of two on recent (as of 2017)
 * x86 processors.  Here, an appropriate type is defined depending
 * on the compiler.
 */
#if   defined __PGI     /* tested with pgcc version 17.5 */
    typedef double dkflag_t;
#elif defined __GNUC__  /* tested with gcc version 5.4 */
    typedef unsigned int dkflag_t;
#else                   /* default */
    typedef unsigned int dkflag_t;
#endif

The definition of dkflag in the original example and similar loops was then changed accordingly:

        case LINESHAPE_GROSS:
                {
                    double gamma = line_data[i].gamma;
                    double r0, r1, r2;
                    r0 = S * FOUR_ON_PI * gamma;
                    r1 = 4. * gamma * gamma;
                    r2 = f0 * f0;
                    if (zero_tol || pass == 0) {
                        gridsize_t j;
                        for (j = 0; j < ngrid; ++j) {
                            double r3, r4;
                            r3 = r1 * f2[j];
                            r4 = f2[j] - r2;
                            r4 *= r4;
                            k[j] += r0 / (r4 + r3);
                        }
                    } else {
                        gridsize_t j;
                        dkflag_t dkflag = 0;
                        for (j = 0; j < ngrid; ++j) {
                            double r3, r4;
                            double dk;
                            r3 = r1 * f2[j];
                            r4 = f2[j] - r2;
                            r4 *= r4;
                            dk = r0 / (r4 + r3);
                            k[j] += dk;
                            dkflag += dk > dktol * k[j];
                        }
                        if (!dkflag)
                            Smin = S * (1. + DBL_EPSILON);
                    }
                }
                break;

Compiling with either pgcc or gcc now results in nearly identical performance.

Scott

tull · July 13, 2017, 5:22pm

We appreciate the update Scott. We have not addressed your performance issue (summer, holidays, etc), but we are impressed that
you perturbated the program slightly to see how it effected the
compiler results. This is how it’s done.

We have created TPR 24523 to highlight what you have learned, and
to request any other help to make the code run faster.

dave

spaine · March 15, 2018, 7:11pm

Updating this thread, it turns out that later versions of gcc6 and gcc7 could not be coaxed into vectorizing the reduction loop regardless of the type of the reduction variable, so the solution posted earlier no longer worked for gcc, though pgcc was still OK.

Instead, I ended up splitting the loop and using a temporary array. Here are the parts of the new code corresponding to the original example:

            case LINESHAPE_GROSS:
                {
                    double gamma = line_data[i].gamma;
                    double r0, r1, r2;
                    r0 = S * FOUR_ON_PI * gamma;
                    r1 = 4. * gamma * gamma;
                    r2 = f0 * f0;
                    if (dk == NULL || pass == 0) {
                        for (j = 0; j < ngrid; ++j) {
                            double r3, r4;
                            r3 = r1 * f2[j];
                            r4 = f2[j] - r2;
                            r4 *= r4;
                            k[j] += r0 / (r4 + r3);
                        }
                    } else {
                        for (j = 0; j < ngrid; ++j) {
                            double r3, r4;
                            r3 = r1 * f2[j];
                            r4 = f2[j] - r2;
                            r4 *= r4;
                            dk[j] = r0 / (r4 + r3);
                        }
                    }
                }
                break;

Followed by:

            if (dk != NULL && pass != 0) {
                for (j = jmin; j < jmax; ++j)
                    k[j] += dk[j];
                for (j = jmin; j < jmax; ++j)
                    if (dk[j] > dktol * k[j])
                        break;
                if (j >= jmax)
                    Smin = S * (1. + DBL_EPSILON);
            }

In the original tolerance-monitored loop, for each k an increment dk was computed; then dk was accumulated into k; then dk / k was checked against a tolerance. In the new version, these three steps are split into three loops, using a temporary array dk[] to save all the increments computed in the first loop.

The first two loops are now vectorized by both pgcc (2017 and 2018) and gcc (5,6,7). The early exit in the third loop inhibits vectorization, but it’s a relatively small part of the computation, and this scheme was found to be fastest compared with attempts to hand-unroll the loop and/or let it run to completion.

For the least compute-intensive lineshape (Gross) shown here, the extra overhead in the new scheme does cost about 20% more time compared with the earlier single loop plus reduction version, when that version was successfully vectorized. For other more compute-intensive lineshape cases in this code, the overhead is relatively less. And, of course, the 20% price is worth paying to recover the 2x factor gained by vectorization for the compilers that wouldn’t vectorize otherwise. (Not to mention the 4x factor likely to come soon with the next generation of processors.)

sscalpone · January 24, 2023, 6:00am

Hi Scott,

It’s been a while. I believe I correlated the code snippet that you reported to your current code base. The HPC SDK compiler seems to be vectorizing that code now. The C compiler driver is called nvc, but we still have a pgcc alias in our bin so your makefile worked without change!

spaine · January 24, 2023, 4:23pm

Amazing follow-up, thanks! When I last checked, it was gcc that was no longer vectorizing, and driving me to split the loop into three phases. At some point it would be interesting to go back and check, but for now the split version is robust with a fairly small performance hit.

system · February 7, 2023, 4:24pm

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.