Odd results using -O3 optimization and OpenMP

Hi All,

I have installed the latest PGI version (15.5), and I noticed a problem when using an Intel box:

model name : Intel® Xeon® CPU E5-2699 v3 @ 2.30GHz

This compile command

pgf90 -mp -O3 -mcmodel=medium -Mextend -o fred_par fred.for

produces an executable that gives wrong results when run in parallel. If I lower the optimization level to -O1, the program works correctly in parallel.

However, this compile command

pgf90 -mp -O3 -tp=p7 -mcmodel=medium -Mextend -o fred_par fred.for

makes an executable that gives the correct results in parallel mode.

Compiling with -O3 optimization but without the -mp flag, as in

pgf90 -O3 -mcmodel=medium -Mextend -o fred fred.for

seems to give the correct results when run on a single CPU.

There is a parallel loop that performs an initial set-up, and this seems to run the same way regardless of the optimization level or the target architecture. However, on the second parallel loop (where basically everything else happens until the code terminates), things go off the rails with -mp -O3 and no -tp specified.

We also have PGI 15.5 installed on an Opteron system:

model name : AMD Opteron™ Processor 6320

and as far as I can tell, I get good results in parallel using the -O3 flag and no -tp option.

Jerry

Hi Jerry,

I’m guessing that it’s FMA (Fused Multiply-Add) instructions that are causing the difference. FMA provides more precise computation since it does not need to round and store the intermediate product when computing A*B+C. However, it can lead to differences from non-FMA results.

Does adding “-Mnofma” help?

  • Mat

Hi Mat,

I tried this compile command:

pgf90 -mp -O3 -Mnofma -mcmodel=medium -Mextend -o

on the Xeon system and I seem to be getting good results.

Getting back to the issue when FMA is used: I get an error only when I also have the -mp flag. I tried some experiments using the -mp flag but only one thread. In the initial loop, several different models are computed, where the parameters are read from files. The loop simply loops over the models. Using one thread, I noticed that the first model seems to make it through the many subroutines that are needed to get the final answer. However, the wrong results are returned on loop index 2 and beyond. I then switched the input files around so that the first and second models swapped places. With this new ordering, the model at loop index 1 went through OK, but the second one (and those beyond) did not.

So it seems that with the FMA instructions, all of the subroutines function, or are at least capable of returning the correct answer in certain circumstances. Things apparently go wrong when they are moved to the stack.

Finally, on a somewhat related topic, are the FMA instructions faster? What compile flags would you recommend to milk out all of the speed I can, assuming this issue of the FMA with the -mp flags can be figured out?

Jerry

Finally, on a somewhat related topic, are the FMA instructions faster?

In general, yes, FMA is faster. It fuses what would otherwise be separate multiply and add instructions, with an intermediate store, into a single operation. It’s also considered more accurate.

What compile flags would you recommend to milk out all of the speed I can, assuming this issue of the FMA with the -mp flags can be figured out?

We recommend using “-fast -Mfprelaxed -Mipa=fast,inline” as the baseline performance flagset. If your code is numerically sensitive, you may need to remove “-Mfprelaxed”.

Other flags to try depending on your hardware:
-Mvect=simd:256
-Mvect=noaltcode
-Mvect=partial

Unrolling is enabled in “-fast”, but you may want to try adjusting the sizes:

% pgf90 -help -Munroll
-M[no]unroll[=c:<n>|n:<n>|m:<n>]
                    Enable loop unrolling
    c:<n>           Completely unroll loops with loop count n or less
    n:<n>           Unroll single-block loops n times
    m:<n>           Unroll multi-block loops n times
    -Munroll        Unroll loops with a loop count of 1

You can also adjust the number of levels to inline: “-Mipa=fast,inline:<n>”.

Instead of OpenMP, you can try using OpenACC to accelerate your code on an accelerator such as an NVIDIA GPU. Acceleration can significantly speed up highly parallel code with a large amount of computation.

  • Mat

Hi Mat,

Thanks for the suggestions. My understanding is that one cannot put subroutine calls into parallel loops when using OpenACC. If so, this would rule out its use for me at the moment.

I have tried a few of your suggested flags. Here is the /proc/cpuinfo output of my test machine:

model name : Intel® Xeon® CPU E5-1630 v3 @ 3.70GHz

I have 8 total threads.

First compile option:

“-Mextend -mcmodel=medium -fast -Mfprelaxed -Mipa=fast,inline”

The execution time was 2511 seconds.

Adding the OpenMP directives:

“-Mextend -mp -fast -Mfprelaxed -Mipa=fast,inline”

516 seconds on 8 threads. The output is identical to the run above.

Throw in the -Mvect=simd:256 option:

“-fast -Mfprelaxed -Mipa=fast,inline -mcmodel=medium -mp -Mvect=simd:256”

515 seconds on 8 cores.

Leave out the -Mfprelaxed and -Mvect=simd:256 options:

“-Mextend -mp -fast -Mipa=fast,inline”

536 seconds on 8 threads. The output is not identical. The code reads in initial conditions and some observed data. Given the initial conditions, it computes a model and does a chi^2 test with the data. The chi^2 values change by about one millionth of 1%. This is probably too small to worry about.

Here is the case using the flags I normally use:

“-Mextend -O3 -mp -mcmodel=medium -tp=p7”

598 seconds on 8 threads. Again, the output is not identical, but probably not different enough to care.

So I have gained a speed-up of a factor of 1.15, which is not bad.

By the way, if I don’t use optimization, I get an execution time of 811 seconds for 8 threads. So the overall speed increase from no optimization to the optimal optimization is a factor of 1.57.

In all of the cases using OpenMP, the output chi^2 values are very close, but not exactly the same. Is it possible to determine which of these is more “correct”?

Jerry

My understanding is that one cannot put subroutine calls into the parallel loops when using openACC. If so, this would rule out its use for me at the moment.

You can, as of the OpenACC 2.0 standard, by using the “routine” directive.

“-Mextend -O3 -mp -mcmodel=medium -tp=p7”

By targeting a p7, you were not taking full advantage of your Haswell. The reason to target a “p7” is if you need backward compatibility and will be running the resulting binary on an older system. However, if that is the case, you can take advantage of PGI’s Unified Binary code generation, where multiple targets can be combined in a single binary, for example “-tp=p7,haswell-64”.

In all of the cases using openMP, the output chi^2 values are very close, but not exactly the same. Is it possible to determine which of these are more “correct”?

You would need to do the numerical analysis.

  • Mat