Polyhedron benchmark

An updated version of the well-known Polyhedron Fortran compiler benchmark was released a few days ago.
See: http://www.polyhedron.com/compare0.html
Frankly, as a PGI user, I am a bit surprised!

Any comments from the PGI guys?

Michal

Hi Michal,

Thanks for the pointer. I took a look and found that most of the discrepancy between the results is due to the use of auto-parallelization. They only enabled this optimization for the Absoft and Intel compilers, which skews the results. Adding our auto-parallelization flag “-Mconcur=innermost,allcores” will match the performance on Air, GAS_DYN, and RNFLOW. I’ll contact the Polyhedron folks and see about updating the PGI flag set being used.

It does appear that Intel made a huge jump in performance with INDUCT, which we’ll need to take a look at.

  • Mat

The auto-parallelization flag was left out intentionally:

“The settings used for the Intel and Absoft compilers enable autoparallelization. Autoparallelization settings are not used on any other compilers because we found that they produced no significant performance benefits on this benchmark set.”
See: http://www.polyhedron.com/pb05-linux-f90bench_p40.html

I cannot understand this approach by the Polyhedron guys!

Michal

Is the new PGI compiler release v11.0 able to eliminate the huge performance gap on the INDUCT code?

Sorry, no.

  • Mat

OK … release 11.2 is out. What is the current status regarding the Polyhedron Fortran benchmark?

Are there still such huge performance gaps?

Michal

Hi Michal,

Currently we have an application engineer dedicated to investigating general auto-parallel performance. Induct is one of several applications that he is investigating. The process does take time, since our concern is not just a single benchmark but how optimizations affect a wide variety of customer codes.

That said, customer input is important to us in prioritizing tasks. In your opinion, how important is it that PGI be able to auto-parallelize the Induct benchmark?

Thanks,
Mat

Mat,

I am constantly looking for the best compiler suite (C/C++, Fortran, profiler, debugger, etc.), and one of the most important features is that the compiler produces the fastest possible binaries. From this point of view, the PGI Fortran compiler is currently #3 (after Absoft and Intel). I ran several more or less comprehensive benchmarks (Polyhedron is only one of them), and a few of them used my most important Fortran codes. PGI is a robust and reliable compiler, but it does not produce binaries that fully exploit the computing power of recent CPUs.

Actually, the current PGI Fortran compiler generally produces (on Intel CPUs + Linux) binaries that are typically slower than those produced by the Intel or Absoft compilers. And sometimes the performance gap is very significant (30-250%).

Auto-parallelization is an extremely important compiler feature for legacy codes, because there is no reason to expect that the authors will be able to rewrite these codes for current parallel architectures.

Michal

Hi Michal,

I spent some time recreating the posted Polyhedron results. It happens that I have a Dell XPS with an Intel Core i7 920, which is very similar to what Polyhedron used. I got slightly better results than they did, but the difference is within run-to-run variance, especially given the short run time of these benchmarks. As with most codes, PGI will be faster on some while Intel will be faster on others. Overall, when auto-parallelization is also used for PGI, Intel is only ~3% faster. The only outlier is Induct, which we are investigating.

Polyhedron is an important metric of perceived performance. However, the problem we face is how to balance our priorities between this perceived performance and the actual performance of our customer codes. If a benchmark is sped up because of a ‘creative’ optimization that has no effect on all but a very few customer codes, should we spend our engineering time implementing these optimizations? Unfortunately, the answer is yes, because not doing the optimization is detrimental to the perceived performance of the compiler, though not the actual. We just put a lower priority on these types of optimizations.

What does concern me is when you say that PGI is consistently slower than Intel. Is this with your own code, or is it based solely on benchmarks? If it is your own code, would it be possible to send us a representative example so we can better understand where the performance difference occurs?

Note that PGI is typically more conservative with regard to numerical accuracy and keeps within 1 ULP even with “-fast”. Intel’s “-Ofast” flag is roughly equivalent to PGI’s “-fast -Mipa=fast,inline -Mfprelaxed”, where “-Mfprelaxed” will use less precise (up to 3 ULPs off) fp operations. It’s very possible that the performance difference you are seeing is simply due to the optimizations being used.
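
To make the ULP (unit in the last place) figures concrete, here is a minimal Python sketch that measures how far a double-precision result is from a reference value, in ULPs. It is only an illustration of the unit, not a description of how either compiler measures or bounds its accuracy. The helper name `ulp_error` is hypothetical, and `math.ulp` and `math.nextafter` assume Python 3.9 or later.

```python
import math

def ulp_error(approx: float, reference: float) -> float:
    """Distance between approx and reference, in ULPs of the reference."""
    return abs(approx - reference) / math.ulp(reference)

# Build a value exactly three representable doubles above 1.0,
# i.e. a result that is "3 ULPs off" from the reference.
reference = 1.0
three_steps_up = reference
for _ in range(3):
    three_steps_up = math.nextafter(three_steps_up, math.inf)

print(ulp_error(three_steps_up, reference))  # 3.0
```

In practice one would compare the output of a relaxed-precision build against a higher-precision reference result rather than against a constructed value like this one.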

Thanks,
Mat

Apologies for the poor formatting

                958


                  PGI Parallel  Intel Parallel  Difference
    ac                   10.42            9.81      -5.85%
    aermod               16.51           13.96     -15.45%
    air                   3.53            2.83     -19.83%
    capacita             31.17           28.05     -10.01%
    channel               1.52            1.82      19.74%
    doduc                25.66           25.88       0.86%
    fatigue               5.93           11.54      94.60%
    gas_dyn               2.18            2.57      17.89%
    induct               28.07            8.69     -69.04%
    linpk                 6.51            8.13      24.88%
    mdbx                 10.13           10.11      -0.20%
    nf                   10.03            9.91      -1.20%
    protein              37.69           30.85     -18.15%
    rnflow               17.88           18.03       0.84%
    test_fpu              5.16            5.69      10.27%
    tfft                  2.23            2.23       0.00%
    Geo Mean              8.83            8.51      -3.64%

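Times are in seconds. Speed-Up is the serial time divided by the parallel time, minus one. Difference is the Intel time relative to the PGI time, so a positive value means PGI is faster.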

Flags:
PGI Serial: -Bstatic -V -fastsse -Munroll=n:4 -Mipa=fast,inline
PGI Parallel: -Bstatic -V -fastsse -Munroll=n:4 -Mipa=fast,inline -Mconcur=innermost
Intel Parallel: -O3 -fast -parallel -ipo -no-prec-div

OMP_NUM_THREADS set to 4.
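
For anyone who wants to check the percentage and geometric-mean columns, the following Python sketch recomputes them from the times in the tables above. The function and variable names are hypothetical, and the formulas are inferred from the posted numbers rather than taken from the Polyhedron harness.

```python
import math

# Times in seconds, copied from the tables above (same benchmark order).
names = ["ac", "aermod", "air", "capacita", "channel", "doduc", "fatigue",
         "gas_dyn", "induct", "linpk", "mdbx", "nf", "protein", "rnflow",
         "test_fpu", "tfft"]
pgi_serial = [10.18, 16.2, 5.42, 29.72, 2.25, 24.21, 6.11, 3.55,
              27.1, 7.82, 12.31, 11.06, 36.13, 24.18, 6.07, 2.13]
pgi_parallel = [10.42, 16.51, 3.53, 31.17, 1.52, 25.66, 5.93, 2.18,
                28.07, 6.51, 10.13, 10.03, 37.69, 17.88, 5.16, 2.23]
intel_parallel = [9.81, 13.96, 2.83, 28.05, 1.82, 25.88, 11.54, 2.57,
                  8.69, 8.13, 10.11, 9.91, 30.85, 18.03, 5.69, 2.23]

def percent_change(baseline, other):
    """How much larger `other` is than `baseline`, in percent."""
    return (other / baseline - 1.0) * 100.0

def geometric_mean(times):
    """Geometric mean via the average of logarithms."""
    return math.exp(sum(math.log(t) for t in times) / len(times))

for name, serial, par, intel in zip(names, pgi_serial, pgi_parallel, intel_parallel):
    speed_up = percent_change(par, serial)     # serial / parallel - 1
    difference = percent_change(par, intel)    # intel / pgi - 1
    print(f"{name:10s} {speed_up:8.2f}% {difference:8.2f}%")

print("Geo Mean (PGI serial):    ", round(geometric_mean(pgi_serial), 2))
print("Geo Mean (PGI parallel):  ", round(geometric_mean(pgi_parallel), 2))
print("Geo Mean (Intel parallel):", round(geometric_mean(intel_parallel), 2))
```

The geometric mean is used instead of the arithmetic mean so that long-running codes such as protein do not dominate the summary.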

Hi guys!

Did you see the recent update of the Polyhedron benchmark?
http://www.polyhedron.com/compare0.html

Of course, every benchmark is more or less specific, and its results are not valid in general. But I think that PGI should start working on improving code optimization, because otherwise the differences are going to get bigger.

Hi Michal,

Thanks for the link. John Appleyard had let us know late last week that they were refreshing Polyhedron but had not given us his results nor access to the code. I just started looking at the new codes to determine how much of the difference is due to flag selection and will let you know what I find. I also let our application engineering team know about the results, and they will start to take a look as well.

FYI, our apps team determined that we are missing some opportunities for vectorization in INDUCT and our compiler team is currently implementing these recommendations.

Best Regards,
Mat

Are these new opportunities for vectorization implemented in PGI 11.10?

Michal

Hi Michal,

11.10 only contains a few small changes, mostly for CUDA. We’re working on general enhancements to vectorization, which should be available in the 12.0 release.

  • Mat

OK … release 12.1 is out, congratulations!

What about the previously announced vectorization enhancements? Are they already implemented?

Is there any significant impact on the Polyhedron benchmark?

Michal

Hi Michal,

Our compiler engineers decided to prioritise some general optimisation opportunities over the specific ones found in INDUCT and MP_PROP_DESIGN. So while I do show a bit of an increase over 11.10 (~6%), the large deltas still do exist. I’ll keep bringing it up in our weekly performance meetings.

  • Mat

OK, well … After a year, I have again checked the current status of the benchmark comparisons of PGI vs Intel vs GFortran.

I do not understand why the PGI guys are still in their “nirvana”, because the binaries produced by PGI are now systematically slower than the binaries produced by GFortran!!! And, as always, far slower than the binaries produced by the Intel compiler!!!

See: http://polyhedron.com/pb05-lin64-f90bench_SB.html and
http://www.nersc.gov/users/computational-systems/hopper/performance-and-optimization/compiler-comparisons/.

During the last year there was more or less zero progress on the execution speed of PGI binaries!!! On the other hand, the Intel and GFortran compilers are producing faster and faster binaries.

So my final question is:
As a regular commercial user of PGI products, I need to know whether the PGI developers understand this situation and what we can expect in the near future.

Hi Michal,

First, I do appreciate you pushing on Polyhedron performance. We do have analysis and action plans for the discrepancies. However, when prioritizing our compiler performance team’s time, it’s necessary to put Polyhedron lower on the list. Not that Polyhedron isn’t important, but rather we focus first on the performance of HPC community applications, followed by more widely used HPC industry-standard benchmarks such as SPEC OMP2012, SPEC MPI2007, SPEC CPU2006, NPB, etc. Note that in today’s performance meeting, I did bring up your concerns.

The question to you is: how are we doing performance-wise on your codes? If we are not performing well on those, please send us reproducing examples. We would be interested in investigating them.

For a comparison of OpenMP application performance, please see http://www.pgroup.com/benchmark/specomp/pgi.htm

As for the NERSC results, I’m currently in direct contact with the person who ran these comparisons. My internal testing on our cluster (128-core Intel Sandy Bridge) shows PGI on par with or exceeding Intel, and given that Hopper is an AMD system, the NERSC results should be even better. We are currently working with them to understand where the discrepancy occurs. Last week they promised to rerun NPB LU Class E at 128 cores and send me the results so I can compare them to mine. I’ll ping Mike again today to see where he’s at on this.

While it’s too early to conclude, at this point it is my contention that there’s a problem with how the benchmarks are being run at NERSC rather than a problem with the PGI compilers. If I’m correct, we will work to get these results updated. Granted, if it does turn out to be a problem with the PGI compilers, this would be a high-priority item for us.

Best Regards,
Mat