Poor PGI OpenMP performance?

I have a Fortran CFD program parallelized with OpenMP. When it is compiled with Intel Fortran, I get a speedup of almost 10 on two Intel Xeon X5670 CPUs (12 cores in total). When I compile it with PGI (version 11.8), the speedup is less than 5. I use the -O3 option with both compilers. For the sequential program, PGI Fortran is about 20% slower than Intel Fortran. More surprisingly, if I use the PGI -fast option, I do not get the right result with 12 OpenMP threads, although the results are still correct with fewer than 12 threads.
So what is the difference between the Intel and PGI OpenMP implementations? Can anybody give me some advice on how to improve PGI OpenMP performance?

Hi Steve,

I use the -O3 option with both compilers.

Our -O3 is not the same as Intel’s. Currently, -O3 is really the same as -O2. We sometimes use -O3 to introduce new optimizations that might impact numerical accuracy, but these typically get moved into -O2 once they have been vetted. The closest PGI equivalent to Intel’s -O3 is -fast, and this difference in optimization could account for the 20% difference, if not more.

More surprisingly, if I use the PGI -fast option, I do not get the right result with 12 OpenMP threads, although the results are still correct with fewer than 12 threads.

That is odd. While the optimizations that the compiler uses can impact numerical accuracy, even with -fast we stay within 1 ulp of accuracy. Parallel execution can also change accuracy because of the order of operations, but it’s not clear why this would occur only with 12 threads and only with -fast. You’ll need to do some digging.

-fast is an aggregate flag made up of other optimizations. So what you can do is eliminate them one by one to see which optimization is giving you the verification error. The specific flags included in -fast can change from target to target, but here’s a general list:

-O2 -Munroll=c:1 -Mnoframe -Mlre -Mautoinline -Mvect=sse -Mscalarsse -Mcache_align -Mflushz -Mpre
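The one-by-one elimination can be scripted. The following is only a sketch in Python: the flag list is copied from the line above (it can differ on your target), the helper name `leave_one_out` is made up for illustration, and keeping -O2 as a fixed baseline is an assumption. It prints the flag sets to try in place of -fast; it does not run the compiler.

```python
# Flags copied from the general -fast list above; the real list varies by target.
FAST_FLAGS = [
    "-O2", "-Munroll=c:1", "-Mnoframe", "-Mlre", "-Mautoinline",
    "-Mvect=sse", "-Mscalarsse", "-Mcache_align", "-Mflushz", "-Mpre",
]


def leave_one_out(flags, keep=("-O2",)):
    """Yield (dropped_flag, remaining_flags) for every flag not in `keep`."""
    for dropped in flags:
        if dropped in keep:
            continue  # assumption: the baseline optimization level stays on
        yield dropped, [f for f in flags if f != dropped]


if __name__ == "__main__":
    for dropped, remaining in leave_one_out(FAST_FLAGS):
        # Each line is one candidate set of compile flags (with -mp for OpenMP).
        print(f"without {dropped}: -mp {' '.join(remaining)}")
```

If the 12-thread result becomes correct for exactly one of these flag sets, the dropped flag is the first suspect.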

Note that you’ll most likely want to characterize the verification error. If you determine which optimization causes the issue, the next step is to use a binary search method to compile half of your code with the optimization and half without. Continue until you narrow down the particular file that’s causing the error. Next, you can use the “opt” directive/pragma to disable optimization for a particular routine and again use a binary search method to find the particular routine where the verification error occurs.
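The binary search over source files might look like the sketch below. It assumes that exactly one file triggers the error, and `fails_with_optimized` is a hypothetical callback that you would write yourself: it rebuilds with the suspect optimization applied only to the given files, runs the case, and reports whether the verification error appears. The file names are invented.

```python
def find_culprit(files, fails_with_optimized):
    """Return the one file whose optimization triggers the failure.

    `fails_with_optimized(subset)` is a hypothetical callback: build with the
    suspect flag on `subset` only, run, and return True if the error appears.
    Assumes exactly one culprit file.
    """
    candidates = list(files)
    while len(candidates) > 1:
        half = candidates[: len(candidates) // 2]
        if fails_with_optimized(half):
            candidates = half                    # culprit is in the optimized half
        else:
            candidates = candidates[len(half):]  # culprit is in the other half
    return candidates[0]


if __name__ == "__main__":
    sources = ["flux.f90", "grid.f90", "solver.f90", "bc.f90", "io.f90"]
    # Stand-in for a real build-and-run check: pretend solver.f90 is the culprit.
    fake_check = lambda subset: "solver.f90" in subset
    print(find_culprit(sources, fake_check))  # prints solver.f90
```

The same loop works one level down, on the routines of a single file, if the check toggles the “opt” directive instead of the compile flag.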

While this type of issue could be the compiler’s fault, I often see cases where the user’s program has an error that doesn’t get exposed until a particular optimization is used. Out-of-bounds accesses and uninitialized memory are the typical causes.


So what is the difference between the Intel and PGI OpenMP implementations? Can anybody give me some advice on how to improve PGI OpenMP performance?

Without a detailed performance analysis it’s impossible to know exactly what’s going on. My first thought is that it’s the difference in optimization, so the simplest thing is to try different optimizations and see how they affect your code. A more thorough analysis would be to use a profiler to determine where most of the time is spent. Compare the PGI and Intel profiles to determine where the differences occur. Next, use the compiler feedback messages (-Minfo) to determine which optimizations are being applied and, more importantly, which are not. In particular, look for messages about code not vectorizing (-Mvect).


Hope this helps,
Mat

It looks like the Intel compiler is significantly better than the PGI compiler; see for example: https://forums.developer.nvidia.com/t/polyhedron-benchmark/132114/1

Hi Michal,

Yes, Intel is currently faster than PGI on the Polyhedron benchmark. However, this may or may not have any bearing on Steve’s code, and it’s best to do a performance analysis to understand what’s happening.

- Mat

Thanks Mat.
Thanks Michal.
Our CFD program contains nearly 20000 lines of Fortran code. We implemented MPI, OpenMP and CUDA parallel computing in the program. Since PGI is the only Fortran compiler supporting CUDA Fortran, we currently have to use the PGI compiler. Otherwise we would have to write mixed-language code containing CUDA C and Fortran. Unfortunately, I do find that PGI is sometimes slower than Intel when comparing the original MPI and OpenMP parallel implementations of our CFD code.
For OpenMP, the performance gap is even bigger for our code. I am not sure whether we have used appropriate compiler optimization flags. Could you give me some general advice about performance optimization when compiling or writing OpenMP code in PGI Fortran?

thanks all

I would recommend starting with the following compilation flags:

-mp -fast -Mipa=fast,inline

There are a number of things that can affect the runtime performance.

*) Thread to processor core binding. Depending on what type of system you are running
on, and the number of OpenMP threads, the placement of threads on cores can make
a significant difference. This binding can be controlled using the PGI environment
variables, MP_BIND and MP_BLIST.

*) Is the system you are running on a “NUMA” system? Does it have multiple processor
sockets? If so, then you also have to consider the NUMA effect.
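As an illustration of the binding variables, the Python sketch below builds the environment for a run bound to a chosen set of cores. MP_BIND, MP_BLIST and OMP_NUM_THREADS are the variables named in this thread; the helper name `binding_env` and the core numbering (0-5 on one socket, 6-11 on the other) are assumptions, since logical CPU numbering differs between systems and should be checked on the actual node.

```python
def binding_env(core_ids):
    """Return environment settings that bind one OpenMP thread per listed core."""
    return {
        "OMP_NUM_THREADS": str(len(core_ids)),
        "MP_BIND": "yes",
        # MP_BLIST takes a comma-separated list of CPU ids.
        "MP_BLIST": ",".join(str(c) for c in core_ids),
    }


if __name__ == "__main__":
    # Assumed numbering: cores 0-5 on the first socket, 6-11 on the second.
    one_socket = binding_env(range(6))
    two_sockets = binding_env(range(12))
    for env in (one_socket, two_sockets):
        print(" ".join(f"{key}={value}" for key, value in env.items()))
```

The printed settings can be exported in the shell before launching the program; whether filling one socket first or spreading threads across both is faster depends on where the data lives in memory.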

As Mat suggested, without a detailed performance analysis it’s very difficult to know exactly what the cause is.

Thanks toepfer.
Our system is NUMA. Each node contains 2 processor sockets and each socket contains a Xeon X5670.

I just compiled our CFD code with both the Intel and PGI compilers, without thread-to-core binding for either compiler.

When you run on one of these nodes, how many MPI processes do you run with? Do you set the OpenMP environment variable OMP_NUM_THREADS?

I use just one MPI process with 12 OpenMP threads. I use the -O2 flag for PGI and the -O3 flag for Intel. I set OMP_NUM_THREADS to 12 for both Intel and PGI. In the end I find that Intel OpenMP is almost 2 times faster than PGI with 12 threads. For the sequential code, I see only a 30%-50% performance gap between Intel and PGI for our code.

Using just the -O2 flag for PGI does not enable automatic vectorization, whereas using the -O3 flag for the Intel compiler does. A better flag to use for PGI instead of -O2 is -fast. This will enable automatic vectorization as well as other optimizations and more closely matches that of Intel’s -O3 flag.

Thanks toepfer.
But I still cannot see a performance improvement with -fast. Here are some details about our parallel computer and program.
Our parallel computer contains many compute nodes, and each node contains 2 Intel Xeon X5670 processors. The two processors are plugged into 2 different sockets and each has 6 cores, so each node has 12 cores with 48 GB of memory shared between them.
Our program is a real-world numerical application for computational fluid dynamics (CFD), not a benchmark. We are writing MPI, OpenMP and CUDA code to parallelize the program. Since our CFD program is written in Fortran and PGI is currently the only compiler that supports CUDA Fortran, we chose PGI, specifically PGI 11.8, for our work.
Below are some results of a performance comparison between the Intel compiler and PGI 11.8 for our program:
Configuration          | 1 thread | 6 threads | 12 threads
Intel-omp-O3           | 23 s     | 5 s       | 3 s
Pgi-omp-fast           | 28 s     | 10 s      | 8.4 s
Pgi-omp-fast-ipa       | 26.5 s   | 8.5 s     | 6.7 s
pgi-omp-fast-ipa-bound | NA       | 6.8 s     | NA

Compiler flags:
Intel-omp-O3: -O3 -openmp
Pgi-omp-fast: -fast -mp
pgi-omp-fast-ipa: -fast -mp -Mipa=fast,inline
I ran the program on one compute node with one MPI process and 1, 6 and 12 threads. pgi-omp-fast-ipa-bound means we set MP_BIND and MP_BLIST for PGI. We do find that binding threads to processor cores can improve OpenMP performance with PGI, but even in this case PGI is still nearly 40% slower than Intel. What surprises us is that the performance can get worse if we set MP_BIND to “yes” when the number of OpenMP threads is 12.

The performance gap between PGI and Intel for the sequential program is not too large, roughly 10% to 20%. That is not too bad for our program. But I have no idea why the OpenMP performance gap between the two compilers is so large: Intel is more than 2 times faster than PGI with 12 threads, and Intel also shows good parallel scalability.
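To make the scalability comparison concrete, the Python sketch below computes speedup (1-thread time divided by N-thread time) and parallel efficiency (speedup divided by N) from the timings in the table above. The numbers are copied from the table; the function name `scaling` is just for illustration, and the bound configuration is left out because it has no 1-thread baseline.

```python
# Wall-clock times in seconds, copied from the table above.
timings = {
    "Intel-omp-O3":     {1: 23.0, 6: 5.0, 12: 3.0},
    "Pgi-omp-fast":     {1: 28.0, 6: 10.0, 12: 8.4},
    "Pgi-omp-fast-ipa": {1: 26.5, 6: 8.5, 12: 6.7},
}


def scaling(times):
    """Map thread count -> (speedup, efficiency) relative to the 1-thread run."""
    base = times[1]
    return {n: (base / t, base / t / n) for n, t in times.items()}


for name, times in timings.items():
    for n, (speedup, eff) in scaling(times).items():
        print(f"{name:18s} {n:2d} threads: speedup {speedup:5.2f}, efficiency {eff:4.0%}")
```

With 12 threads this gives a speedup of about 7.7 for Intel against about 3.3 for Pgi-omp-fast and about 4.0 for Pgi-omp-fast-ipa, so most of the gap comes from scaling rather than from single-thread speed.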

OK, toepfer. I compiled the same source code with the two compilers and different compiler flags, and got the results above. I just cannot explain the abnormal performance gap between the two OpenMP implementations. Could you or anyone else help me?

Hi Steve,

I think Craig has taken it as far as we can online. If you can send us the code (trs@pgroup.com) then I’d be happy to spend an hour or two investigating the performance. You may also want to try profiling your code to get a better idea of where the performance differences occur. You can then use the compiler feedback messages (-Minfo) to see what optimizations are being applied to those sections of code. Pay particular attention to any messages that show places where the compiler attempted but failed to optimize a section of code.

- Mat

Thanks Mat.
I cannot send you our source code, but I can send you the PGI compiler output produced by -Minfo. Please check your mail. I hope you can find some useful information in this output. Are there any other profiling tools from PGI that can provide additional performance information? I am not sure whether the -Minfo output gives you enough information.
BTW, I can improve PGI OpenMP performance on a single CPU socket. I set MP_BIND=Y and MP_BLIST=5,4,3,2,1,0 since our Intel Xeon X5670 has six cores. It does have an effect on OpenMP performance. But how can I bind threads to processor cores when there are 2 CPUs in 2 different sockets?
I do not know how to set MP_BLIST in that case.

Hi Mat,
I have just sent you a code that reproduces this OpenMP performance problem; please check your mail.
I hope you can give me some advice. Thank you.

Thanks Steve. I see that Customer Service forwarded your code to Craig for further investigation. Craig is very good at diagnosing OpenMP performance issues (much better than me) so you’ll be in good hands.

- Mat

Hi, Mat
Do you or your colleague have any update on my question?
Thanks
Steve

No, sorry.

I spoke too soon. Craig sent me the following:

Hi Mat,

I have an update for this customer request. The issue with the code
that the customer shows is how we handle nested task regions up to
and including the 12.6 release. In this case, some piece of code goes
parallel, then ends up generating ONE task within this parallel region and
this task is registered in an OMP single region. See code below. The only thing
this task does is call a subroutine called ‘test’. So what happens is, a
single OMP thread registers the task. Then an OMP thread ends up
executing the task, which is just a call to subroutine test. Within this
subroutine, this OMP thread encounters more task regions. Since these
are nested, we choose to execute them immediately and thus end up
running them serially.

I tested this with the soon to be released 12.8 compiler, and we now support
nested task regions with a few exceptions. So my recommendation would be
for the customer to download 12.8 and try running their code again. If it still
does not show any significant speedup when run with more than one OMP
thread, we will need to get more details on the exact code pieces where they
are using task regions to see if they may be falling into one of the exceptional
cases.

-craig