Recommended flags for 64-bit Xeon?

Can anyone recommend a good set of optimisation / machine type flags for pgcc under linux on a 64-bit Xeon? Looking through the compiler flag list, I’m getting a reasonable flops benchmark from “-tp p7-64 -O4”, where p7-64 is the specific arch I’m using. Any improvement on that is appreciated - there are a lot of flags to try, and I suspect flag ordering may affect the results, too!

Hi,

In general, the flag set that gives the best performance for C and Fortran with release 5.2 is “-fastsse -Mipa=fast,inline”. C++ tends to do better with “-fastsse --no_exceptions -Minline=level:10”. Note that “-tp p7-64” is on by default when you compile on a 64-bit Xeon system. “-fastsse” combines all of the most common optimizations under a single flag.

“-fastsse” == “-O2 -Munroll=c:1 -Mnoframe -Mlre -Mscalarsse -Mvect=sse -Mcache_align -Mflushz”

You can also then add “-O3” (-O4 is really -O3), which may or may not help. Also try prefetching with “-Mprefetch” and “-Mvect=prefetch”.
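
Since there are several combinations to try, a small loop can save some typing. This is only a sketch: the source file name bench.c and the PGCC/SRC variables are made up for illustration, and it assumes the benchmark prints its own timing (wrap the run in "time" otherwise). It prints each compile command and only builds and runs when the compiler and source file are actually found.

```shell
#!/bin/sh
# Sketch: build one source file with several flag sets and run each binary.
# Assumptions (not from the thread): the source is ./bench.c and pgcc is on PATH.
PGCC=${PGCC:-pgcc}
SRC=${SRC:-bench.c}

# One flag set per line; every flag here is one suggested in this thread.
flag_sets() {
cat <<'EOF'
-fastsse -Mipa=fast,inline
-fastsse -Mipa=fast,inline -O3
-fastsse -Mipa=fast,inline -Mprefetch
-fastsse -Mipa=fast,inline -Mvect=prefetch
EOF
}

# Print the compile-and-link command for build number $1 using flags $2.
build_cmd() {
    echo "$PGCC $2 -o bench_$1 $SRC"
}

n=0
flag_sets | while IFS= read -r flags; do
    n=$((n + 1))
    cmd=$(build_cmd "$n" "$flags")
    echo "$cmd"
    # Only build and run when the compiler and the source are present.
    if command -v "$PGCC" >/dev/null 2>&1 && [ -f "$SRC" ]; then
        $cmd && "./bench_$n"
    fi
done
```

Each flag set gets its own binary (bench_1, bench_2, ...) so the results can be re-run and compared afterwards.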

Note, if you’re using a “stream” benchmark, aggressive optimization doesn’t help since the benchmark is memory bound. The best flags for stream are “-O2 -Mvect=sse -Mnontemporal -Munsafe_par_align”. Note that “-Mnontemporal” can hurt general applications and “-Munsafe_par_align” is called unsafe for a reason.

Another thing to try when tuning for performance is the profiler, pgprof. Compile and link your code with “-Mprof=lines”, run the program, and view the results with pgprof. This will give you a better understanding of where your code takes the most time and where you should focus your tuning efforts.
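
As a rough sketch, the profiling steps might look like this. The file names bench.c and bench_prof are hypothetical, and it assumes the profile data ends up in a file called pgprof.out in the current directory; check your release's documentation if pgprof does not find it.

```shell
#!/bin/sh
# Sketch of the profiling workflow described above.
# Assumptions: source is ./bench.c (hypothetical) and the run writes pgprof.out.
PGCC=${PGCC:-pgcc}
SRC=${SRC:-bench.c}

# Compile-and-link command with line-level profiling enabled.
prof_build_cmd() {
    echo "$PGCC -fastsse -Mprof=lines -o bench_prof $SRC"
}

prof_build_cmd
if command -v "$PGCC" >/dev/null 2>&1 && [ -f "$SRC" ]; then
    # Build with instrumentation, run to collect data, then view the results.
    $(prof_build_cmd) && ./bench_prof && pgprof pgprof.out
fi
```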

The bottom line is to start with “-fastsse -Mipa=fast,inline”. Does this perform well enough? If so, great, you’re done. Otherwise, find the parts of your code that aren’t performing well, determine why, then determine whether other compiler options will work better or whether some code rewriting will help.

Flag order does matter in some cases. In general, the last flag overrides any earlier conflicting flags. So “-fastsse -O3” is different from “-O3 -fastsse”: -fastsse implies -O2, so in the second ordering that -O2 overrides the earlier -O3. The exception to this is “-Mvect”, which adds suboptions together, so “-Mvect=sse -Mvect=prefetch” == “-Mvect=sse,prefetch”. If you wanted fastsse and only the prefetch option, you would need to add nosse, as in “-fastsse -Mvect=nosse,prefetch”, since -Mvect=sse is part of fastsse.
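
To make the ordering rule concrete, here is a toy model of it in shell. It is not how the compiler driver actually parses its arguments; it only tracks the -O level, treats -fastsse as an implied -O2, and lets the last one win. The function name effective_opt is made up for this example.

```shell
#!/bin/sh
# Toy model of "the last conflicting flag wins", for -O levels only.
# This illustrates the rule; it is not the real driver logic.
effective_opt() {
    level=""
    for flag in "$@"; do
        case $flag in
            -fastsse) level=-O2 ;;    # -fastsse implies -O2
            -O[0-9])  level=$flag ;;  # an explicit level replaces the earlier one
        esac
    done
    echo "$level"
}

effective_opt -fastsse -O3   # prints -O3: the explicit -O3 comes last
effective_opt -O3 -fastsse   # prints -O2: the -O2 inside -fastsse comes last
```

In other words, if you want -O3 on top of -fastsse, put -O3 after it on the command line.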

Hope this wasn’t too long winded. Let me know if you need anything clarified.

-Mat

To be perfectly clear, ‘-tp p7-64’ is the default on a 64-bit Xeon system if the 64-bit compiler is on your path. When you install the PGI compiler suite, you can install both 32-bit and 64-bit compilers; the 64-bit compilers will go in
/usr/pgi/linux86-64/5.2/{bin,include,lib,…}
and the 32-bit compilers go in
/usr/pgi/linux86/5.2/{bin,include,lib,…}
This assumes you installed in /usr/pgi (your prefix may be different).
If you put /usr/pgi/linux86-64/5.2/bin on your path, you get the 64-bit compilers by default; if you put /usr/pgi/linux86/5.2/bin on your path, you get the 32-bit compilers by default. The -tp switch will override the default, of course.
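
For example, to pick up the 64-bit compilers in a Bourne-style shell, something like the following should do. The PGI_PREFIX variable is just a convenience invented for this sketch; it defaults to the /usr/pgi prefix used above.

```shell
#!/bin/sh
# Put the 64-bit PGI 5.2 compilers first on PATH.
# Assumption: PGI is installed under /usr/pgi; set PGI_PREFIX if yours differs.
PGI_PREFIX=${PGI_PREFIX:-/usr/pgi}
PATH="$PGI_PREFIX/linux86-64/5.2/bin:$PATH"
export PATH

# Show which pgcc the shell will now find, if any.
command -v pgcc || echo "pgcc not found on PATH"
```

Use the linux86 directory instead of linux86-64 to get the 32-bit compilers by default.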

Thanks for the prompt reply folks!

I’ve been trying out various suggested options, and I’m not seeing a performance improvement (and in some cases I’m understandably seeing performance degradation). The best results I’m generally seeing are when I don’t use any compiler flags at all!

FTR, I’m using Al Aburto’s flops benchmark. A fairly simple benchmark, but I’ve analysed it under other architectures, so I know it’s possible to optimise it to take advantage of superscalar architectures.

It seems odd that I’m seeing little performance difference with any optimisation flag. I ran a vanilla compile with the -# option, and it doesn’t seem to be auto-selecting optimisation flags. Adding -Minfo shows that I’m getting at least loop-unrolling, but without any apparent benefit!

I’ll try it out on a ‘real’ piece of code before my trial licence expires, but I wondered if anyone had any suggestions for troubleshooting?

Thanks,
Mike.

I’d need to study FLOPS further to understand what’s going on. I tried both gcc and icc 8.1, and got the same result: optimization didn’t help. However, when I compiled it with cc on a Sun I saw improvement with optimization. Maybe it’s an architectural issue?

-Mat