longer execution time in PGCC 6.0.5 than PGCC 5.2.4

Hi,

I compiled a benchmark program
http://w3cic.riken.go.jp/HPC/HimenoBMT/Load_module/cc_himenoBMTxp_l.lzh
with pgcc 6.0.5 and pgcc 5.2.4 on a machine with two AMD Opteron 250 CPUs and ran it:

% pgcc -fastsse -Mconcur -DLARGE himenombmtxps.c

The benchmark reports that it runs at 1364 Mflops when built with pgcc 6.0.5
and at 1653 Mflops with pgcc 5.2.4, which is about 20% faster. If I use only a single CPU,

% pgcc -fastsse -DLARGE himenombmtxps.c

both versions run at about 1160 Mflops, with only a few percent difference.

I tried several compiler options described in the User's Guide, but I could not
get the benchmark compiled with pgcc 6.0.5 to run as fast as the one built with 5.2.4.

Do you know why this benchmark program, compiled with -Mconcur by pgcc 6.0.5,
is significantly slower than when compiled by pgcc 5.2.4?

Hi Shingo,

Thank you for the report. I was able to recreate the issue here and isolate the problem. With the 6.0 compilers we added an optimization which better recognizes idioms. Although this optimization helps most codes, in your case it causes the loop at line 223 to no longer parallelize, since the loop now contains a call generated by the “memcopy” idiom. (The compiler won't parallelize loops with function calls.)
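To illustrate the general pattern (this is only a hypothetical sketch, not the actual loop at line 223 of himenombmtxps.c): a plain element-by-element copy loop like the one below can be recognized as the memcopy idiom and replaced with a call to memcpy(), and a loop that contains a function call will not be auto-parallelized by -Mconcur.

/* Hypothetical example of a copy loop that idiom recognition may
   rewrite as a memcpy() call, which in turn prevents -Mconcur from
   parallelizing an enclosing loop. */
void copy_plane(float *dst, const float *src, int n)
{
    int i;
    for (i = 0; i < n; i++)
        dst[i] = src[i];
}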

As part of our current work on auto-parallelization, we have addressed this problem and will have a fix in the 6.1 release. For now, however, you can add the xflag “-Mx,8,0x8000000” to the compilation to remove the idiom. With the xflag, I see the MFlops increase from 1413 to 2235. Xflags can change from release to release, so you should only use this workaround with the 6.0 compilers and this particular benchmark.
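For example, using your original compile line the workaround would be:

% pgcc -fastsse -Mconcur -DLARGE -Mx,8,0x8000000 himenombmtxps.c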

FYI, to determine which loops are and are not parallelized, add the flags “-Minfo -Mneginfo=concur” when using “-Mconcur”.
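For example, for this benchmark:

% pgcc -fastsse -Mconcur -Minfo -Mneginfo=concur -DLARGE himenombmtxps.c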

Thanks,
Mat

Thank you for the quick fix.
Another observation for the current PGCC 6.0.5 with the -Mconcur option is that,
without your suggested xflag -Mx,8,0x8000000,

% pgcc -Mconcur -DLARGE himenombmtxps.c

runs faster by 10 % than

% pgcc -fastsse -Mconcur -DLARGE himenombmtxps.c

for the same benchmark program. The -fastsse option does not always help; it sometimes seems to slow down the execution.

Shingo

Hi Shingo,

It appears that cache alignment (-Mcache_align) is causing the problem. Try compiling with “-fast -Mvect=sse”, which is essentially -fastsse without -Mcache_align (per the component list below, it also omits -Mscalarsse and -Mflushz).
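For example, keeping the rest of your compile line:

% pgcc -fast -Mvect=sse -Mconcur -DLARGE himenombmtxps.c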

“-fastsse” is an aggregate flag composed of the optimizations that help most codes. In some cases, however, a particular optimization can hurt performance. If you notice such a case, try breaking the aggregate flag into its components to determine which optimizations help and which hurt. To get the component list, use the “-help” flag along with the aggregate flag. Note that the specific component flags can change from release to release.

Example:

pgcc -help -fastsse
Reading rcfile /usr/pgi/linux86-64/6.0/bin/.pgccrc
-fastsse            == -fast -Mvect=sse -Mscalarsse -Mcache_align -Mflushz
-fast               Common optimizations: -O2 -Munroll=c:1 -Mnoframe -Mlre
-M[no]vect[=[no]altcode|[no]assoc|cachesize:<c>|[no]idiom|levels:<n>|nosizelimit|prefetch|[no]recog|smallvect:<n>|[no]sse|[no]transform]
                    Control automatic vector pipelining
    [no]assoc       Allow [disallow] reassociation
    cachesize:<c>   Optimize for cache size c
    [no]idiom       Enable [disable] idiom recognition
    prefetch        Generate prefetch instructions
    [no]sse         Generate [don't generate] SSE instructions
-M[no]scalarsse     Generate scalar sse code with xmm registers; implies -Mflushz
-Mcache_align       Align long objects on cache-line boundaries
-M[no]flushz        Set SSE to flush-to-zero mode
Mat