The benchmark program reports 1364 Mflops when compiled with pgcc 6.0.5
and 1653 Mflops with pgcc 5.2.4, which is about 20% faster. If I use only a single CPU,
% pgcc -fastsse -DLARGE himenombmtxps.c
both run at about 1160 Mflops, within a few percent of each other.
I tested several compiler options described in the User's Guide, but I could not
get the benchmark compiled by pgcc 6.0.5 to run as fast as the one compiled by 5.2.4.
Do you know why this benchmark program, when compiled with -Mconcur by pgcc 6.0.6,
is significantly slower than when compiled by pgcc 5.2.4?
Thank you for the report. I was able to recreate the issue here and isolate the problem. With the 6.0 compilers we added an optimization that better recognizes idioms. Although this optimization helps most codes, in your case it causes the loop at line 223 to no longer parallelize, since the loop now contains a call to the “memcopy” idiom. (The compiler won't parallelize loops that contain function calls.)
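To illustrate the effect described above, here is a minimal sketch in C. It is not the benchmark's actual code: the function names and the 2D layout are hypothetical and chosen only for illustration. The first function copies a grid element by element. The second shows roughly what the same loop nest looks like once the inner copy loop has been recognized as an idiom and replaced by a library copy call. The outer loop then contains a function call, which is the condition that blocks auto-parallelization here.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical example: copy a rows x cols grid element by element.
   The inner loop is a contiguous copy, the kind of pattern an idiom
   recognizer may replace with a library copy call. */
void copy_rows_loop(float *dst, const float *src, size_t rows, size_t cols)
{
    size_t i, j;
    for (i = 0; i < rows; i++) {
        for (j = 0; j < cols; j++) {
            dst[i * cols + j] = src[i * cols + j];
        }
    }
}

/* Roughly the shape of the loop nest after the inner loop has been
   rewritten as a copy call (an assumption for illustration; the code
   the compiler actually generates may differ). The outer loop body is
   now a function call. */
void copy_rows_memcpy(float *dst, const float *src, size_t rows, size_t cols)
{
    size_t i;
    for (i = 0; i < rows; i++) {
        memcpy(&dst[i * cols], &src[i * cols], cols * sizeof(float));
    }
}
```

Both functions produce the same result. The difference is only in what the parallelizer sees in the outer loop body.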
As part of our current work on auto-parallelization, we have addressed this problem and will have a fix in the 6.1 release. For now, however, you can add the xflag “-Mx,8,0x8000000” to the compilation to remove the idiom. With the xflag, I see the MFlops increase from 1413 to 2235. Xflags can change from release to release, so you should only use this workaround with the 6.0 compilers and this particular benchmark.
FYI, to determine which loops are and are not parallelized, add the flags “-Minfo -Mneginfo=concur” when using “-Mconcur”.
It appears that cache alignment (-Mcache_align) is causing the problem. Try compiling with “-fast -Mvect=sse” which is -fastsse without -Mcache_align.
“-fastsse” is an aggregate flag composed of the optimizations that help most codes. In some cases, however, certain optimizations can hurt performance. If you notice such a case, try breaking up the aggregate flag into its components to determine which optimizations help and which hurt. To get the component list, use the “-help” flag along with the aggregate flag. Note that the specific component flags can change from release to release.
Example:
% pgcc -help -fastsse
Reading rcfile /usr/pgi/linux86-64/6.0/bin/.pgccrc
-fastsse == -fast -Mvect=sse -Mscalarsse -Mcache_align -Mflushz
-fast Common optimizations: -O2 -Munroll=c:1 -Mnoframe -Mlre
-M[no]vect[=[no]altcode|[no]assoc|cachesize:<c>|[no]idiom|levels:<n>|nosizelimit|prefetch|[no]recog|smallvect:<n>|[no]sse|[no]transform]
Control automatic vector pipelining
[no]assoc Allow [disallow] reassociation
cachesize:<c> Optimize for cache size c
[no]idiom Enable [disable] idiom recognition
prefetch Generate prefetch instructions
[no]sse Generate [don't generate] SSE instructions
-M[no]scalarsse Generate scalar sse code with xmm registers; implies -Mflushz
-Mcache_align Align long objects on cache-line boundaries
-M[no]flushz Set SSE to flush-to-zero mode