compiler generating a "scenic route"

I have been doing some basic testing with the 12.4 pgfortran compiler on a Windows 7 64-bit box (compile options = -Mextend -Mr8 -O2 -fast), including some comparisons with gfortran. For a matrix multiply comparison, pgi was a bit over 5 times faster, a decent result. However, for another case, the result was vastly worse.

This code is down to its minimalist configuration. The kindest description would be to call the result “pathetic”, as it was 15 times slower than gfortran. And if you wrap it in pgcollect, it takes another 30 times longer to run, being 460 times slower than gfortran. I have no idea what the compiler is doing, but it is certainly taking the long road to get there.

I did some testing and have determined that the main source of the problem is the reshape function. pgi appears to totally get lost in there somewhere, though when it finally comes out, the answer is correct. A minor factor is the use of temporary files, which appears to add about a factor of 2 to the run. (Note that this test has tiny arrays and a good chunk of the time is setup time.)

So until and unless this slow performance is resolved, there are 3 things to remember. (1) do NOT use the reshape function. (2) try to avoid temporary arrays. (3) if you ignore item 1, do not ever even think about using pgcollect.
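For readers following along, here is a minimal sketch of the kind of pattern described above. It is an illustration only, not the original test code: the program name, the array names, the iteration count, and the checksum are all made up. It runs RESHAPE on a tiny array inside a large loop, then does the same work with explicit element assignment, and times each loop with cpu_time.

```fortran
program reshape_in_loop
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: niter = 1000000      ! hypothetical outer loop count
  real(dp) :: v(4), a(2,2), b(2,2)
  real(dp) :: sum_reshape, sum_direct
  real :: t0, t1, t2
  integer :: i

  sum_reshape = 0.0_dp
  sum_direct  = 0.0_dp

  ! Variant 1: RESHAPE of a 4-element vector into a 2x2 array, every iteration
  call cpu_time(t0)
  do i = 1, niter
     v = real(i, dp) * (/ 1.0_dp, 2.0_dp, 3.0_dp, 4.0_dp /)
     a = reshape(v, (/ 2, 2 /))
     sum_reshape = sum_reshape + a(2,1) + a(1,2)
  end do
  call cpu_time(t1)

  ! Variant 2: same result via explicit element assignment (column-major order)
  do i = 1, niter
     v = real(i, dp) * (/ 1.0_dp, 2.0_dp, 3.0_dp, 4.0_dp /)
     b(1,1) = v(1)
     b(2,1) = v(2)
     b(1,2) = v(3)
     b(2,2) = v(4)
     sum_direct = sum_direct + b(2,1) + b(1,2)
  end do
  call cpu_time(t2)

  ! Both variants do the same arithmetic in the same order
  if (sum_reshape /= sum_direct) stop 1

  print *, 'reshape loop (s): ', t1 - t0
  print *, 'direct loop  (s): ', t2 - t1
  print *, 'checksum:         ', sum_reshape
end program reshape_in_loop
```

How large the gap between the two loops is, if any, depends on the compiler, version, and options; an optimizer may also simplify a toy loop like this more than it would real code.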

Do you have any thoughts as to why this poor performance might be occurring? I know how to avoid these particular items, but one always wonders if the cause of this issue is lurking elsewhere as well.

Thanks.

-alan

Hi Alan,

This is a known performance issue that we’ve been working on correcting for some time (TPR#18097). As you note, the issue is that we always create a temp array to store the results of RESHAPE. This is correct behavior; however, the performance can be quite poor, especially if the arrays are small and RESHAPE is called many times.

In 12.5, we will add a new optimization which will eliminate the need for a temp array when the source and shape are present and the source is contiguous. The caveat is that this optimization does not cover all cases, such as if order or pad are used, so it may or may not help the performance of your code. If you can, please send us a reproducing example (trs@pgroup.com) so we can either confirm that this fixes your issue or add a new problem report.
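To make the distinction concrete, the sketch below shows the different forms of RESHAPE mentioned above. The program and variable names are hypothetical, and the comments about which forms qualify only restate the description in this post; they are an assumption about the compiler's behavior, not something the code can verify.

```fortran
program reshape_forms
  implicit none
  real :: src(6), wide(12), m(2,3)
  integer :: i

  src  = (/ (real(i), i = 1, 6) /)
  wide = (/ (real(i), i = 1, 12) /)

  ! (a) Contiguous source, SHAPE only: the case described as covered in 12.5
  m = reshape(src, (/ 2, 3 /))
  print *, 'shape only:  ', m

  ! (b) ORDER present: described as not covered
  !     order=(/2,1/) makes the second subscript vary fastest (row-by-row fill)
  m = reshape(src, (/ 2, 3 /), order=(/ 2, 1 /))
  print *, 'with order:  ', m

  ! (c) PAD present: described as not covered
  !     the 4-element source is extended with pad values to fill 6 elements
  m = reshape(src(1:4), (/ 2, 3 /), pad=(/ 0.0 /))
  print *, 'with pad:    ', m

  ! (d) Non-contiguous (strided) source: does not meet the contiguity condition
  m = reshape(wide(1:11:2), (/ 2, 3 /))
  print *, 'strided src: ', m
end program reshape_forms
```

All four forms are standard Fortran and give correct results either way; the only question is whether the compiler still needs a temporary array for them.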

As for pgcollect, I’m not sure what’s happening there. It’s non-intrusive so shouldn’t impact the overall performance of the code. Having a reproducing example would help here as well.

Best Regards,
Mat

Mat,

OK, it looks like my “simple” test case has evoked the absolute worst combination for performance under your compiler: running arrays of 2x2 and 2x1 with a very large outer iteration loop. Well, test cases are supposed to be stressing…

I will ship you a copy of the stripped-down code using self-contained data.

Thanks for the info.

-alan

Hi Alan,

Yes, your code exhibits this pathological case. I tested with our pre-release version of 12.5, and your test is now over 2x faster when compiled with PGI 12.5 than with gfortran (0.92 seconds versus 2.5 seconds) and over 40x faster than with PGI 12.4 (42 seconds).

- Mat