tvmet, blitz large functions and inline

Hello,

I am currently evaluating pgCC. I am encountering problems however.

  1. Blitz does not compile:

“…/blitz/mathfunc.h”, line 2410: error: no instance of overloaded function
“std::sin” matches the argument list
argument types are: (std::complex)
{ return BZ_CMATHFN_SCOPE(sin)((complex )x); }

(Replace std::sin with std::sqrt, std::sinh, etc. to get the full error list.) Is it a Blitz or Portland problem?

  2. I have a project with large math functions. Those functions are exported with CForm from Mathematica and range from 300 to over 8000 lines of C code. The functions look like this:

double fun(const tvmet::Vector<double, 4>& x, const tvmet::Vector<double, 4>& xo)
{
    // 300-8000 lines of math
    return result;
}

For some reason I get really bad performance from pgCC. First I get lots of warnings

PGCC-W-0278-Can’t inline …tvmet25vector

Then I get 0.34 seconds for 40000 function evaluations on my Athlon XP. g++ gives me 0.04 seconds with no warnings. Here are my Portland optimization flags:

-fast -tp k7 -fastsse -Mipa=fast,inline -Minfo -Mnoframe -Minline -O3 -Minline=levels:100 --no_exceptions

and here are my g++ flags
-O3 -ffast-math -fomit-frame-pointer -march=athlon-xp -pipe

Please advise on how to use the portland compiler efficiently.

Thanks,

Paul

Hi Paul,

Question 1: Compiling Blitz

The error you’re seeing stems from an incompatibility we had on some older Linux systems. To work around this, modify the “/usr/pgi/linux86//include/CC/stl/_site_config.h” file by commenting out the following line:

#define _STLP_NO_LONG_DOUBLE 1

to

//#define _STLP_NO_LONG_DOUBLE 1

Note that this is only needed on 32-bit systems. Also, it may cause other issues on some Linux distributions. One of our engineers is looking into this and hopes to have a fix in a future patch release.

With this change you’ll now get a different error: a multiply defined reference to “pow”. To fix this, change your “blitz/config.h” include file to undefine “BZ_MATH_FN_IN_NAMESPACE_STD”. I was able to successfully compile and run the example tests once I made these changes.
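
As a rough sketch of that second edit (the exact form of the original line in your generated “blitz/config.h” is an assumption here and may differ between Blitz versions):

```cpp
// blitz/config.h (excerpt)
// Original line as written by configure (assumed form):
//   #define BZ_MATH_FN_IN_NAMESPACE_STD
// Replace it with an explicit #undef rather than only commenting it out:
#undef BZ_MATH_FN_IN_NAMESPACE_STD
```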

Question 2:

Let’s try simplifying your compilation flags. The Athlon XP does not have SSE2, so you should use just “-fast”. (Note that -fast is part of -fastsse, so you don’t need both.) “-Mnoframe” is part of “-fast”, so it is not needed. You should use either IPA inlining or -Minline, but not both. Also, 10 levels of inlining is all you’ll need; 100 levels will take an extremely long time to compile. So this leaves

-fast -O3 -tp k7 --no_exceptions -Minline=levels:10

Let me know if this helps your compilation time and allows more functions to be inlined.

Thanks,
Mat

Hi,

thanks for your fast answer.

Question 1)

I changed the _site_config.h file. I’m now getting the multiply defined reference to “pow”. However, undefining “BZ_MATH_FN_IN_NAMESPACE_STD” in blitz/config.h doesn’t get rid of the errors.

Question 2)
With the new flags there is still no inlining happening, which is strange since I am only using operator () from tvmet. The second thing that doesn’t get inlined comes from a macro

#define Power(x, y)	(mypow<y>(static_cast<double>(x)))

where mypow is of the following form

template<int order>
inline	double mypow(double arg);

template<>
inline	double mypow<2>(double arg)
{
	return (arg * arg);
}

template<>
inline double mypow<3>(double arg)
{
	return (arg * arg * arg);
}

etc.

(The powers are guaranteed to be integers.)


I very much appreciate your help,

Paul

Hi Paul,

Q1) What version of the compiler and which OS are you running? I’m on SuSE 9.0 using the PGI 6.0 compilers. It could be that a different OS will need a different “fix”.

Just to double check, you changed the “#define” to “#undef”, rather than just commenting out the statement?

Q2) I took your example code and made this small test program:
x.cpp:

#include <iostream>

template<int order>
inline   double mypow(double arg);

template<>
inline double mypow<2>(double arg)
{
   return (arg * arg);
}

template<>
inline double mypow<3>(double arg)
{
   return (arg * arg * arg);
}

#define Power(x, y)   (mypow<y>(static_cast<double>(x)))

int main () {
  double res[10];
  for (int i=1; i <= 10; ++i) {
    res[i-1] = Power(i,2);
  }
  for (int i=0; i < 10; ++i) {
    std::cout << i+1 << "^2=" << res[i] << "\n";
  }
}

Compiled with your flags (I also added “-Mkeepasm -Manno” so we could view the assembly file):

sagebrush:/tmp% pgCC -fast -O3 -tp k7 --no_exceptions -Minline=levels:10 -Mkeepasm -Manno -Minfo x.cpp
main:
    22, Loop unrolled 4 times

Looking at the assembly we can see that “mypow” is getting inlined:

#   for (int i=1; i <= 10; ++i) {
#     res[i-1] = Power(i,2);
#   }
.LB808:
# lineno: 22

        movl    %ebx,-96(%ebp)
        fildl   -96(%ebp)
        fmul    %st(0),%st
        movl    -92(%ebp),%eax
        fstpl   -24(%eax)
        movl    %ebx,%edx
        incl    %edx
        movl    %edx,-100(%ebp)

This means that something else is inhibiting the inlining or I’m not correctly using your example. Is it possible to get the full source?

Thanks,
Mat

Hi Mat,

Q1) I’m running Gentoo Linux and the newest version of PGI (6.01 ?), which I downloaded last Sunday. I did #undef the statement, not just comment it out, yet I’m getting the same error about the multiply defined pow’s.

Q2) I can send you the full source. I hope a kdevelop project is convenient. Where can I send it to?

Thanks again for your time,

Paul

Hi Paul,

Unfortunately, I don’t have Gentoo installed here since it’s not one of our supported systems. I’ll send you my config.h file, perhaps the configure script made some other changes as well.

I’ll email you where you can send the code and I’ll see what I can determine!

Thanks,
Mat

This turned out to be a very interesting problem. Paul’s code contains a single expression that is ~600 lines long and has several hundred inlined functions. The sheer volume of inlined functions in the expression causes the code size to grow past the maximum size allowed by the inliner. The inliner has a hard limit on code size because, if left unchecked, the compilation time for most codes can grow exponentially.

I’ve filed a technical problem report and asked our compiler team to review his code, since it illustrates one area in which we can improve our C++ performance. As a workaround, I modified his code by assigning several of the redundant function calls to local variables and using the local variables in place of the function calls in the expression. This reduced the size of the code enough for the remaining functions in the expression to be inlined, and will hopefully get the performance where it should be.

Thanks!
Mat

Hi Mat,

thanks for your help. I will implement your hint with the locals.

Cheers,

Paul