poor performance of pgi compared with gcc for complex

Hi,

I am surprised by the big performance difference for a rather simple code using multiplication of complex doubles, which makes me wonder whether I am using the right PGI compiler flags.
I am aware that the implementation of complex multiplication has more to it than what you would do on paper, i.e. it typically includes some checks for NaNs etc., but I am still surprised by the size of the difference between gcc and pgi. Do you have any explanation for my observation?
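
(For context, by "more to it" I mean roughly the following. This is only an illustrative sketch of an Annex-G-style complex multiply, not the actual libstdc++ or PGI implementation, and the helper name mul_sketch is made up:)

#include <cmath>
#include <complex>

// Illustrative sketch only: the textbook product plus the extra
// NaN/Inf handling that library implementations typically add.
static std::complex<double> mul_sketch(std::complex<double> lhs,
                                       std::complex<double> rhs)
{
  const double a = lhs.real(), b = lhs.imag();
  const double c = rhs.real(), d = rhs.imag();
  double x = a*c - b*d;   // textbook real part
  double y = a*d + b*c;   // textbook imaginary part
  if (std::isnan(x) && std::isnan(y)) {
    // Slow path: try to recover a correct infinity when one of the
    // operands contains an Inf (recovery steps omitted for brevity).
  }
  return std::complex<double>(x, y);
}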

#include <chrono>
#include <complex>
#include <iostream>

int main(){
  const int Nx = 10000;
  const int Ny = 10000;
  std::complex<double>** A = new std::complex<double>*[Nx];
  std::complex<double>** B = new std::complex<double>*[Nx];
  double** a = new double*[Nx];
  double** b = new double*[Nx];
  double** c = new double*[Nx];
  double** d = new double*[Nx];
  for(int i=0;i<Nx;++i){
    A[i] = new std::complex<double>[Ny];
    B[i] = new std::complex<double>[Ny];
    a[i] = new double[Ny];
    b[i] = new double[Ny];
    c[i] = new double[Ny];
    d[i] = new double[Ny];
    for(int j=0;j<Ny;++j){
      A[i][j] = std::complex<double>(1.,2.);
      B[i][j] = std::complex<double>(3.,4.);
      a[i][j] = 1.;
      b[i][j] = 2.;
      c[i][j] = 3.;
      d[i][j] = 4.;
    }
  }

  std::complex<double> sum(0.);
  auto start = std::chrono::system_clock::now();
  for(int i=0; i<Nx; ++i){
    for(int j=0; j<Ny; ++j){
      sum += A[i][j]*B[i][j];
    }
  }
  auto end = std::chrono::system_clock::now();
  std::chrono::duration<double> elapsed_seconds = end-start;
  std::cout << "Result: (" << std::real(sum) << "," << std::imag(sum) << ")\n";
  std::cout << "Wall clock time(standard): " << elapsed_seconds.count() << "s\n";

  sum = 0.;
  start = std::chrono::system_clock::now();
  for(int i=0; i<Nx; ++i){
    for(int j=0; j<Ny; ++j){
      sum += std::complex<double>(a[i][j]*c[i][j]-b[i][j]*d[i][j],
                                  b[i][j]*c[i][j]+a[i][j]*d[i][j]);
    }
  }
  end = std::chrono::system_clock::now();
  elapsed_seconds = end-start;
  std::cout << "Result: (" << std::real(sum) << "," << std::imag(sum) << ")\n";
  std::cout << "Wall clock time(explicit): " << elapsed_seconds.count() << "s\n";


  for(int i=0;i<Nx;++i){
    delete [] A[i];
    delete [] B[i];
    delete [] a[i];
    delete [] b[i];
    delete [] c[i];
    delete [] d[i];
  }
  delete [] A;
  delete [] B;
  delete [] a;
  delete [] b;
  delete [] c;
  delete [] d;

  return 0;
}

I use the following compiler command lines:

"/opt/pgi/linux86-64/17.7/bin/pgc++" -O3 -fastsse --c++11 [-tp=px] -Minform=warn -o pgi_vs_gcc_pgc++.exe pgi_vs_gcc.C
"/depot/gcc-5.2.0/bin/g++" -std=c++11 -msse -mfpmath=sse -O3 -o pgi_vs_gcc_g++.exe pgi_vs_gcc.C

and get the following timings:

$pgi_vs_gcc_pgc++.exe                                                                    
Result: (-5e+08,1e+09)
Wall clock time(standard): 1.16136s
Result: (-5e+08,1e+09)
Wall clock time(explicit): 0.311359s
$pgi_vs_gcc_g++.exe                                                                       
Result: (-5e+08,1e+09)
Wall clock time(standard): 0.436128s
Result: (-5e+08,1e+09)
Wall clock time(explicit): 0.19047s

It also seems that going from no -O level to -O3 makes a 10x difference for gcc (actually -O1 is already sufficient), whereas for pgi I don't see any difference with or without an -O level.
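
(As a side note: on gcc the extra NaN/Inf handling in complex multiplication can be dropped with -fcx-limited-range, which -ffast-math also implies, so one way to see how much of the "standard" loop's cost comes from that path would be a build like the line below; the output name is just an example.)

"/depot/gcc-5.2.0/bin/g++" -std=c++11 -msse -mfpmath=sse -O3 -fcx-limited-range -o pgi_vs_gcc_g++_limited.exe pgi_vs_gcc.C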

Thanks,
LS

I ran this on a system with g++ 5.2.1 and found, like you, about
a 3X performance difference between the two compilers.

I have filed TPR 24922 to look into this performance opportunity.

thanks,
dave