Hi,
I am surprised by the large performance difference for a rather simple piece of code that multiplies complex doubles, which makes me wonder whether I am using the right PGI compiler flags.
I am aware that the implementation of complex multiplication has more to it than the textbook formula, i.e. it typically includes checks for NaNs etc., but I am still surprised by the difference between gcc and pgi. Do you have any explanation for my observation?
#include <chrono>
#include <complex>
#include <iostream>

int main(){
    const int Nx = 10000;
    const int Ny = 10000;
    std::complex<double>** A = new std::complex<double>*[Nx];
    std::complex<double>** B = new std::complex<double>*[Nx];
    double** a = new double*[Nx];
    double** b = new double*[Nx];
    double** c = new double*[Nx];
    double** d = new double*[Nx];
    for(int i=0;i<Nx;++i){
        A[i] = new std::complex<double>[Ny];
        B[i] = new std::complex<double>[Ny];
        a[i] = new double[Ny];
        b[i] = new double[Ny];
        c[i] = new double[Ny];
        d[i] = new double[Ny];
        for(int j=0;j<Ny;++j){
            A[i][j] = std::complex<double>(1.,2.);
            B[i][j] = std::complex<double>(3.,4.);
            a[i][j] = 1.;
            b[i][j] = 2.;
            c[i][j] = 3.;
            d[i][j] = 4.;
        }
    }
    std::complex<double> sum(0.);
    auto start = std::chrono::system_clock::now();
    for(int i=0; i<Nx; ++i){
        for(int j=0; j<Ny; ++j){
            sum += A[i][j]*B[i][j];
        }
    }
    auto end = std::chrono::system_clock::now();
    std::chrono::duration<double> elapsed_seconds = end-start;
    std::cout << "Result: (" << std::real(sum) << "," << std::imag(sum) << ")\n";
    std::cout << "Wall clock time(standard): " << elapsed_seconds.count() << "s\n";
    sum = 0.;
    start = std::chrono::system_clock::now();
    for(int i=0; i<Nx; ++i){
        for(int j=0; j<Ny; ++j){
            sum += std::complex<double>(a[i][j]*c[i][j]-b[i][j]*d[i][j],
                                        b[i][j]*c[i][j]+a[i][j]*d[i][j]);
        }
    }
    end = std::chrono::system_clock::now();
    elapsed_seconds = end-start;
    std::cout << "Result: (" << std::real(sum) << "," << std::imag(sum) << ")\n";
    std::cout << "Wall clock time(explicit): " << elapsed_seconds.count() << "s\n";
    for(int i=0;i<Nx;++i){
        delete [] A[i];
        delete [] B[i];
        delete [] a[i];
        delete [] b[i];
        delete [] c[i];
        delete [] d[i];
    }
    delete [] A;
    delete [] B;
    delete [] a;
    delete [] b;
    delete [] c;
    delete [] d;
    return 0;
}
I use the following compiler command lines:
"/opt/pgi/linux86-64/17.7/bin/pgc++" -O3 -fastsse --c++11 [-tp=px] -Minform=warn -o pgi_vs_gcc_pgc++.exe pgi_vs_gcc.C
"/depot/gcc-5.2.0/bin/g++" -std=c++11 -msse -mfpmath=sse -O3 -o pgi_vs_gcc_g++.exe pgi_vs_gcc.C
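I have not yet tried relaxing the Annex G semantics; something like the following variants (as I read the respective compiler docs; not yet measured) should show whether the NaN/Inf checks are the culprit:

```shell
# gcc: -fcx-limited-range drops the NaN/Inf recovery in complex multiply/divide
"/depot/gcc-5.2.0/bin/g++" -std=c++11 -msse -mfpmath=sse -O3 -fcx-limited-range -o pgi_vs_gcc_g++_lr.exe pgi_vs_gcc.C
# pgi: -Mfprelaxed is, per the docs, the relaxed floating-point switch
"/opt/pgi/linux86-64/17.7/bin/pgc++" -O3 -fastsse -Mfprelaxed --c++11 -Minform=warn -o pgi_vs_gcc_pgc++_rx.exe pgi_vs_gcc.C
```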
and get the following timings:
$ ./pgi_vs_gcc_pgc++.exe
Result: (-5e+08,1e+09)
Wall clock time(standard): 1.16136s
Result: (-5e+08,1e+09)
Wall clock time(explicit): 0.311359s
$ ./pgi_vs_gcc_g++.exe
Result: (-5e+08,1e+09)
Wall clock time(standard): 0.436128s
Result: (-5e+08,1e+09)
Wall clock time(explicit): 0.19047s
It also seems that compiling without any -O level vs. -O3 makes a 10x difference for gcc (actually -O1 is already sufficient), whereas for pgi I don't see any difference with or without an -O level.
Thanks,
LS