I'm having a problem on a Debian 6 cluster with Open MPI and the PGI compilers.
Fortran-based code ends up producing NaN results (first reported by an end user). We reproduced this with the LU benchmark from the NAS Parallel Benchmarks (NPB3.2-MPI), which gave the following output in the verification step:
Verification being performed for class A
Accuracy setting for epsilon = 0.1000000000000E-07
Comparison of RMS-norms of residual
1 NaN 0.7790210760669E+03 NaN
2 NaN 0.6340276525969E+02 NaN
3 NaN 0.1949924972729E+03 NaN
4 NaN 0.1784530116042E+03 NaN
5 NaN 0.1838476034946E+04 NaN
Comparison of RMS-norms of solution error
FAILURE: 1 NaN 0.2996408568547E+02 NaN
FAILURE: 2 NaN 0.2819457636500E+01 NaN
FAILURE: 3 NaN 0.7347341269877E+01 NaN
FAILURE: 4 NaN 0.6713922568778E+01 NaN
FAILURE: 5 NaN 0.7071531568839E+02 NaN
Comparison of surface integral
FAILURE: NaN 0.2603092560489E+02 NaN
If I build with GCC instead of the PGI compilers, the test runs and passes verification.
Open MPI 1.6.4 was built with CC=pgcc, CXX=pgCC, F77=pgf77, F90=pgf90, and CFLAGS="-tp=piledriver-64 -O3"; FFLAGS, CXXFLAGS, and FCFLAGS were set to the same value as CFLAGS. I also tried building without the -tp option and with -O2 instead of -O3. Neither change helped.
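For reference, the settings above correspond roughly to the following environment, set before running Open MPI's configure script. This is a reconstruction from the description, not a copy of the actual build script, and the configure options themselves (install prefix and so on) are left out.

```shell
# Compiler selection as described above (reconstructed, not the original script)
export CC=pgcc
export CXX=pgCC
export F77=pgf77
export F90=pgf90

# Same flags for the C, C++, and both Fortran compilers
export CFLAGS="-tp=piledriver-64 -O3"
export CXXFLAGS="$CFLAGS"
export FFLAGS="$CFLAGS"
export FCFLAGS="$CFLAGS"

# configure is then run in the Open MPI 1.6.4 source tree (options omitted here)
```

The variant builds differed only in the CFLAGS value (no -tp option, -O2 instead of -O3), with the other three flag variables following it.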
What is going on here? What additional info should I provide to help diagnose this?