invalid fortran results - NaN - openmpi 1.6.4 & pgi13.6

Hi all,

I’m having a problem on a Debian cluster with openmpi and PGI compilers. The cluster has Debian 6 installed on it.

Fortran-based code ends up giving NaN results (reported by an end user). We used the LU test from the NAS tests (NPB3.2-MPI) to confirm this behavior and got this output in the verification step:

Verification being performed for class A
Accuracy setting for epsilon = 0.1000000000000E-07
Comparison of RMS-norms of residual
1 NaN 0.7790210760669E+03 NaN
2 NaN 0.6340276525969E+02 NaN
3 NaN 0.1949924972729E+03 NaN
4 NaN 0.1784530116042E+03 NaN
5 NaN 0.1838476034946E+04 NaN
Comparison of RMS-norms of solution error
FAILURE: 1 NaN 0.2996408568547E+02 NaN
FAILURE: 2 NaN 0.2819457636500E+01 NaN
FAILURE: 3 NaN 0.7347341269877E+01 NaN
FAILURE: 4 NaN 0.6713922568778E+01 NaN
FAILURE: 5 NaN 0.7071531568839E+02 NaN
Comparison of surface integral
FAILURE: NaN 0.2603092560489E+02 NaN
Verification failed


If I build with gcc instead of pgi it works and validates.

Openmpi 1.6.4 was built with CC=pgcc, CXX=pgCC, F77=pgf77, F90=pgf90, CFLAGS="-tp=piledriver-64 -O3", and FFLAGS, CXXFLAGS, and FCFLAGS set the same as CFLAGS. I also tried without specifying -tp=piledriver and using -O2 instead of -O3. It did not help.

What is going on here? What additional info should I provide to help diagnose this?

Thanks,
Rick

Hi Rick,

Hmm, I just ran LU with CLASS A using NPB3.3-MPI last week without issue. Granted, I was using a different setup than you. I’m out of time for today, but I’ll run it again tomorrow on a Piledriver system using OpenMPI and see if I can recreate your error.

If you could try compiling at “-O0”, I’d appreciate it. Also, what compiler version are you using, how many MPI processes are you using, and did you set the number of processes at build time?

Thanks,
Mat

Hi,

I’m using the latest 13.6 version of the compilers. The system originally had 13.2 but I updated to 13.6 to try resolving the problem before posting this.

Recompiling MPI and LU using -O0 didn’t help anything.

I then tried using the OpenMP version of the NAS tests instead of the MPI version. I had the same NaN results:

microway@master:~/NPB3.2/NPB3.2-OMP$ make LU CLASS=A

= NAS PARALLEL BENCHMARKS 3.2 =
= OpenMP Versions =
= F77/C =

cd LU; make CLASS=A
make[1]: Entering directory `/home/microway/NPB3.2/NPB3.2-OMP/LU'
make[2]: Entering directory `/home/microway/NPB3.2/NPB3.2-OMP/sys'
cc -o setparams setparams.c
make[2]: Leaving directory `/home/microway/NPB3.2/NPB3.2-OMP/sys'
../sys/setparams lu A
pgf77 -c -O0 lu.f
pgf77 -c -O0 read_input.f
pgf77 -c -O0 domain.f
pgf77 -c -O0 setcoeff.f
pgf77 -c -O0 setbv.f
pgf77 -c -O0 exact.f
pgf77 -c -O0 setiv.f
pgf77 -c -O0 erhs.f
pgf77 -c -O0 ssor.f
pgf77 -c -O0 rhs.f
pgf77 -c -O0 l2norm.f
pgf77 -c -O0 jacld.f
pgf77 -c -O0 blts.f
pgf77 -c -O0 jacu.f
pgf77 -c -O0 buts.f
pgf77 -c -O0 error.f
pgf77 -c -O0 pintgr.f
pgf77 -c -O0 verify.f
cd ../common; pgf77 -c -O0 print_results.f
cd ../common; pgf77 -c -O0 timers.f
cd ../common; pgcc -c -O -o wtime.o ../common/wtime.c
pgf77 -O -o ../bin/lu.A lu.o read_input.o domain.o setcoeff.o setbv.o exact.o setiv.o erhs.o ssor.o rhs.o l2norm.o jacld.o blts.o jacu.o buts.o error.o pintgr.o verify.o ../common/print_results.o ../common/timers.o ../common/wtime.o
make[1]: Leaving directory `/home/microway/NPB3.2/NPB3.2-OMP/LU'
microway@master:~/NPB3.2/NPB3.2-OMP$ cd bin/
microway@master:~/NPB3.2/NPB3.2-OMP/bin$ ./lu.A


NAS Parallel Benchmarks (NPB3.2-OMP) - LU Benchmark

Size: 64x 64x 64
Iterations: 250

Time step 1
Time step 20
Time step 40
Time step 60
Time step 80
Time step 100
Time step 120
Time step 140
Time step 160
Time step 180
Time step 200
Time step 220
Time step 240
Time step 250

Verification being performed for class A
Accuracy setting for epsilon = 0.1000000000000E-07
Comparison of RMS-norms of residual
FAILURE: 1 NaN 0.7790210760669E+03 NaN
FAILURE: 2 NaN 0.6340276525969E+02 NaN
FAILURE: 3 NaN 0.1949924972729E+03 NaN
FAILURE: 4 NaN 0.1784530116042E+03 NaN
FAILURE: 5 NaN 0.1838476034946E+04 NaN
Comparison of RMS-norms of solution error
FAILURE: 1 NaN 0.2996408568547E+02 NaN
FAILURE: 2 NaN 0.2819457636500E+01 NaN
FAILURE: 3 NaN 0.7347341269877E+01 NaN
FAILURE: 4 NaN 0.6713922568778E+01 NaN
FAILURE: 5 NaN 0.7071531568839E+02 NaN
Comparison of surface integral
FAILURE: NaN 0.2603092560489E+02 NaN
Verification failed

This is a simpler case to reproduce, I think.

Hi Rick,

This is going to be a tough one. I’ve tried my best but I can’t get it to fail. I went back to NPB 3.2 and used PGI pgf77 v13.6 on a Piledriver system, but it works fine. The fact that it fails with no optimization leads me to believe that something other than a compiler issue is going on. What? I’m not sure.

Have you made any changes to the source? Can you try running the same experiment on a different system?

  • Mat
piledriver:/tmp/qa/NPB3.2/NPB3.2-OMP% make CLASS=A LU
   ============================================
   =      NAS PARALLEL BENCHMARKS 3.2         =
   =      OpenMP Versions                     =
   =      F77/C                               =
   ============================================

cd LU; make CLASS=A
make[1]: Entering directory `/opt/tmp/qa/NPB3.2/NPB3.2-OMP/LU'
make[2]: Entering directory `/opt/tmp/qa/NPB3.2/NPB3.2-OMP/sys'
cc  -o setparams setparams.c
make[2]: Leaving directory `/opt/tmp/qa/NPB3.2/NPB3.2-OMP/sys'
../sys/setparams lu A
pgf77 -c  -O0 lu.f
pgf77 -c  -O0 read_input.f
pgf77 -c  -O0 domain.f
pgf77 -c  -O0 setcoeff.f
pgf77 -c  -O0 setbv.f
pgf77 -c  -O0 exact.f
pgf77 -c  -O0 setiv.f
pgf77 -c  -O0 erhs.f
pgf77 -c  -O0 ssor.f
pgf77 -c  -O0 rhs.f
pgf77 -c  -O0 l2norm.f
pgf77 -c  -O0 jacld.f
pgf77 -c  -O0 blts.f
pgf77 -c  -O0 jacu.f
pgf77 -c  -O0 buts.f
pgf77 -c  -O0 error.f
pgf77 -c  -O0 pintgr.f
pgf77 -c  -O0 verify.f
cd ../common; pgf77 -c  -O0 print_results.f
cd ../common; pgf77 -c  -O0 timers.f
cd ../common; pgcc  -c  -O  -o wtime.o ../common/wtime.c
pgf77 -O -o ../bin/lu.A lu.o read_input.o domain.o setcoeff.o setbv.o exact.o setiv.o erhs.o ssor.o rhs.o l2norm.o jacld.o blts.o jacu.o buts.o error.o pintgr.o verify.o ../common/print_results.o ../common/timers.o ../common/wtime.o
make[1]: Leaving directory `/opt/tmp/qa/NPB3.2/NPB3.2-OMP/LU'
piledriver:/tmp/qa/NPB3.2/NPB3.2-OMP% bin/lu.A


 NAS Parallel Benchmarks (NPB3.2-OMP) - LU Benchmark

 Size:  64x 64x 64
 Iterations:                    250

 Time step    1
 Time step   20
 Time step   40
 Time step   60
 Time step   80
 Time step  100
 Time step  120
 Time step  140
 Time step  160
 Time step  180
 Time step  200
 Time step  220
 Time step  240
 Time step  250

 Verification being performed for class A
 Accuracy setting for epsilon =  0.1000000000000E-07
 Comparison of RMS-norms of residual
           1   0.7790210760669E+03 0.7790210760669E+03 0.5837420383828E-15
           2   0.6340276525969E+02 0.6340276525969E+02 0.2801702468535E-14
           3   0.1949924972729E+03 0.1949924972729E+03 0.1166063713339E-14
           4   0.1784530116042E+03 0.1784530116042E+03 0.1274137507679E-14
           5   0.1838476034946E+04 0.1838476034946E+04 0.4947003303197E-15
 Comparison of RMS-norms of solution error
           1   0.2996408568547E+02 0.2996408568547E+02 0.0000000000000E+00
           2   0.2819457636500E+01 0.2819457636500E+01 0.1575087364679E-15
           3   0.7347341269877E+01 0.7347341269877E+01 0.3626529871458E-15
           4   0.6713922568778E+01 0.6713922568778E+01 0.1322890472152E-15
           5   0.7071531568839E+02 0.7071531568839E+02 0.2009586548100E-15
 Comparison of surface integral
               0.2603092560489E+02 0.2603092560489E+02 0.1364804975715E-15
 Verification Successful


 LU Benchmark Completed.
 Class           =                        A
 Size            =             64x  64x  64
 Iterations      =                      250
 Time in seconds =                   136.03
 Total threads   =                        1
 Avail threads   =                        1
 Mop/s total     =                   876.96
 Mop/s/thread    =                   876.96
 Operation type  =           floating point
 Verification    =               SUCCESSFUL
 Version         =                      3.2
 Compile date    =              14 Jun 2013

 Compile options:
    F77          = pgf77
    FLINK        = $(F77)
    F_LIB        = (none)
    F_INC        = (none)
    FFLAGS       = -O0
    FLINKFLAGS   = -O
    RAND         = (none)


 Please send all errors/feedbacks to:

 NPB Development Team
 npb@nas.nasa.gov

Hi Mat,

I just finished setting up another Opteron system with Debian 6 (squeeze) and PGI 13.6. It has the same NaN results on the NAS LU OMP test.

Would you like me to provide remote login access to this system for you to check?

Thanks,
Rick

Would you like me to provide remote login access to this system for you to check?

Sure, if it’s easy to do. Though it’s probably something I won’t be able to get to until next week. Let me send you an email and we can coordinate.

  • Mat

OK, Thanks.

My email is rick at microway dot com.

Thanks,
Rick

Thanks for logging in and identifying the problem, Mat.

The issue was with FMA instructions and a buggy assembler version on Debian 6. Based on Mat’s suggestion, I resolved the problem by updating the system to binutils 2.22-8, recompiled from Debian 7’s source package.
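
For anyone who lands here with the same symptoms, a rough way to check a system is sketched below. It lists the FMA-related CPU flags the kernel reports and compares the installed assembler version against 2.22. Some assumptions: version_at_least is a hypothetical helper written for this post, it relies on GNU sort's -V option, and 2.22 is only the version that fixed things in this thread, not a confirmed cutoff for the assembler bug.

```shell
#!/bin/sh
# Sketch only: version_at_least is a hypothetical helper, not part of any package.
# It succeeds when version $1 is greater than or equal to version $2.
# Relies on GNU sort's -V (version sort) option.
version_at_least() {
    lowest=$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)
    [ "$lowest" = "$2" ]
}

# Show which FMA-related CPU flags the kernel reports (Linux only).
if [ -r /proc/cpuinfo ]; then
    grep -o -w -E 'fma4?' /proc/cpuinfo | sort -u || true
fi

# Compare the installed assembler against 2.22, the binutils version that
# resolved the problem in this thread (assumed threshold, not a confirmed cutoff).
if command -v as >/dev/null 2>&1; then
    # Assumes the version number is the last field of the first output line.
    as_version=$(as --version 2>/dev/null | head -n 1 | awk '{print $NF}')
    if version_at_least "$as_version" 2.22; then
        echo "assembler $as_version: at or above 2.22"
    else
        echo "assembler $as_version: older than 2.22, consider updating binutils"
    fi
fi
```

If the CPU reports fma or fma4 and the assembler is older, rebuilding the failing test after a binutils update is a cheap experiment before suspecting the compiler.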