NaN's in code that worked with Intel compiler

Jon_Slavin · March 29, 2011, 6:56pm

To all:

I’m new to using pgfortran. Until recently I’ve been using the Intel Fortran compiler, but I am porting the code to run on our cluster, which doesn’t have that compiler. The code I’m compiling has been working perfectly on a very similar system (Linux CentOS 64-bit), but when I compile with pgfortran and run it, I get NaN’s.

I found a posting about detecting NaN’s and have used the code listed there to find when a certain variable is NaN. The oddest thing is that when I print the values of the two other variables that are summed to give the value of that variable, each of them is an ordinary, reasonable number. That makes me think that there must be some sort of memory error. Any advice on what could be the cause of this and how to diagnose the problem would be appreciated. As far as I can tell pgfortran doesn’t offer much in the way of runtime checking, but if I’m mistaken about that, I’d like to know how to enable it.

Thanks,
Jon

MatColgrove · March 29, 2011, 8:15pm

Hi Jon,

The flag “-Ktrap” will detect at runtime IEEE trap conditions such as divide by zero, underflow, overflow, etc., that may be the cause of the NaN. (Please see “pgfortran -help -Ktrap” or the PGI User’s Guide for the sub-options.)
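
As a rough illustration of why trapping at the source helps (written in Python rather than Fortran purely for brevity, and not tied to any PGI behavior): without trapping, an overflow quietly becomes infinity, a later invalid operation on it becomes NaN, and the NaN then propagates through everything downstream, so the place where you first notice it may be far from where it originated.

```python
import math

# Overflow: the product exceeds the largest double and silently becomes +inf.
big = 1e308 * 10.0

# Invalid operation: inf - inf has no meaningful value, so the result is NaN.
bad = big - big

# Propagation: any arithmetic involving a NaN yields another NaN.
total = bad + 1.5


def is_nan(x: float) -> bool:
    """Classic portable test: NaN is the only value not equal to itself."""
    return x != x


print(math.isinf(big), is_nan(bad), math.isnan(total))
```

With trapping enabled, the program would instead stop at the first of these events, which is usually much closer to the real bug.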

For memory issues, I usually recommend the Valgrind utility (http://www.valgrind.org).

What options are you using for both the PGI and Intel builds? Does the error occur at low optimization (-g)? Is your application parallel (OpenMP or MPI)?

Mat

Jon_Slavin · March 30, 2011, 1:53pm

Hi Mat,

Thanks for the reply. I’ll try using -Ktrap and may request that valgrind be installed on the cluster.

I have been using -C -g when compiling with pgfortran (but optimized when compiling with Intel ifort). The code does use MPI, though I created a non-MPI version for debugging purposes and am still getting the NaN’s.

I should mention that the code is mixed F77 and F90 (i.e. some subroutines/functions are F77 others are F90) in case that makes any difference.

Thanks,
Jon

MatColgrove · March 30, 2011, 4:06pm

Hi Jon,

Since the problem occurs with “-g”, it’s more likely a problem in the code.

One culprit may be an uninitialized variable. At low optimization, variables are loaded from and stored to memory, while at higher optimization, values are more likely to be kept in registers. If the variable happens to be kept in a register, it may have a consistent, valid initial value, whereas if it is loaded from memory it may contain garbage. Valgrind will be able to find uninitialized memory reads (UMR).

Another possible cause is that an optimization reorders the code so that the program error is masked. In this case, try compiling at full optimization and see if the problem goes away. If it does, then use a binary search method where you compile half the source files with optimization and the other half without. Continue until you find exactly which file causes the NaN. From there, you can use the “opt” directive to selectively optimize each routine in the file. Continue until you’ve narrowed it down to the exact routine. Finally, run the code in the debugger (PGDBG) and step through the routine to determine the exact cause of the NaN.
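
The file-level search could be scripted along these lines. This is only a sketch under stated assumptions: exactly one file is responsible, and `has_nan` is a hypothetical callback you would write yourself to rebuild with the given files unoptimized (everything else optimized), run the program, and report whether the NaN appeared. The file names are made up.

```python
from typing import Callable, FrozenSet, Sequence


def find_culprit(files: Sequence[str],
                 has_nan: Callable[[FrozenSet[str]], bool]) -> str:
    """Bisect the source files to find the one that triggers the NaN.

    Assumes a single culprit: the NaN shows up only when that file is
    compiled without optimization.
    """
    if not files:
        raise ValueError("no source files given")
    candidates = list(files)
    while len(candidates) > 1:
        half = candidates[:len(candidates) // 2]
        # Build with `half` unoptimized and every other file optimized.
        if has_nan(frozenset(half)):
            candidates = half                    # culprit is in this half
        else:
            candidates = candidates[len(half):]  # culprit is in the rest
    return candidates[0]


if __name__ == "__main__":
    # Stand-in for a real build-and-run step (hypothetical file names).
    sources = ["main.f90", "grid.f", "flux.f90", "io.f", "solver.f90"]
    fake_has_nan = lambda unoptimized: "flux.f90" in unoptimized
    print(find_culprit(sources, fake_has_nan))  # prints flux.f90
```

Each pass halves the candidate list, so even a few hundred source files need fewer than ten rebuilds.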

Hope this helps,
Mat