Errors when building with PGI compiler

Hello,
I am using PGI Compiler 11.10 (pgfortran and pgcc) to build CFS (a weather forecasting tool) on a 64-bit Intel x86 Linux platform (CentOS 5.6). The idea is to use PGI Accelerator directives for Nvidia GPUs in this code. This CFS code has previously been compiled with the Intel compilers (ifort and icc) and the Intel MPI library (4.0.1), and it executes successfully.

I have compiled the complete code using PGI compiler (pgfortran and pgcc) and Intel MPI library (4.0.1).

I am facing problems during the execution of the PGI-compiled binaries. The execution exits very early with the following error:

+++++++++++++++++++++++++++++


!!! Error in subr radiation_aerosols: unrealistic surface pressure = NaN
!!! Error in subr radiation_aerosols: unrealistic surface pressure = NaN
!!! Error in subr radiation_aerosols: unrealistic surface pressure = NaN

rank 34 in job 1 GPUBlade_44942 caused collective abort of all ranks
exit status of rank 34: killed by signal 9
++++++++++++++++++++++++++++++

In other words, the type of error tells me that the values of some variables are invalid (zero, negative, etc.), which causes warnings such as those below (“divide by zero”, etc.):

FORTRAN STOP
Warning: ieee_invalid is signaling
Warning: ieee_divide_by_zero is signaling
Warning: ieee_underflow is signaling
Warning: ieee_inexact is signaling

This could have happened because the PGI compiler is not generating the variables of appropriate size and/or precision or there could be implicit typecasting by PGI compiler.

Now the question: when changing the compiling/linking from Intel to PGI, I substituted compiler options that were as close as possible to the Intel ones. But there were still many options for which I couldn’t find an equivalent in the PGI compiler. For example:
-fpconstant
-Zp16
-auto
-parallel

Could you suggest what options I might be missing or using wrongly in PGI compilers (pgfortran and pgcc) ?

Thanks,
Nikhil

Hi Nikhil,

The first thing to try is an MPI library other than Intel’s version, such as OpenMPI or the MPICH library that ships with the PGI Compilers. Intel MPI doesn’t support PGI, so I don’t know what issues could occur. Granted, I doubt that this is the main cause of the errors, but I would take it out of the equation.

As for the Intel flags, “-fpconstant” says to promote single precision constants to double. Our closest option for this would be “-r8”, where the default kind of reals, including constants, is promoted to double.
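To make the constant-promotion point concrete, here is a minimal sketch (not taken from the CFS source) of the kind of code whose results depend on a flag like “-fpconstant” or “-r8”. The program name and variables are made up for illustration, and it assumes free-form source (for example a .f90 file).

```fortran
program fpconstant_demo
  implicit none
  double precision :: a, b

  ! 0.1 is a default (single precision) real constant. It is rounded to
  ! single precision first and only then widened when assigned to "a".
  a = 0.1

  ! 0.1d0 is a double precision constant from the start.
  b = 0.1d0

  print *, 'a     = ', a
  print *, 'b     = ', b
  ! The difference is nonzero unless the compiler promotes the constant.
  print *, 'a - b = ', a - b
end program fpconstant_demo
```

Keep in mind that “-r8” is broader than constant promotion, since it changes the default real kind for variables as well. Writing the constants with an explicit double precision suffix, as on the second assignment, removes the dependence on either flag.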

“-Zp16” manages the padding in structs and Fortran User Defined Types. My guess is that it’s used for C to Fortran compatibility. Unfortunately, we do not have an equivalent flag. Though if you are using both pgcc and pgfortran, then it shouldn’t matter, since our data representation of C structs matches that of Fortran types. The only issue might be if your C code makes some assumptions about a struct’s data layout and directly accesses bytes within a struct. If so, then this is a portability issue that will need to be fixed in your C code.

“-auto” says to store all local variables on the stack, including static variables. While we don’t have a directly equivalent flag, as part of our OpenMP flag, “-mp”, all locals are stored on the stack. So using “-mp” will have the same effect.

“-parallel” is Intel’s auto-parallelization flag. Our equivalent is “-Mconcur=allcores”. Though, I’m not sure why an MPI code would need auto-parallelization.

Have you double checked that the data you’re reading is correct? You had sent a note to PGI customer service about unformatted sequential access data. The way the Intel compiler lays out this data can be different from other compilers, so reading in a data file produced by a program compiled with Intel may be giving you bad input values.

  • Mat

Hi Matt,
Thanks for your response.
A couple of things:

  1. The earlier problem about reading data from a file was resolved after I used “-byteswapio” option in the relevant application code where the particular “read” function is getting invoked. The input file was generated on a big-endian machine (SGI) and it is provided “as is” to us. NOTE: I didn’t have to use any special option when using Intel compiler ifort.
  2. I plan to use OpenMPI with PGI a little later, after the current problem is resolved. In my opinion the current issue is related to the warnings that I am seeing, as below:
    +++++++++++++++++++++++++++++++++++
    Warning: ieee_invalid is signaling
    Warning: ieee_divide_by_zero is signaling
    Warning: ieee_underflow is signaling
    Warning: ieee_inexact is signaling
    FORTRAN STOP
    ++++++++++++++++++++++++++++++++++++
    NOTE: These warnings are generated during the execution of CFS binaries.
  3. I am using “-r8” wherever required.
  4. Additionally I tried “-Kieee” today. But it didn’t have any effect.
  5. Additionally I had tried “-i8” also along with “-r8” but no change.
  6. I am also using “-pc 64”. But without this option I had the same issues.
  7. Can ignoring “openmp” pragmas by using “-Mnoopenmp” cause any issues ? In one particular file (ref below), PGI compiler complains about the syntax of openmp pragma and exits with error:
    ++++++++++++++++++++++++++++++++++++
    PGF90-S-0155-jbasev may not appear in a PRIVATE clause (filtr1eo.f: 66)
    PGF90-S-0155-jbasod may not appear in a PRIVATE clause (filtr1eo.f: 66)
    PGF90-S-0155-jbasev may not appear in a PRIVATE clause (filtr1eo.f: 130)
    PGF90-S-0155-jbasod may not appear in a PRIVATE clause (filtr1eo.f: 130)
    ++++++++++++++++++++++++++++++++++++
    At both these line numbers in file filtr1eo.f, following omp pragma is used:
    ++++++++++++++++
    !$omp parallel do private(k,jbasev,jbasod,locl,l,indev,indod)
    ++++++++++++++++
    I used “-Mnoopenmp” in pgfortran to bypass this error.

Let me know your suggestions.

Thanks,
Nikhil

Want to add following extra info to my post:
I am using Intel Math Kernel Lib during the building of CFS code with PGI Fortran/C compiler.
Question: Could there be an issue using Intel MKL with PGI ?

Hi Nikhil,

  1. The earlier problem about reading data from a file was resolved after I used “-byteswapio”

Great.

Warning: ieee_invalid is signaling

Try adding the flag “-Ktrap=fp”. This will cause the run time to trap these errors and abort when they occur. Running the program in a debugger should show where they are occurring. Also, you may try using Valgrind (www.valgrind.org) to see if there is any uninitialized memory. It’s possible that the programmer assumed memory is initialized to zero or forgot to initialize some variables before use. Another flag to try is “-Msave”. This flag essentially adds the “SAVE” attribute to all local variables but has the side effect of initializing them to zero.
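As a complement to trapping, an explicit check just before the failing routine can help narrow down where the NaN first appears. The following is only a sketch: the array name psfc and its size are hypothetical, not taken from the CFS source, and it assumes your compiler provides the Fortran 2003 ieee_arithmetic module.

```fortran
program nan_check_demo
  use, intrinsic :: ieee_arithmetic, only: ieee_is_nan, ieee_value, &
                                           ieee_quiet_nan
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: n = 4
  real(dp) :: psfc(n)          ! hypothetical surface pressure array
  integer :: i

  psfc = 1000.0_dp
  ! Plant a NaN on purpose so that the check below has something to find.
  psfc(3) = ieee_value(psfc(3), ieee_quiet_nan)

  do i = 1, n
     ! Test for NaN first; the range test only runs on ordinary numbers.
     if (ieee_is_nan(psfc(i))) then
        print *, 'NaN surface pressure at index ', i
     else if (psfc(i) <= 0.0_dp) then
        print *, 'non-positive surface pressure at index ', i
     end if
  end do
end program nan_check_demo
```

Moving a check like this progressively earlier in the call chain shows whether the bad value comes from the input data or is produced by a computation.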

  6. I am also using “-pc 64”. But without this option I had the same issues.

Precision control (-pc) is for the x87 FP processor so has no effect here.

  7. Can ignoring “openmp” pragmas by using “-Mnoopenmp” cause any issues ?

It shouldn’t, though I would simply remove “-mp” instead of using “-Mnoopenmp”. The only thing I can think of that might affect results is that static local variables are placed on the stack when OpenMP is used. If there are bugs in the code (like uninitialized memory), this may cause different behavior.
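For illustration, this is the kind of pattern where static versus stack storage changes behavior. It is a made-up sketch, not CFS code: counter_mod and count_calls are hypothetical names. The unsafe form is described in the comments, and the code itself shows the explicit form that does not depend on compiler flags.

```fortran
module counter_mod
  implicit none
contains
  subroutine count_calls(total)
    integer, intent(out) :: total
    ! Unsafe pattern: declaring "integer :: ncalls" with no SAVE and no
    ! initial value, then incrementing it. That only appears to work when
    ! locals happen to be static and zero-filled. With stack storage the
    ! starting value is undefined.
    !
    ! Explicit form: SAVE plus an initial value, independent of flags.
    integer, save :: ncalls = 0

    ncalls = ncalls + 1
    total = ncalls
  end subroutine count_calls
end module counter_mod

program counter_demo
  use counter_mod, only: count_calls
  implicit none
  integer :: i, total

  do i = 1, 3
     call count_calls(total)
  end do
  print *, 'calls counted: ', total
end program counter_demo
```

Note that a saved counter like this is shared between threads, so it would need protection if the routine were called from inside an OpenMP parallel region.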

PGF90-S-0155-jbasev may not appear in a PRIVATE clause (filtr1eo.f: 66)

Can you post some of the code, particularly how jbasev and jbasod are declared?

Could there be an issue using Intel MKL with PGI ?

Doubtful, but you can try using ACML instead to double check (-lacml).

  • Mat

Hello Matt,
I added the debugging checks “-Mchkptr” and “-Mchkfpstk” while compiling my CFS code. When I run the CFS binary, I get this error:

Null pointer for gis%trie_ls (GFS_Initialize_ESMFMod.f: 640)

I am puzzled as to what the differences are between the PGI and Intel compilers. The same CFS code executes just fine when built with the Intel compiler.
Given that CFS is a huge and complex code, I am clueless at this point about the source of this problem.

BTW, I used -Msave option too wherever I could. In a couple of files I was getting following error when using “-Msave” option:
+++++++++++++++++++++++++++++++++++++++++
gfsio_module.o: In function `gfsio_module_gfsio_setgrbtbl_':
/home/nikhil/cfs_v2/sorc/cfs_global_atmos.fd/./gfsio_module.f:2265: undefined reference to `.STATICS35'
/home/nikhil/cfs_v2/sorc/cfs_global_atmos.fd/./gfsio_module.f:2265: undefined reference to `.STATICS35'
/home/nikhil/cfs_v2/sorc/cfs_global_atmos.fd/./gfsio_module.f:2265: undefined reference to `.STATICS35'
/home/nikhil/cfs_v2/sorc/cfs_global_atmos.fd/./gfsio_module.f:2265: undefined reference to `.STATICS35'
/home/nikhil/cfs_v2/sorc/cfs_global_atmos.fd/./gfsio_module.f:2265: undefined reference to `.STATICS35'
gfsio_module.o:/home/nikhil/cfs_v2/sorc/cfs_global_atmos.fd/./gfsio_module.f:2265: more undefined references to `.STATICS35' follow
+++++++++++++++++++++++++++++++++++++++++++
Therefore in those files I didn’t use this option. Could this also be an issue ?

  • Nikhil

Null pointer for gis%trie_ls (GFS_Initialize_ESMFMod.f: 640)
I am puzzled as to what the differences are between the PGI and Intel compilers. The same CFS code executes just fine when built with the Intel compiler.

It may not be the compiler. If the code contains errors such as accessing null pointers, non-deterministic behavior may occur. In other words, while the code may “work” with Intel, that may just be luck.
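A defensive check along the following lines makes this kind of error deterministic. This is only a sketch: state_t and the rank and size of trie_ls are assumptions for illustration, since the real declaration in GFS_Initialize_ESMFMod.f is not shown in this thread.

```fortran
program null_pointer_demo
  implicit none

  ! Hypothetical stand-in for the real state type. The default
  ! initialization "=> null()" gives the pointer a defined status.
  type :: state_t
     real, pointer :: trie_ls(:) => null()
  end type state_t

  type(state_t) :: gis

  ! Test the association status before the first use instead of
  ! relying on whatever the memory happens to contain.
  if (.not. associated(gis%trie_ls)) then
     print *, 'trie_ls is not associated; allocating before use'
     allocate(gis%trie_ls(10))
     gis%trie_ls = 0.0
  end if

  gis%trie_ls(1) = 1.0
  print *, 'trie_ls(1) = ', gis%trie_ls(1)

  deallocate(gis%trie_ls)
end program null_pointer_demo
```

Without the default initialization, the association status of the pointer is undefined, and even calling associated() on it is not reliable.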

Can you determine and fix the Null pointer error? Does Valgrind show anything? Can you compile with Intel’s diagnostic flags? What happens with other compilers, such as gfortran?

As for the STATICS error, I’ll let PGI Customer Service investigate since it looks like you already sent them this.

  • Mat

Hello Mat,
We fixed the NULL pointer error and now we don’t get any such error message. But the application execution still terminates as before without giving any error or debug message. I have tried using various combinations of PGI compiler options including “-Ktrap=fp”, “-Mchkptr”, “-Mchkfpstk”, “-Mfprelaxed”, “-Msave”, “-Mnodaz” and “-Mnoflushz”. Still it does not generate any debug, error or warning messages.
I didn’t get a chance to try with Valgrind.

With Intel compiler using “-check all” option, I could see the same error message during execution of the binary. But as I mentioned earlier, if I don’t add this option, the Intel compiled binary executes successfully.

Haven’t tried with gfortran.
Thanks,
Nikhil

Hello Mat,
You had requested the code fragment where pgfortran gives error in OpenMP directives. Below is the code fragment. Note the openmp pragma which gives error:
PGF90-S-0155-jbasev may not appear in a PRIVATE clause (filtr1eo.f: 66)
++++++++++++++++++++++++++++++++++
      integer k,l,locl,n
cc
      integer indev
      integer indod
      integer indev1,indev2
      integer indod1,indod2
      real(kind=kind_evod) filtb
cc
      real(kind=kind_evod) cons0p5,cons1 !constant
cc
      integer indlsev,jbasev
      integer indlsod,jbasod
cc
      include 'function2'
cc
cc
      CALL countperf(0,13,0.)
      cons0p5 = 0.5d0 !constant
      cons1 = 1.d0 !constant
cc
cc
      filtb = (cons1-filta)*cons0p5 !constant
cc
cc
!$omp parallel do private(k,jbasev,jbasod,locl,l,indev,indod)
!$omp+private(indev1,indev2,indod1,indod2)
      do k=1,levs
cc
        do locl=1,ls_max_node
          l=ls_node(locl,1)
          jbasev=ls_node(locl,2)
          indev1 = jbasev+(L-L)/2+1
          if (mod(L,2).eq.mod(jcap+1,2)) then
            indev2 = jbasev+(jcap+1-L)/2+1
          else
            indev2 = jbasev+(jcap -L)/2+1
          endif
          do indev = indev1 , indev2
cc
+++++++++++++++++++++++++++++++++++++++++
Thanks,
Nikhil

Hi Nikhil,

I tried but was unable to recreate the error with the code snippet. Can you try to create a complete example that produces the syntax error when compiled? If it is too large or contains proprietary code, please send the source to PGI Customer Service (trs@pgroup.com) and ask them to forward it to me.

Thanks,
Mat

% cat test.f
	integer k,l,locl,n
!cc
	integer indev
	integer indod
	integer indev1,indev2
	integer indod1,indod2
	real(kind=8) filtb
!cc
	real(kind=8) cons0p5,cons1 !constant
!cc
	integer indlsev,jbasev
	integer indlsod,jbasod
!cc
!include 'function2'
!cc
!cc
	CALL countperf(0,13,0.)
	cons0p5 = 0.5d0 !constant
	cons1 = 1.d0 !constant
!cc
!cc
	filtb = (cons1-filta)*cons0p5 !constant
!!cc
!!cc
!$omp parallel do private(k,jbasev,jbasod,locl,l,indev,indod) 
!$omp+ private(indev1,indev2,indod1,indod2)
	do k=1,levs
!!cc
	  do locl=1,ls_max_node
	     l=ls_node(locl,1)
	     jbasev=ls_node(locl,2)
	     indev1 = jbasev+(L-L)/2+1
	     if (mod(L,2).eq.mod(jcap+1,2)) then
	       indev2 = jbasev+(jcap+1-L)/2+1
	     else
	       indev2 = jbasev+(jcap -L)/2+1
	     endif
	     do indev = indev1 , indev2
	     enddo
	  enddo
	enddo

	end

% pgf90 test.f -c -mp -Minfo=mp
MAIN:
     25, Parallel region activated
     27, Parallel loop activated with static block schedule
     41, Parallel region terminated

Hi Mat,
On further investigation, I was able to isolate the OpenMP problem to two files. I have emailed those files to trs (Dave) and asked that they be forwarded to you.
I get the OpenMP compilation error in these two files only:

filtr1eo.f
filtr2eo.f

Thanks,
Nikhil