CAM and PGI 7.1-5

We’re running cam3-1.1.p1 (8 OpenMP threads over OpenMPI 1.2.5) on 10 nodes of 8-core Intel/Linux systems, and I’ve found a “memory leak” consuming about 140MB per hour on each compute node. I ran valgrind with full checks over the cam process and it didn’t find any faults with CAM itself, just quite a few errors when it was opening libraries at the very beginning.
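In case the exact invocation matters, the check was along these lines (just a sketch — the namelist redirection and log name stand in for however cam is normally launched on a node):

valgrind --leak-check=full --show-reachable=yes --error-limit=no ./cam < namelist >& valgrind.cam.log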

For parts of CAM the default optimization level is -O (-O2?), which is where we are seeing the leak. When I tested with -O1 there was a memory leak as well, or at least I believe that test was built with -O1.

Build script options -

If an executable doesn’t exist, build one.

if ( ! -x $blddir/cam ) then
    cd $blddir || echo "cd $blddir failed" && exit 1
    $cfgdir/configure -spmd -smp -fopt '-O1' \
        -nc_lib /soft/local/netcdf/netcdf-3.6.2/lib \
        -nc_inc /soft/local/netcdf/netcdf-3.6.2/include \
        -cc pgcc -fc pgf90 -res 128x256 \
        -mpi_inc /soft/local/openmpi/openmpi-1.2.5/include \
        -mpi_lib /soft/local/openmpi/openmpi-1.2.5/lib
endif


Are there any known issues with this setup? In my experience, trouble at higher optimization levels shows up as segmentation faults, but I’ve never seen a memory leak.

Hi Elvedin,

Although I have not seen a situation where a particular optimization causes a memory leak, it is a possibility. I have CAM3-1.1.p1 and valgrind here and will look into it today or Monday.

Thanks for the report,
Mat

Thanks, I’ll keep looking into it as well.

Hi Elvedin,

I’ve spent a few days looking at this and think I have an understanding of what you’re seeing. For reference, I used PGI 7.1-5 and CAM v3.0, which is slightly different from your version. I built CAM using “-g”, “-O2 -gopt” and “-fast -gopt” and then compared the resulting valgrind outputs.

At “-g” there are no compiler optimizations, so the few reported errors were strictly from CAM. It appears that there are a few uninitialized variables, but no huge problems. At “-O2” I saw little difference in the valgrind logs versus “-g”, and valgrind did not report any memory leaks. However, at “-fast” I saw thousands of valgrind errors, which caused valgrind to abort early since it stops reporting once the error count gets too high.
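For what it’s worth, the only real difference between the three builds was the optimization string; using your configure line it would look roughly like this, with “...” standing for the remaining options held constant across the runs:

$cfgdir/configure -fopt '-g'          ...
$cfgdir/configure -fopt '-O2 -gopt'   ...
$cfgdir/configure -fopt '-fast -gopt' ...

(-gopt asks pgf90 to generate debug information without changing the optimizations, which keeps line numbers in the valgrind traces.)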

Example Valgrind error when CAM is compiled at “-fast”:

==2851== Conditional jump or move depends on uninitialised value(s)
==2851==    at 0x482A3E: cldnrh_ (/tmp/cam1/models/atm/cam/src/physics/cam1/cldnrh.F90:101)
==2851==    by 0x1CCBB587: ???
==2851==    by 0x44A429F: ???
==2851==    by 0x44A441F: ???
==2851==    by 0xD267DF: ???
==2851==    by 0x44BE55F: ???
==2851==    by 0x1CCC0407: ???
==2851==    by 0x44A15DF: ???
==2851==    by 0x44A459F: ???
==2851==    by 0xC0487EF: ???
==2851==    by 0xBFD85BF: ???
==2851==    by 0xD2AA1F: ???

As you can see, Valgrind is fairly confused by the optimized code. I think it’s unable to follow where the compiler has stored variables in registers, and so it prints out thousands of these uninitialized conditional-jump messages.
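If you want to re-run valgrind against an optimized build without drowning in that noise, the standard memcheck switches should help (I haven’t tried these on CAM specifically):

valgrind --undef-value-errors=no --error-limit=no --leak-check=full ./cam < namelist

That silences the “uninitialised value” reports and removes the early abort while still printing the leak summary at exit. Adding “-gopt” to the PGI flags also helps valgrind map addresses back to source lines in the stack traces.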

At this point I’m more inclined to believe that the errors you’re seeing are due to Valgrind’s reading of the optimized code rather than the compiler creating a memory leak at high optimization. That said, I didn’t repeat your exact experiment, so please let me know if your interpretation is different and I can pursue the issue further.

- Mat

Valgrind found no faults with CAM; it’s just that memory increase (>=140MB per hour) we keep seeing on each node. Under a “-g” debug build we get no such memory increase. You should be able to reproduce our setup by building with the default optimizations.
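For the record, the growth can be watched with something as simple as this (a sketch, not our exact script; it assumes the executable is named cam and sums the RSS of all cam processes on the node):

while (1)
    echo -n "`date +%T` "
    ps -C cam -o rss= | awk '{sum += $1} END {printf "%.1f MB\n", sum/1024}'
    sleep 600
end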