profiler inconsistency?

Hi all,

I’m trying to profile a few subroutines within a fairly large f90 code, so I compiled and linked with -Mprof=func, and ran several tests. At the same time within the code I used etime and dtime to have time estimates for the same subroutines and I found two types of inconsistency:

a) for a subroutine which only reads in data, dtime tells me it took 61 s (user+sys), while the profiler, for the same run and subroutine says 359. The call to the subroutine is preceded and followed by dtime.

b) subroutine A has a few simple statements and small do while loop, then two nested do loops from which two subroutines (B and C) are called alternatively. at each pass.
The self times and costs cumulative and per call, of B and C are constants within accuracy, but the parent subroutine A has different self times during different tests: 246 s with 1 print statement, and 509 without!!! everything else being the same.
None of the mentioned subroutines makes system calls, or outputs anything with the exception of the one print statement.

these are all the options I’m using:
-Mvect=cachesize:524288 -Munroll -Mnoframe -O2 -pc 64 -Mfree -Mprof=func

any idea?

thanks a lot!

Hi Giovanni,

I passed this on to our Tools Group but they’re not sure. Can you please send an example and a description on how to recreate the problem?


Hi Mat,

I looked into it, but sending you the whole thing with data and all you need to run would be about 1Gb, and taking that specific piece of code offline would require quite some work.

I was able to determine that part of the inconsistencies were due to hardware problems of the specific node I was using on a red hat linux cluster. Once I switched node results did improve a lot, but the problem b) changed slightly, even after I got rid of the print statements and compiled with no optimization at all:

subroutine A has two nested do loops, for each pass it executes alternatively subroutine B xor C, depending on a dynamic condition and on a second condition determined through a configuration file once per run. If conditions are such that C is never executed the self time of A is 0.5 s. If C is executed (~ 70% of the times), the self time of A is 44 s, and note that C is at least 10 times faster B. Also what is done in A, outside B and C is always the same.

I’m not sure what’s going on,

Hi Giovanni,

A couple of things to note. Adding print statements can effect what optimizations can be performed and might account for some of the difference. Try various optimization levels, "-O1’, “-O2”, “-fast”, and “-fastsse”, to see how your code is effected.

Also, “-Mprof=func” instruments your code which is a fairly intrusive profiling method. You might consider using a hardware counter sampling method instead as described in Chapter 2 of the PGI Tool’s Guide.

  • Mat