profiler inconsistency?

zvanaz · March 10, 2007, 1:35am

Hi all,

I’m trying to profile a few subroutines within a fairly large f90 code, so I compiled and linked with -Mprof=func and ran several tests. At the same time, within the code I used etime and dtime to get time estimates for the same subroutines, and I found two types of inconsistency:

a) For a subroutine which only reads in data, dtime tells me it took 61 s (user+sys), while the profiler, for the same run and subroutine, reports 359 s. The call to the subroutine is preceded and followed by calls to dtime.

b) subroutine A has a few simple statements and small do while loop, then two nested do loops from which two subroutines (B and C) are called alternatively. at each pass.
The self times and costs cumulative and per call, of B and C are constants within accuracy, but the parent subroutine A has different self times during different tests: 246 s with 1 print statement, and 509 without!!! everything else being the same.
None of the mentioned subroutines makes system calls, or outputs anything with the exception of the one print statement.
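
On a), one thing worth checking is whether the two numbers measure the same quantity: dtime reports CPU time (user+sys), and if the profiler's figure is elapsed time, a routine that mostly waits on I/O would show a large gap. That is only an assumption about this case, not something established here. A minimal sketch of the effect, in Python rather than f90 for brevity, with a hypothetical `fake_read` standing in for the read routine and `timed` as an illustrative helper:

```python
import time


def timed(func, *args):
    """Run func and return (result, cpu_seconds, wall_seconds)."""
    cpu0 = time.process_time()    # user + system CPU time of this process
    wall0 = time.perf_counter()   # elapsed (wall-clock) time
    result = func(*args)
    cpu = time.process_time() - cpu0
    wall = time.perf_counter() - wall0
    return result, cpu, wall


def fake_read(delay):
    """Hypothetical stand-in for an I/O-bound read: it waits, using almost no CPU."""
    time.sleep(delay)
    return "done"


result, cpu, wall = timed(fake_read, 0.2)
print(f"cpu = {cpu:.3f} s, wall = {wall:.3f} s")
```

In Fortran the analogous pair would be the standard `cpu_time` and `system_clock` intrinsics placed around the same call.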

these are all the options I’m using:
-Mvect=cachesize:524288 -Munroll -Mnoframe -O2 -pc 64 -Mfree -Mprof=func

any idea?

thanks a lot!
Giovanni

MatColgrove · March 13, 2007, 11:07pm

Hi Giovanni,

I passed this on to our Tools Group but they’re not sure. Can you please send an example and a description on how to recreate the problem?

Thanks,
Mat

zvanaz · March 16, 2007, 1:21am

Hi Mat,

I looked into it, but sending you the whole thing, with the data and everything you need to run it, would be about 1 GB, and taking that specific piece of code offline would require quite some work.

I was able to determine that part of the inconsistencies were due to hardware problems on the specific node I was using on a Red Hat Linux cluster. Once I switched nodes, results improved a lot, but problem b) changed slightly, even after I got rid of the print statements and compiled with no optimization at all:

Subroutine A has two nested do loops; on each pass it executes either subroutine B or subroutine C (never both), depending on a dynamic condition and on a second condition determined through a configuration file once per run. If the conditions are such that C is never executed, the self time of A is 0.5 s. If C is executed (~70% of the time), the self time of A is 44 s; note that C is at least 10 times faster than B. Also, what is done in A outside B and C is always the same.

I’m not sure what’s going on,
Thanks!
Giovanni

MatColgrove · March 17, 2007, 3:24pm

Hi Giovanni,

A couple of things to note. Adding print statements can affect what optimizations can be performed and might account for some of the difference. Try various optimization levels, “-O1”, “-O2”, “-fast”, and “-fastsse”, to see how your code is affected.

Also, “-Mprof=func” instruments your code, which is a fairly intrusive profiling method. You might consider using a hardware counter sampling method instead, as described in Chapter 2 of the PGI Tools Guide.
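
To illustrate why instrumentation can distort self times, here is a toy model. It is written in Python and is not how the PGI profiler is actually implemented; the hook cost, the tick counts, and all names are made up. Each instrumented call pays a fixed hook cost just outside the callee's own timestamps, so that cost is charged to the caller's self time while the callee's time stays constant:

```python
class FakeClock:
    """Deterministic clock (integer ticks) so the numbers are reproducible."""

    def __init__(self):
        self.now = 0

    def advance(self, ticks):
        self.now += ticks


class ToyProfiler:
    """Function-level instrumentation with a fixed cost per entry/exit hook."""

    def __init__(self, clock, hook_cost):
        self.clock = clock
        self.hook_cost = hook_cost
        self.inclusive = {}    # name -> time between start and stop stamps
        self.child_time = {}   # name -> inclusive time of direct callees
        self.stack = []

    def call(self, name, body):
        self.clock.advance(self.hook_cost)   # entry hook, before the start stamp
        start = self.clock.now
        self.stack.append(name)
        body()
        self.stack.pop()
        elapsed = self.clock.now - start
        self.clock.advance(self.hook_cost)   # exit hook, after the stop stamp
        self.inclusive[name] = self.inclusive.get(name, 0) + elapsed
        if self.stack:                       # credit the time to the caller's children
            parent = self.stack[-1]
            self.child_time[parent] = self.child_time.get(parent, 0) + elapsed

    def self_time(self, name):
        return self.inclusive.get(name, 0) - self.child_time.get(name, 0)


def run(hook_cost, passes=1000):
    """Return (self time of A, self time of B) for a given hook cost."""
    clock = FakeClock()
    prof = ToyProfiler(clock, hook_cost)

    def b():
        clock.advance(10)        # B's real work per call

    def a():
        for _ in range(passes):
            clock.advance(1)     # A's own work per pass
            prof.call("B", b)

    prof.call("A", a)
    return prof.self_time("A"), prof.self_time("B")


print(run(0))   # no instrumentation cost
print(run(2))   # 2 ticks per hook
```

With these made-up numbers, `run(0)` gives A a self time of 1000 ticks and `run(2)` gives 5000, while B stays at 10000 in both. This does not diagnose the 0.5 s versus 44 s difference; it only shows that a caller's self time can move while its callees' times stay put.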

Mat
