Interpreting PGPROF "-Mprof=lines" output

Hello,

I’m trying to profile some code to see why it doesn’t scale well with OpenMP, and I can’t figure out what PGPROF is telling me.

Compiled without -mp but with -Mprof=lines, the three most time-consuming parts of my code are “_roupri”, “__linent2”, and “_rouret”. What do those mean, and given that they take up ~50% of my serial code’s runtime, is there anything I can do to shorten that? Also, if I try double-clicking on the entries in PGPROF I get taken to assembler code rather than to specific lines in my Fortran code; how can I see which lines of my code are the source of the time sinks? As a bonus question, is “_mp_get_tcpus” related to OpenMP, and why would it be called regularly from a serial code?

Compiled with -mp and -Mprof=lines, the three most demanding entries in PGPROF are “__linent2”, “_roupri”, and “mp_ecs”. I’ve already asked about the first two, but what’s that last one?

If I forget about -Mprof=lines and just compile with -mp and -Minfo=ccff, the two most time-consuming entries (accounting for 74% of the CPU time!!) are “mp_barrier” and “mp_barrierw”. I’d love to know which critical section or atomic addition is the root of this delay so I could program around it, but once again double clicking on either of these takes me to assembler code. How can I determine which part of the OpenMP code is causing the delays?

Many thanks in advance.

I’m bumping this just to make sure it wasn’t forgotten.

Compiled without -mp but with -Mprof=lines, the three most time-consuming parts of my code are “_roupri”, “__linent2”, and “_rouret”. What do those mean, and given that they take up ~50% of my serial code’s runtime is there anything I can do to shorten that?

These are the routines used to instrument the code for profiling. They do add overhead, but 50% seems excessive unless your overall runtime is very small and/or you have a large number of executed lines of source. You might try removing “-Mprof” and instead use pgcollect. It uses hardware-based sampling to profile the code, so it is less accurate for little-used sections of code, but is fine for the hot spots.
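A rough sketch of that workflow (the program name is made up, and the exact flags may differ for your setup; check the PGPROF documentation):

```shell
# Build without -Mprof instrumentation; -Minfo=ccff keeps the
# compiler-feedback info that PGPROF uses for source correlation.
pgfortran -fast -Minfo=ccff -o myprog myprog.f90

# Run the program under the sampling collector, then browse the result.
pgcollect ./myprog
pgprof -exe ./myprog
```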

Also, if I try double-clicking on the entries in PGPROF I get taken to assembler code rather than a specific line(s) in my Fortran code; how can I see which lines of my code are the source of the time sinks?

Hmm, with -Mprof=lines the only time it should show you assembly is when you click on a source line or on a library. Though, if the code is optimized, lines may get moved around or consolidated. You can try dialing down optimization and adding “-gopt” to have more debugging information available.
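For example, something along these lines (again, the file name is illustrative):

```shell
# Lower optimization plus -gopt keeps debug information under optimization,
# so profile entries map back to Fortran source lines more reliably.
pgfortran -O1 -gopt -Mprof=lines -o myprog myprog.f90
```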

As a bonus question, is “_mp_get_tcpus” related to OpenMP, and why would it be called regularly from a serial code?

That routine gets the maximum number of threads available. For serial code, we link with our OpenMP runtime to enable process binding. “_mp_get_tcpus” is called from the OpenMP initialization routine, but it should only be called a few times.

Compiled with -mp and -Mprof=lines, the three most demanding entries in PGPROF are “__linent2”, “_roupri”, and “mp_ecs”. I’ve already asked about the first two, but what’s that last one?

The last one is a single line of assembly code which tests a semaphore. You can adjust the environment variable “MP_SPIN” to a lower value so that the threads enter a sleep state instead of spinning.
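For instance (the value shown is just an example, not a tuned recommendation; see the PGI documentation for the exact semantics):

```shell
# Spin fewer times on the semaphore before yielding the CPU,
# trading some wake-up latency for less busy-waiting.
export MP_SPIN=100
./myprog
```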

If I forget about -Mprof=lines and just compile with -mp and -Minfo=ccff, the two most time-consuming entries (accounting for 74% of the CPU time!!) are “mp_barrier” and “mp_barrierw”.

We create the OpenMP threads at the start of your program and put them into a wait state until you enter an OpenMP parallel region. So most likely these threads are just waiting for the master thread and are not causing a delay.

  • Mat