OpenMP example programs not working as expected

In the manual “PGI® User’s Guide: Parallel Fortran, C and C++ for Scientists and Engineers” there is an example program on pp. 136-137 called “CRITICAL_USE”.

I compile this using
pgf95 -mp critical_use.F -o critical_use
and change the number of threads available using
export OMP_NUM_THREADS=<n>
and vary the number of threads from 1 to 8. I have pgf90 6.1-1 (64-bit target) on x86-64 Linux; there are 8 cores available per node, sharing 20 GB of memory per node.

I do not see any speedup when varying the number of threads. There is another example program in the User’s Guide called “VECTOR_OP” on p. 28 [example 2-3]. I can get a speedup in that program using the vector compile option versus the scalar version, but I cannot get any parallel speedup when compiling it with the “-Mconcur -fastsse” options and varying OMP_NUM_THREADS.
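
(For context, the loop in “VECTOR_OP” is a simple array operation, something like the following; this is a sketch from memory with made-up names, not the guide’s exact listing:)

program vector_op_sketch
   implicit none
   integer, parameter :: n = 10000000
   integer :: i
   real, allocatable :: x(:), y(:)
   allocate(x(n), y(n))
   y = 1.0
   ! a straightforward loop: -fastsse should vectorize it,
   ! and -Mconcur may auto-parallelize it across threads
   do i = 1, n
      x(i) = 4.0*y(i) - 1.0
   end do
   print *, x(1), x(n)
end program vector_op_sketch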

I did find another PGI OpenMP example at
http://www.pgroup.com/openmpbench_dir/fftpde/
and this works as expected, i.e. I do get a speedup as I vary OMP_NUM_THREADS.

So my question really goes back to the top of this page: why am I not seeing a speedup in the “critical_use.F” program? Could someone else out there try it and see what they get?

This is important because I’ve got another program I’m working on where I am not seeing any speedup from OpenMP directives, and it’s fairly straightforward: just a bunch of very trivial loops to parallelize, roughly the shape sketched below.
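
(A simplified, self-contained sketch of the kind of loop I mean; the real arrays and bounds differ:)

program trivial_loops
   implicit none
   integer, parameter :: n = 50000000
   integer :: i
   real, allocatable :: a(:), b(:), c(:)
   allocate(a(n), b(n), c(n))
   a = 1.0
   b = 2.0
   ! each iteration is independent, so OpenMP can split the range
!$omp parallel do
   do i = 1, n
      c(i) = a(i) + 2.0*b(i)
   end do
!$omp end parallel do
   print *, c(1), c(n)
end program trivial_loops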

Thanks in advance.

Hi haferman,

So my question really goes back to the top of this page: why am I not seeing a speedup in the “critical_use.F” program? Could someone else out there try it and see what they get?

How are you measuring the runtime? The example is not meant to show parallel speed-up, but rather how critical sections can be used.
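
A critical section lets only one thread at a time into the enclosed block, so a loop whose body is dominated by a critical section runs essentially serially no matter how many threads you give it. A minimal sketch of the pattern (not the guide’s actual listing):

program critical_sketch
   implicit none
   integer :: i
   double precision :: total
   total = 0.0d0
!$omp parallel do
   do i = 1, 10000000
!$omp critical
      ! only one thread at a time executes this update, so it is
      ! race-free but effectively serialized
      total = total + sqrt(dble(i))
!$omp end critical
   end do
!$omp end parallel do
   print *, 'total =', total
end program critical_sketch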

  • Mat

I’m simply measuring wall clock time using the “time” command.

I ask because, for me at least, the example takes less than a second, even with 1 thread. Have you modified it so that it will run longer?
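
Also, “time” measures the whole run, including program start-up. To time just the region under test you can use omp_get_wtime from omp_lib; a minimal, self-contained sketch:

program wtime_demo
   use omp_lib
   implicit none
   integer :: i
   double precision :: s, t0, t1
   s = 0.0d0
   t0 = omp_get_wtime()          ! wall-clock time before the region
!$omp parallel do reduction(+:s)
   do i = 1, 100000000
      s = s + dble(i)
   end do
!$omp end parallel do
   t1 = omp_get_wtime()          ! wall-clock time after the region
   print *, 'sum =', s, '  elapsed =', t1 - t0, ' seconds'
end program wtime_demo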

  • Mat

Takes about 3 seconds for me, independent of the number of threads. I put another loop (k = 1,10) outside of the !$OMP directives (roughly the shape sketched below), and it takes 24 seconds with 1 thread, speeding up only to 20 seconds with 4 threads… which doesn’t give me the warm fuzzies that the work is being split up efficiently…
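
(Schematically, the modification has this shape; a sketch of the structure only, assuming the work inside is a critical-section update like the one above, not the guide’s actual code:)

program critical_repeat
   implicit none
   integer :: i, k, hits
   hits = 0
   do k = 1, 10                  ! extra outer repeat loop, outside the directives
!$omp parallel do
      do i = 1, 1000000
!$omp critical
         hits = hits + 1         ! the critical section still serializes this work
!$omp end critical
      end do
!$omp end parallel do
   end do
   print *, 'hits =', hits
end program critical_repeat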

On the other hand, the example at http://www.pgroup.com/openmpbench_dir/fftpde/
takes about 3 seconds with 1 thread, 2 seconds with 2 threads, 1.5 seconds with 3 threads, and 1 second with 4 threads, so even though it runs fast to begin with, the speedup is obvious…