PGI FORTRAN OpenMP: poor performance in a big loop???

Hello, All,

I am a PGI FORTRAN (2011) OpenMP user, and I am currently trying to add OpenMP parallel computing to my FORTRAN program. If I run the parallel region in isolation, it is faster than the sequential version. But if the parallel region sits inside a big loop (so the threads are created/closed on every iteration), it is much slower than the sequential version. Have you seen anything similar? The big loop around the parallel region is needed for the computation to complete, and that outer loop itself is almost impossible to parallelize. Is there any solution for this?

Thank you!
Nick

Hi Nick,

There could be any number of reasons why this is happening: synchronization issues, memory, a coding error, etc. Without an example, we don’t really have any way of knowing.

What I’d suggest is profiling your code with ‘pgcollect’ and then reviewing the resulting profile in PGPROF (details on how to profile can be found in the PGPROF User’s Guide). This will give a better idea of where the time is being spent. In particular, look for the “mp_barrier” routine. My best guess at this point is that your threads are getting stuck waiting for each other.

Also, if you could post a basic example of what you are doing, including the OpenMP directives, that may help.

Best Regards,
Mat

Hi, Mat,

Thank you for your help!

This is the example code:

CALL omp_set_num_threads(10)

DO J = 1, 1000
!$OMP PARALLEL
!$OMP DO PRIVATE(I)
   DO I = 1, 10
!$OMP TASK
      CALL MY_SUB()
!$OMP END TASK
   END DO
!$OMP END DO
!$OMP END PARALLEL
END DO

SUBROUTINE MY_SUB()
   CALL SLEEP(0.01)
END SUBROUTINE MY_SUB


Inside the “DO” loop, the parallel computing becomes slower than the sequential computing.

Thank you again.
Nick

Hi Nick,

Your program is hanging in the SLEEP routine, since it expects an integer but you’re passing in a float. To fix it, change “0.01” to “1”.
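For example:

SUBROUTINE MY_SUB()
   CALL SLEEP(1)   ! SLEEP takes an integer number of seconds
END SUBROUTINE MY_SUB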

Hope this helps,
Mat

Hi, Mat,

The current sample may not reproduce the problem well (I put it together in a rush over the weekend). I will rebuild the sample code when I am back in my office on June 11. Thank you!!!

Nick

Dear Mat,

Please try the following FORTRAN code – the parallel version is much slower than the sequential version (just remove the “!$OMP” directives for the sequential run):

SUBROUTINE OpenMP_EX()
IMPLICIT NONE

INTEGER :: I, J, K
DOUBLE PRECISION :: A

CALL omp_set_num_threads(10)

DO J = 1, 10000
!$OMP PARALLEL
!$OMP DO PRIVATE(I)
   DO I = 1, 10
!$OMP TASK
      DO K = 1, 10**3
         A = DEXP(1D0)
      END DO
!$OMP END TASK
   END DO
!$OMP END DO
!$OMP END PARALLEL
!  WRITE(*,*) "J = ", J
END DO

END SUBROUTINE OpenMP_EX

Hi, Mat,

The following FORTRAN code shows even more clearly that the “parallel” version is much slower than the “sequential” version:


SUBROUTINE OpenMP_EX()
IMPLICIT NONE

INTEGER :: I, J, K
DOUBLE PRECISION :: A

CALL omp_set_num_threads(10)

DO J = 1, 1000
!$OMP PARALLEL
!$OMP DO PRIVATE(I)
   DO I = 1, 10
!$OMP TASK
      DO K = 1, 10**5
         A = DEXP(1D0)
      END DO
!$OMP END TASK
   END DO
!$OMP END DO
!$OMP END PARALLEL
   WRITE(*,*) "J = ", J
END DO

END SUBROUTINE OpenMP_EX

Thanks Nick, that helped. The problem here is that, by default, we don’t destroy and then recreate the threads; instead the threads are put into an active wait mode (OMP_WAIT_POLICY=ACTIVE), where they actively spin on a barrier waiting to be reused. Almost all of your program’s time is being spent waiting on this barrier.

The number of cycles a thread spends checking the barrier is controlled by the environment variable “MP_SPIN”. So the fix is to set MP_SPIN to a small value (like 0 for no wait). The caveat is that your CPU utilization will be pegged at 100% for all threads even when your program is not running in parallel.
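To make this concrete, here is the shape of your example again (declarations as in your subroutine); the comments just restate where the waiting described above happens:

CALL omp_set_num_threads(10)

DO J = 1, 1000
!$OMP PARALLEL               ! the threads already exist and are reused here
!$OMP DO PRIVATE(I)
   DO I = 1, 10
!$OMP TASK
      DO K = 1, 10**5
         A = DEXP(1D0)       ! the actual work
      END DO
!$OMP END TASK
   END DO
!$OMP END DO                 ! implicit barrier at the end of the worksharing loop
!$OMP END PARALLEL           ! join barrier: with OMP_WAIT_POLICY=ACTIVE the worker
                             ! threads keep checking this barrier (MP_SPIN cycles
                             ! at a time) until iteration J+1 opens the next region
END DO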

Hope this helps,
Mat

Dear Mat,

Thank you for your help!

I tried “C>set MP_SPIN=0” to set MP_SPIN to zero and reran the example code – but the parallel version is still slow. Is that the right way to set MP_SPIN to zero? Nick

Hi, Mat,

I also tried “c:>set OMP_WAIT_POLICY=PASSIVE” (Windows 7) and reran the example code – but the parallel version is still slow! Is this the correct way to set OMP_WAIT_POLICY to PASSIVE, or do I need to restart my PC after this setting? Nick

Hi Nick,

You don’t want PASSIVE since this will put the threads to sleep, making it worse.

While I was using Linux before, I have just tested Windows with “set MP_SPIN=0” and it worked as expected. Granted, not as well as on Linux, but there was a speed-up when going from the default to setting “MP_SPIN”. Note that since this code does very little work – it’s really just measuring the OpenMP overhead – I wouldn’t expect any parallel speed-up.

Mat

Dear Mat,

Thank you for your quick response!

I set MP_SPIN=0 and OMP_WAIT_POLICY=ACTIVE through the Windows Control Panel and reran the example code – the parallel version is still much slower than the sequential one. I may need to restart my machine – I will do that…

Thank you again!
Nick

I set MP_SPIN=0 and OMP_WAIT_POLICY=ACTIVE through Windows Control panel

If you still don’t see any change, can you try setting the variables from a PGI DOS command window?

Thanks,
Mat

I have tried setting those variables through both the Control Panel and a DOS command window (is this the PGI DOS command window?), but the parallel version is still much slower than the sequential one. Currently I am using the “-O2 -fast” options for the PGI FORTRAN compiler, which make the sequential version quite fast, so by comparison the parallel version looks slow. I will check this when I get back to my office tomorrow. Thank you! Nick

Dear Mat,

(1) As you suggested, MP_SPIN is set to 0.
(2) If the example code is compiled with the option “-mp”, the parallel version is faster than the sequential version.
(3) If the example code is compiled with the options “-mp -O3 -fast”, the sequential version is much faster than the parallel version (with MP_SPIN set to 0).

It seems the options “-O3 -fast” only speed up the sequential version. So is sequential computing with “-O3 -fast” a better choice than parallel computing?

Nick

Hi Nick,

So is sequential computing with “-O3 -fast” a better choice than parallel computing?

No, you can’t conclude that for the general case. I think a more likely scenario here is that the compiler is optimizing away the calculations (the value of A is never used, so the whole loop can be removed). The only thing you’re measuring here is the OpenMP overhead.
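If you want the test to measure real work, the result has to be used so it can’t be eliminated. A sketch along these lines (I’ve dropped the TASK construct, since the “!$OMP DO” already distributes the I loop, and I’ve made the result depend on K):

SUBROUTINE OpenMP_EX()
IMPLICIT NONE

INTEGER :: I, J, K
DOUBLE PRECISION :: A

A = 0D0
CALL omp_set_num_threads(10)

DO J = 1, 1000
!$OMP PARALLEL
!$OMP DO PRIVATE(I, K) REDUCTION(+:A)
   DO I = 1, 10
      DO K = 1, 10**5
         A = A + DEXP(DBLE(MOD(K, 7)))   ! result depends on K, so the loop
      END DO                             ! cannot simply be removed
   END DO
!$OMP END DO
!$OMP END PARALLEL
END DO

WRITE(*,*) "A = ", A   ! using A keeps the work from being optimized away

END SUBROUTINE OpenMP_EX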

For OpenMP performance, I’d recommend looking at some benchmarks such as NAS Parallel Benchmarks or the SPEC OMP2001 suite.

Mat

Dear Mat,

Thank you for your quick response!

It seems that, for some code structures, “-O3 -fast” can make the sequential version faster than the parallel version, but this is not the general case. I will keep an eye on this and report if I see more cases of this kind.

Thank you again for your help.
Nick

Have you tried it with SCHEDULE(STATIC)?
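For example, on the worksharing loop in the code above:

!$OMP DO SCHEDULE(STATIC) PRIVATE(I)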