Failing to launch OpenMP threads in PVF

Hi everyone,

I used the following source code to test OpenMP parallel performance in two compilers, PGI Visual Fortran (PVF) and Intel Visual Fortran (IVF), on the same desktop (an AMD quad-core FX-4130 CPU).

The disturbing outcome was that my PVF configuration failed to launch all four threads. Can anyone help me figure out how to get my PVF OpenMP settings right, and why PVF took far longer to complete both the sequential and the parallel calculation (setting aside the OpenMP failure itself)?

Below I list the source code I used for this experiment, then the results obtained with PVF and IVF in order. At the end I list the corresponding command-line flags shown in the Visual Studio property pages for IVF and PVF respectively.

Thanks,
Li

Edit: In the end, I figured out how to launch all 4 threads with the PVF compiler. (The PVF output is updated below.) However, the timings still did not compare well with those generated by the IVF compiler. I am wondering whether this is an inherent flaw of the PVF compiler.

program main
  use omp_lib
  implicit none
  integer :: stime, etime, k = 10000000
  integer :: i, thread_id, tid, nthreads
  integer, allocatable :: x(:), y(:), z(:)
  allocate(x(k), y(k), z(k))

  write(*,*) '---- Sequential section ----'
  call system_clock(stime)
  do i = 1, k
    x(i) = 2*i
    y(i) = i
    z(i) = x(i) + y(i)
  enddo
  call system_clock(etime)
  write(*,*) 'Sequential elapsed time: ', etime-stime, 'microseconds'

  write(*,*) '---- OpenMP section ----'
  ! First parallel region: check how many threads actually launch.
  !$omp parallel private(thread_id)
  thread_id = omp_get_thread_num()
  write(*,*) 'Thread ', thread_id, ': Hello.'
  !$omp barrier
  write(*,*) 'Thread ', thread_id, ': Bye bye.'
  !$omp end parallel

  ! Second parallel region: time the same loop as a worksharing do.
  !$omp parallel private(tid)
  tid = omp_get_thread_num()
  nthreads = omp_get_num_threads()
  write(*,*) 'Threads = ', nthreads
  !$omp barrier
  ! every thread writes stime right after the barrier, so the values are effectively identical
  call system_clock(stime)
  !$omp do
  do i = 1, k
    x(i) = i
    y(i) = 2*i
    z(i) = x(i) + y(i)
  enddo
  !$omp end do
  !$omp end parallel
  call system_clock(etime)
  write(*,*) 'OpenMP elapsed time:', etime-stime, 'microseconds'

end program main

The results from PVF:

 ---- Sequential section ----
 Sequential elapsed time:         62000 microseconds
 ---- OpenMP section ----
 Thread             0 : Hello.
 Thread             2 : Hello.
 Thread             3 : Hello.
 Thread             1 : Hello.
 Thread             1 : Bye bye.
 Thread             0 : Bye bye.
 Thread             2 : Bye bye.
 Thread             3 : Bye bye.
 Threads =             4
 Threads =             4
 Threads =             4
 Threads =             4
 OpenMP elapsed time:        16000 microseconds

The results from IVF:

 ---- Sequential section ----
 Sequential elapsed time:          580 microseconds
 ---- OpenMP section ----
 Thread            2 : Hello.
 Thread            1 : Hello.
 Thread            0 : Hello.
 Thread            3 : Hello.
 Thread            0 : Bye bye.
 Thread            2 : Bye bye.
 Thread            3 : Bye bye.
 Thread            1 : Bye bye.
 Threads =            4
 Threads =            4
 Threads =            4
 Threads =            4
 OpenMP elapsed time:          60 microseconds

The flags I used for compiling:

IVF: /nologo /O3 /Qopenmp /module:"Release\\" /object:"Release\\" /Fd"Release\vc100.pdb" /libs:static /threads /c

PVF: -g -Bstatic -Mbackslash -mp -fastsse -Mipa=fast,inline -O3 -Mvect=simd:256 -Minline -Mframe -Munroll=n:4 -Mconcur -Knoieee -Minform=warn -Minfo=mp

I think the main issue is that you are assuming system_clock counts in microseconds. The Fortran standard doesn't specify what the count rate has to be, so each compiler's runtime is free to pick its own.

For example, on my Linux box, with ifort, if you run:

call system_clock(count_rate=clock_rate)

it returns 1000000. With PGI, it returns 10000000. My guess is that on your box Intel's clock rate is 100x lower than PGI's (assuming the sequential runtimes are actually about the same).
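
For example, here is the count_rate query above wrapped in a complete little program you can compile with both compilers to see what each runtime reports:

program check_clock_rate
  implicit none
  integer :: clock_rate
  ! system_clock reports how many counts make up one second
  call system_clock(count_rate=clock_rate)
  write(*,*) 'count_rate = ', clock_rate, ' counts per second'
end program check_clock_rate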

Thus, to be consistent, you have to know the rate and divide by it:

integer :: clock_start, clock_end, clock_rate
real :: elapsed_time
...
call system_clock(count_rate=clock_rate)
call system_clock(count=clock_start)
...
call system_clock(count=clock_end)
! convert to real before dividing so integer division doesn't truncate the result
elapsed_time = real(clock_end - clock_start) / real(clock_rate)

I’m pretty sure this sequence puts elapsed_time in seconds, so if you want milliseconds, say, you’ll need to multiply by 1000.

Of course, if this doesn't make things look consistent…then there's a problem!
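
Another option, since your code already uses omp_lib: omp_get_wtime() returns wall-clock time in seconds as a double precision value, so the count-rate question goes away entirely. A minimal sketch:

use omp_lib
double precision :: t0, t1
...
t0 = omp_get_wtime()   ! wall-clock time in seconds
! ... work to be timed ...
t1 = omp_get_wtime()
write(*,*) 'elapsed: ', t1 - t0, ' seconds'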

Hope this helps,
Matt

Thank you for the answer, Matt. The new comparison I made today is shown below. It is just as you said: the clock rate of IVF is 100x lower than that of PVF. Dividing by the rates, the sequential times are 68000/1000000 = 68 ms for PVF versus 660/10000 = 66 ms for IVF, so based on the new results in release mode PVF is comparable with IVF, at least in the sequential part. (The OpenMP loop is still slower: 16 ms versus 7 ms.)

Li

PVF:

 clock_rate      1000000
 ---- Sequential section ----
 Sequential elapsed time:         68000 
 ---- OpenMP section ----
 Thread             0 : Hello.
 Thread             3 : Hello.
 Thread             2 : Hello.
 Thread             1 : Hello.
 Thread             1 : Bye bye.
 Thread             2 : Bye bye.
 Thread             0 : Bye bye.
 Thread             3 : Bye bye.
 Threads =             4
 Threads =             4
 Threads =             4
 Threads =             4
 OpenMP elapsed time:        16000

IVF:

 clock_rate       10000
 ---- Sequential section ----
 Sequential elapsed time:          660 
 ---- OpenMP section ----
 Thread            0 : Hello.
 Thread            1 : Hello.
 Thread            2 : Hello.
 Thread            3 : Hello.
 Thread            1 : Bye bye.
 Thread            0 : Bye bye.
 Thread            2 : Bye bye.
 Thread            3 : Bye bye.
 Threads =            4
 Threads =            4
 Threads =            4
 Threads =            4
 OpenMP elapsed time:          70