Building threaded shareable libraries.

Hugh · July 31, 2006, 12:45pm

Hello,

I’ve built a set of FORTRAN routines as a shareable library that is called from RSI-IDL (Interactive Data Language, ITT Visual Information Solutions) via the CALL_EXTERNAL function. I’d like to convert these routines to be parallel/threaded. The subroutines contain a main loop over an array that performs an independent calculation on each element, so each iteration should be able to run on a different processor.

How do I set up the compile command to enable threading, i.e. which command-line switches should I use?

What directives would be best in the routines around the DO…ENDDO loops? I would like to avoid MPI, as I am only interested in harnessing the other CPUs in the machines (for now…).

Finally, how do I ensure the call from IDL is processing on all available CPUs on the machine?

Thanks,
Hugh

MatColgrove · July 31, 2006, 8:38pm

Hi Hugh,

You should be able to accomplish this using OpenMP directives or using your PGI compiler’s auto-parallelization feature. Chapters 5 and 6 of the PGI User’s Guide give a good introduction to using OpenMP.

You can also try using the “-Mconcur” option to enable auto-parallelization. Although “-Mconcur” can’t parallelize everything, it usually does a good job for cases like the one you describe. To verify that “-Mconcur” correctly parallelizes your loop, add the “-Minfo=mp” flag and review the output. Depending upon the structure of your code, you may need to use “-Mconcur=innermost”. Again, please refer to the PGI User’s Guide for more information.

how do I ensure the call from IDL is processing on all available CPUs on the machine?

At runtime, you need to set the environment variable “NCPUS” to the number of CPUs available.

Hope this helps,
Mat
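As a rough illustration of the OpenMP route for a loop whose iterations are independent, here is a minimal sketch. The subroutine name `scale_points` and the per-element arithmetic are placeholders, not Hugh's routines, and the source is assumed to be free-form (for example a .f90 file) compiled with OpenMP enabled (“-mp” with pgf90).

```fortran
! Sketch only: scale_points and its arithmetic are hypothetical.
! Each iteration writes only its own element y(j), so the loop can
! be divided among threads with one combined directive.
subroutine scale_points(npts, x, y)
  implicit none
  integer, intent(in)  :: npts
  real,    intent(in)  :: x(npts)
  real,    intent(out) :: y(npts)
  integer :: j

!$omp parallel do default(none) shared(npts, x, y) private(j)
  do j = 1, npts
     y(j) = sqrt(abs(x(j))) + 1.0   ! placeholder per-element work
  end do
!$omp end parallel do
end subroutine scale_points
```

Without OpenMP enabled the directives are treated as comments and the loop runs serially with the same result, which makes it easy to compare the two builds.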

Hugh · August 1, 2006, 3:02pm

Hi Mat (and anyone else!)

OK, for testing purposes, I’ve defined the following in my parallel routine:

     print*,'Max Threads:',omp_get_max_threads()

!$OMP PARALLEL SHARED(F,NPTS,NE,E,FL,BM) DEFAULT(NONE)
!$OMP DO PRIVATE(I, FLUX,L,B)
      DO J=1, Npts
         FLUX.N = nE
         DO I=1, nE
            FLUX.ENERGY(i) = E(i)
         ENDDO

         DO I=1,100
            flux.iflux(1) = flux.iflux(1) +
     &                      (I**0.5)*ALOG(J)*COS(I*3.14159)
         ENDDO
C         CALL GETFLUX(FLUX, FL(J), BM(J))

D        PRINT*,'FL, BM:', L,B,Flux.Iflux(1)
         DO I=1,FLUX.N
D           PRINT*,FLUX.Energy(i), Flux.IFlux(i), Flux.DFlux(i)
            F(i,J) = FLUX.IFlux(i)
         ENDDO
      ENDDO
!$OMP END DO
!$OMP END PARALLEL

The

print*,'Max Threads:',omp_get_max_threads()

statement reports (correctly?) the setting of NCPUS. Unfortunately, when I view the CPU usage, only one CPU ever seems to be busy, and consequently the run time is about the same as for a single thread (NCPUS=1). Here’s the output:

Max Threads:                        1
Module          Type  Count     Only(s)   Avg.(s)     Time(s)   Avg.(s)
SPENVIS_TREP    (U)       1   15.860973 15.860973   15.923153 15.923153

 Max Threads:                        2
Module          Type  Count     Only(s)   Avg.(s)     Time(s)   Avg.(s)
SPENVIS_TREP    (U)       1   15.785954 15.785954   15.848143 15.848143

The output from the make command is:

pgf90 -i8 -fPIC -Bstatic -tp p7-64 -fast -mp -Minfo   -c -o trep_sp.o trep_sp.f
PGF90-W-0093-Type conversion of expression performed (trep_sp.f: 50)
  0 inform,   1 warnings,   0 severes, 0 fatal for trep_sp
trep_sp:
    41, Parallel region activated
    43, Parallel loop activated; static block iteration allocation
    61, Barrier
        Parallel region terminated
rm -f idlSpenvisTrep.linux.64.a
pgf90 -o idlSpenvisTrep.linux.64.so idlSpenvisTrep.o trep_sp.o bext.o bint.o format.o putils.o shell.o transfos.o ae8max.o  ap8max.o  ecp95bd.o  models.o psb97bd.o  trepltv.o  trepstat.o up8min.o   ae8min.o  ap8min.o  ecvbd.o pcp94bd.o  trarap.o up8max.o    utils.o -fast -mp -Minfo   -shared -fPIC -tp p7-64

Any help getting this to distribute across the CPUs would be gratefully received.

Thanks,
Hugh
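One way to check whether a loop scales, independently of the IDL profiler figures above, is to time the parallel region directly with `omp_get_wtime`. The following is only a sketch under assumptions: the function name `timed_sum` and the summed expression are made up for illustration, and free-form source compiled with OpenMP enabled is assumed.

```fortran
! Sketch only: timed_sum is a hypothetical helper, not part of trep_sp.f.
! It times a simple parallel reduction and returns the elapsed seconds.
function timed_sum(n) result(elapsed)
  use omp_lib
  implicit none
  integer, intent(in) :: n
  double precision :: elapsed, t0, total
  integer :: j

  total = 0.0d0
  t0 = omp_get_wtime()              ! wall-clock time before the loop
!$omp parallel do default(none) shared(n) private(j) reduction(+:total)
  do j = 1, n
     total = total + sqrt(dble(j))  ! placeholder work
  end do
!$omp end parallel do
  elapsed = omp_get_wtime() - t0    ! wall-clock time spent in the loop

  print *, 'max threads:', omp_get_max_threads()
  print *, 'sum:', total, ' seconds:', elapsed
end function timed_sum
```

Running it with NCPUS=1 and then NCPUS=2, using a large enough `n`, should show whether the wall-clock time actually drops.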

Hugh · August 1, 2006, 3:47pm

Hello All,

As an update, I just tried this on one of our other machines (Dual Core AMD Opteron™ Processor 280, SuSE Linux 9.3 (x86-64)) and the load distribution worked.

However, it doesn’t seem to work on the original machine (Intel(R) Xeon™ CPU 3.40GHz, SuSE Linux 9.2 (x86-64)).

Odd…

Hugh

MatColgrove · August 1, 2006, 3:57pm

Hi Hugh,

You need to set the number of threads to run either via the environment variables “NCPUS” or “OMP_NUM_THREADS”, or via a call to “omp_set_num_threads”. In your code, I would add the following:

    integer max_threads
     ...
    max_threads = omp_get_max_threads()
    print*,'Max Threads:', max_threads
    call omp_set_num_threads(max_threads)

!$OMP PARALLEL SHARED(F,NPTS,NE,E,FL,BM) DEFAULT(NONE)
!$OMP DO PRIVATE(I, FLUX,L,B)

Mat

Christopher_Hulbert · August 1, 2006, 4:18pm

You can also print out the number of threads used in the parallel region or print each thread index.

See omp_get_num_threads and omp_get_thread_num.
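A minimal sketch of that check, assuming free-form source compiled with OpenMP enabled. The function name `team_size` is hypothetical; only the `omp_lib` routines are standard.

```fortran
! Sketch only: team_size is a hypothetical helper.
! omp_get_max_threads reports an upper bound; the calls below report
! what is actually used inside the parallel region.
function team_size() result(n)
  use omp_lib
  implicit none
  integer :: n

  n = 0
!$omp parallel default(none) shared(n)
  ! Every thread in the team prints its own index.
  print *, 'Thread', omp_get_thread_num(), 'of', omp_get_num_threads()
!$omp single
  n = omp_get_num_threads()   ! recorded once, by a single thread
!$omp end single
!$omp end parallel
end function team_size
```

If this only ever prints thread 0 of 1 while omp_get_max_threads reports 2, the region is running serially despite the setting.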

Hugh · August 2, 2006, 8:58am

Many thanks, it all works a treat now.

Hugh