OpenMP threads problem on Intel CPU


My system runs CentOS 5.11 on two Intel® Xeon® E5-2650 CPUs with Hyper-Threading enabled, so the machine reports 32 threads available.
When I run an OpenMP program with OMP_NUM_THREADS set to 8 or more, the application does not run correctly. On a system with AMD CPUs, the same program works fine.
On the Intel CPUs the program prints this error message:
“OMP_NUM_THREADS or NCPUS value (8) is invalid”
I can use at most 4 processors, but I want to use 8 or more to speed up my forecast model. My pgf90 version is 6.2, which is rather old.

Here is a program, check2.f, that determines if your system
can run OpenMP across multiple threads.

pgf90 check2.f dclock_64.s -o check2 -mp
Here is check2.f

        program test
        integer i,ii,j,k
        integer omp_get_num_procs, omp_get_max_threads
        integer omp_get_num_threads
        integer thread(20)
        integer,parameter:: max_thrd=4
        real*8 dclock, time1, time2
!       thread counts to test: 1, 2, 4, ..., 2**max_thrd
        do i=1,max_thrd+1
                thread(i) = 2**(i-1)
        end do
        print *,"number of cores  =", omp_get_num_procs()
        print *,"max threads =", omp_get_max_threads()
!       outside a parallel region this always reports 1
        print *,"current num threads =", omp_get_num_threads()
!       call system("uname -a")
        j = 200    ! may want to change this
        do ii=1,max_thrd+1
!               same total work each pass, split over thread(ii) threads
                call omp_set_num_threads(thread(ii))
                time1 = dclock()
!$omp parallel
!$omp do
                do k = 1, thread(max_thrd+1)
                        call delay(j)
                end do
!$omp end do
!$omp end parallel
                time2 = dclock() - time1
                print *, thread(ii)," core test - delay value =",
     *         j*thread(max_thrd+1),
     *          " time =", time2, " seconds"
        end do
        end

       subroutine delay(n)
       integer n
       integer i
       do i=1,n
          call abc()
       end do
       end

       subroutine abc()
       integer i
       do i=1,1000000
           call def()
       end do
       end

       subroutine def()
       end

Here is the timing routine dclock_64.s

        .file   "dclock-hammer.s"
        .data
        .align    8
# .clock is seconds per clock tick: enable the line matching your CPU clock
# .clock:  .double 0.000000001           # 1.0 GHz
# .clock:  .double 0.000000000750187     # 1.33GHz
# .clock:  .double 0.000000000714        # 1.4 GHz
# .clock:  .double 0.000000000666        # 1.5 GHz
# .clock:  .double 0.000000000625        # 1.6 GHz
# .clock:  .double 0.00000000059         # 1.7 GHz
# .clock:  .double 0.0000000005556       # 1.8 GHz
# .clock:  .double 0.0000000005          # 2.0 GHz
# .clock:  .double 0.000000000455        # 2.2 GHz
# .clock:  .double 0.000000000417        # 2.4 GHz
 .clock:  .double 0.000000000376        # 2.66 GHz
# .clock:  .double 0.000000000357        # 2.8 GHz
# .clock:  .double 0.0000000003333       # 3.0 GHz
# .clock:  .double 0.0000000003125       # 3.2 GHz
# .clock:  .double 0.0000000002777       # 3.6 GHz

.low:   .long 0x00000000
.high:  .long 0x00000000

        .text
        .globl   _DCLOCK, dclock, _dclock, _dclock_, dclock_
_DCLOCK:
dclock:
_dclock:
_dclock_:
dclock_:
        .byte   0x0f, 0x31              # rdtsc: time-stamp counter in edx:eax

        movl    %eax, .low(%RIP)
        movl    %edx, .high(%RIP)

        fildll  .low(%RIP)              # load the 64-bit tick count
        fmull   .clock(%RIP)            # ticks * seconds per tick
        fstpl   -24(%rsp)
        movsd   -24(%rsp), %xmm0        # return the time as a double
        ret

You should see the same (non-memory-using) computations run
and timed on 1, 2, 4, 8, and 16 threads. Each time the
thread count doubles, the processing time should be
cut in half.
If you extend to 32 threads (max_thrd=5), the extra hyper-threads, which do not sit on separate cores, should not speed anything up. OpenMP should assign real cores first, before it assigns hyper-threads (which are a second set of registers on the same CPU core).

pgf90 -V
on your system should indicate what type of CPUs the compiler thinks you have.

If this works and your code still fails, send us your failing program
so we can take a look.


I just reread the forum entry and realized you have the 6.2 release. Back then
we did limit the number of threads OpenMP created, probably to 8 or 16.

If you want to use more threads, download the Community Edition 17.4
and use that. It will be a better experience.


Hi, Dave

I tried Community Edition 17.4 and the program compiled successfully, but when I ran it, it used only one thread. I think the compiler flags may be the problem, but I don’t really understand what should be changed in the Makefile.

The following lines are the flags set in the Makefile.
LDFLAGS = -L $(PGI)/linux86-64/$(ver)/libso -L $(PGI)/linux86-64/$(ver)/lib -L $(RAP_SHARED_LIB_DIR) -L $(RAP_LIB_DIR)
LOC_PGFFLAGS= -mcmodel=medium -fast -g77libs -mp -Minform=inform -Wl,-Bstatic
GFFLAGS= -mcmodel=medium -fast -g77libs -mp -Minform=inform -Wl,-Bstatic
FFLAGS = -mcmodel=medium -Msave -fast -g77libs -mp -Minform=inform
F90FLAGS = -mcmodel=medium -Msave -fast -g77libs -mp -Minform=inform
LOC_CPPC_CFLAGS = -DPGI_IN_USE -mcmodel=medium

LOC_LDFLAGS = -L $(PGI)/linux86-64/$(ver)/libso -L $(PGI)/linux86-64/$(ver)/lib
LOC_LIBS=-lSpdb -lMdv -ldsserver -ldidss -leispack \
    -lrapformats -ltoolsa -lrapmath -ldataport -ltdrp \
    -lpgmp -lpgf90 -lpgf90_rpm1 -lpgf902 \
    -lpgf90rtl -lpgftnrtl $(PGI)/linux86-64/$(ver)/lib/nonuma.o \
    -lnspgc -lpgc -lm -lgcc -lc -lgcc \
    -lpthread -lrt

I also tested the check2.f code: it can run with up to 16 threads no matter which pgf90 version I use. I have tested 6.2, 8.0, 8.0-5, 12.10 and 17.4, and all of them are OK. However, my own program cannot use more than 4 threads on the Intel CPUs.

Many thanks,

Make sure you use the -mp switch during compilation so that
OpenMP directives are processed.

Make sure you use the -mp switch during LINKING so that the multi-threaded OpenMP routines are linked, rather than the dummy single-threaded ones.
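
One rough way to check whether -mp took effect in both steps is a tiny test program like the sketch below. It is only an illustration: the program name check_mp and the free-form source layout are assumptions, not part of the original build. The `!$` lines are OpenMP conditional-compilation lines, so without -mp at compile time they are treated as comments and the program reports 1 thread. If it also reports 1 when built with -mp and OMP_NUM_THREADS is greater than 1, the link step is a likely suspect.

```fortran
program check_mp
  implicit none
  integer :: nthreads
!$ integer, external :: omp_get_num_threads

  nthreads = 1                 ! value kept if OpenMP is not enabled
!$omp parallel
!$omp master
  ! only the master thread records the team size
!$ nthreads = omp_get_num_threads()
!$omp end master
!$omp end parallel

  print *, 'threads inside parallel region =', nthreads
end program check_mp
```

Something like `pgf90 -mp check_mp.f90 -o check_mp` builds it in a single step (the file name is hypothetical). Compiling and linking it in two separate steps, with the same flags as the Makefile uses, would exercise the link line as well.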

I would drop

-mcmodel=medium -g77libs
and see if things still run.

I would not try to list all the PGI libs that are normally linked, because
that is the driver’s job, and the list of libs and their order change.

Just add the libs that are not linked by default.


Hi, Dave

I tried your suggestions, dropping -g77libs and adding -mp during linking, but the program still ran on only one thread even though OMP_NUM_THREADS was greater than one.

Actually, the libraries have to be listed during linking, or the linker reports “undefined reference to XXX” errors. I don’t fully understand what the driver does during linking, or which driver does this. I need to learn more about it.

Below are the messages printed while compiling the program. Maybe they can help you diagnose what goes wrong.

make _CC="gcc34" _CPPC="pgc++ -mp" _FC="pgf90" _F90C="pgf90" \
	DBUG_OPT_FLAGS="" target
make[1]: Entering directory `/work3/myclo/VDRAS/VDRAS_ice_ter'
echo Making C++ program ... ; \
	make _CC="gcc34" _CPPC="pgc++ -mp" _FC="pgf90" DBUG_OPT_FLAGS="" DEBUG_CFLAGS="" DEBUG_LIBS="" DEBUG_LDFLAGS="" SYS_LIBS="" SYS_CFLAGS=" -DLINUX_IL6 -D_BSD_TYPES -DF_UNDERSCORE2" adjoint_moist;
Making C++ program ...
make[2]: Entering directory `/work3/myclo/VDRAS/VDRAS_ice_ter'
Many thanks again.

You are linking PGI 8.0 libs with PGI 17.4? Risky.

Instead of -L/path/to/pgi/f90/libs -lpgi1lib -lpgi2lib etc
just use


so the current pgf90 libs are linked with pgc++ in the correct
versions and order.

Your use of -Msave and -mcmodel=medium suggests you are declaring large arrays statically, and that you want them all to be zero upon program entry. This makes for very long execution times, as these huge
arrays sit inside the executable and have to be loaded with it.

Better to create and initialize the big arrays dynamically: smaller and faster.
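
As a rough illustration of the dynamic approach, here is a minimal free-form sketch. It is not taken from the forecast model: the program name alloc_demo, the array name big, and the size n are all made up for the example. The array is requested at run time with allocate and zeroed explicitly, instead of being declared statically and saved.

```fortran
program alloc_demo
  implicit none
  integer, parameter :: n = 2000          ! hypothetical array dimension
  real*8, allocatable :: big(:,:)         ! hypothetical work array
  integer :: i, j, ierr

  ! Request the memory at run time instead of declaring big(n,n) statically.
  allocate(big(n,n), stat=ierr)
  if (ierr /= 0) stop 'could not allocate big'

  ! Zero the array explicitly; with -mp the columns are split across threads.
!$omp parallel do private(i)
  do j = 1, n
     do i = 1, n
        big(i,j) = 0.0d0
     end do
  end do
!$omp end parallel do

  print *, 'allocated and zeroed', n*n, 'elements'
  deallocate(big)
end program alloc_demo
```

Without -mp the directive lines are plain comments, so the same source also builds and runs serially.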