OpenMP excessive power consumption for waiting threads

The OpenMP power consumption test uses the -p argument to primes1 or primes3, which involves ordered output, i.e. one thread writing output at a time while the other threads wait their turn. I expect the waiting threads to be idle or to consume little CPU. That is not the case: I see the full 6400% CPU utilization (AMD Threadripper 3970X, 64 logical CPU threads) just for printing prime numbers to /dev/null. Nothing like GNU GCC, which consumes just 173% for the same test.
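
For reference, here is a minimal sketch of the ordered-output pattern the -p option exercises (hypothetical code, not the actual primes1.c or primes3.c source):

#include <stdio.h>

int main(void) {
    long nchunks = 64;
    /* each thread sieves one chunk at a time; the ordered block forces the
       chunks to be printed in ascending order, so a thread that finishes a
       later chunk early must wait for earlier chunks to print */
    #pragma omp parallel for ordered schedule(static, 1)
    for (long chunk = 0; chunk < nchunks; chunk++) {
        /* ... sieve primes for this chunk into a local buffer ... */
        #pragma omp ordered
        {
            /* ... write this chunk's primes to stdout ... */
            printf("chunk %ld\n", chunk);
        }
    }
    return 0;
}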

Prime Demos

gcc -o primes1.gcc -O3 -fopenmp -I../src primes1.c -lm
clang -o primes1.clang -O3 -fopenmp -I../src primes1.c -lm
nvc -o primes1.nvc -O3 -mp=multicore -I../src primes1.c -lm

gcc -o primes3.gcc -O3 -fopenmp -I../src primes3.c -L/usr/local/lib64 -lprimesieve -lm
clang -o primes3.clang -O3 -fopenmp -I../src primes3.c -L/usr/local/lib64 -lprimesieve -lm
nvc -o primes3.nvc -O3 -mp=multicore -I../src primes3.c -L/usr/local/lib64 -lprimesieve -lm

OpenMP Ordered Power Consumption Test

Threadripper 3970X idle (browser NV forums page)  120 watts

./primes1.gcc   1e10 -p >/dev/null   10.173 secs, 201 watts
./primes1.clang 1e10 -p >/dev/null   12.729 secs, 288 watts
./primes1.nvc   1e10 -p >/dev/null   21.346 secs, 322 watts

./primes3.gcc   1e10 -p >/dev/null    7.092 secs, 181 watts
./primes3.clang 1e10 -p >/dev/null    8.876 secs, 274 watts
./primes3.nvc   1e10 -p >/dev/null   11.080 secs, 361 watts

OpenMP Performance Test

Threadripper 3970X idle (browser NV forums page)  120 watts

./primes1.gcc   1e12                 16.168 secs, 399 watts
./primes1.clang 1e12                 16.274 secs, 395 watts
./primes1.nvc   1e12                 14.780 secs, 393 watts

./primes3.gcc   1e12                  5.762 secs, 437 watts
./primes3.clang 1e12                  6.277 secs, 434 watts
./primes3.nvc   1e12                  5.755 secs, 442 watts

Am I wrong to wish for the waiting threads to be idle until signaled that the mutex lock is available? The GNU GCC compiler passes, consuming much less power for the ordered block.

Needless to say, I’m not in favor of mutex spin-loops if that is the reason for the high power consumption with NVIDIA OpenMP ordered. What about thousands or millions of compute nodes (also cloud) using the NVIDIA HPC compilers, running ordered or exclusive blocks? Does that mean cloud customers pay for extra power consumption simply because threads are waiting their turn?

I’m hoping the NVIDIA compiler engineers can resolve this.

I first witnessed the power consumption issue using Codon and will be submitting an issue ticket for LLVM.

I created an issue ticket for LLVM.

Hi Mario,

Yes, OMP_WAIT_POLICY is set to “active” by default, meaning threads will enter a spin lock when waiting. Setting OMP_WAIT_POLICY to “passive” will have them sleep when waiting.

Another environment variable to consider is MP_SPIN, which sets the number of times each thread checks the mutex lock before sleeping. This is similar to GNU’s GOMP_SPINCOUNT. Our default is 1,000,000, but you can lower it by setting MP_SPIN.

Note that setting OMP_WAIT_POLICY to passive or MP_SPIN to a small value can have a detrimental impact on performance, as the cost of waking the threads can be high. The exact impact will depend on how often the threads need to be woken.

Hope this helps,
Mat

Using nvc, I was unable to get the waiting threads to minimize power consumption the way GNU gcc’s implementation does.

OMP_WAIT_POLICY – How waiting threads are handled in GNU GCC

" Description:

Specifies whether waiting threads should be active or passive. If the value is PASSIVE, waiting threads should not consume CPU power while waiting; while the value is ACTIVE specifies that they should. If undefined, threads wait actively for a short time before waiting passively. "

The GNU GCC implementation works well and behaves as described.

                        ./primes1.gcc 1e10 -p >/dev/null   172% CPU Utilization
OMP_WAIT_POLICY=passive ./primes1.gcc 1e10 -p >/dev/null   133%
OMP_WAIT_POLICY=active  ./primes1.gcc 1e10 -p >/dev/null  6400%

This was a misunderstanding on my part of your original question. OMP_WAIT_POLICY has to do with what happens to the threads between OpenMP regions. Here all the compute and prints occur within a single parallel region, so setting it to passive wouldn’t have any effect. (I did confirm that OMP_WAIT_POLICY works as expected between parallel regions.)
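
A minimal sketch of that distinction (hypothetical code, not the primes demos): in between_regions the team idles between two parallel regions, which is the situation OMP_WAIT_POLICY governs per the explanation above; in inside_one_region the threads wait at the ordered construct inside a single region while another thread prints.

#include <stdio.h>
#include <unistd.h>

void between_regions(void) {
    #pragma omp parallel
    { /* ... first region ... */ }

    sleep(5);   /* the team idles here, in between the two regions */

    #pragma omp parallel
    { /* ... second region ... */ }
}

void inside_one_region(void) {
    #pragma omp parallel for ordered schedule(static, 1)
    for (int i = 0; i < 64; i++) {
        #pragma omp ordered
        printf("%d\n", i);   /* threads wait their turn inside the region */
    }
}

int main(void) {
    between_regions();
    inside_one_region();
    return 0;
}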

When I run both nvc and gcc, I don’t see a difference in the CPU%:

With nvc:
% env OMP_NUM_THREADS=2 numactl -C 0-1 time nvc.out 1e10 -p > /dev/null
Seconds: 13.513
26.97user 0.01system 0:13.51elapsed 199%CPU (0avgtext+0avgdata 11344maxresident)k
0inputs+0outputs (0major+1933minor)pagefaults 0swaps

With gcc:
% env OMP_NUM_THREADS=2 numactl -C 0-1 time gcc.out 1e10 -p > /dev/null
Seconds: 9.931
19.80user 0.02system 0:09.93elapsed 199%CPU (0avgtext+0avgdata 2980maxresident)k
0inputs+0outputs (0major+1541minor)pagefaults 0swaps

Though I do see the % drop when setting “passive” with gcc:

% env OMP_NUM_THREADS=2 numactl -C 0-1 time gcc.out 1e10 -p > /dev/null
Seconds: 10.314
13.71user 0.02system 0:10.31elapsed 133%CPU (0avgtext+0avgdata 2992maxresident)k

My guess is that they are only using one thread to execute the ordered region, as opposed to nvomp, which has each thread wait its turn to print.

I’ll need to talk with our OpenMP engineers to see if this behavior would be something we’d want to replicate.

I’m also wondering if the power difference is more due to the longer run time caused by the I/O issue when redirecting output from an ordered region that you reported earlier, i.e. TPR#34995?

Each thread computes primes in parallel for a segment. With the -p argument, each thread outputs its primes in an orderly fashion.

Try running on a large box with OMP_WAIT_POLICY=passive and compare nvc vs. gcc. Take note of the total CPU utilization. This issue request is about the NVIDIA OpenMP implementation consuming unnecessary power compared to gcc. The same is true of clang. High CPU utilization equates to higher power consumption.

In other words, with OMP_WAIT_POLICY=passive each waiting thread still consumes close to 100% of a CPU. The issue is more noticeable when running more threads. Is this unexpected behavior for the passive policy?

Using Intel’s OpenMP library libiomp5 (default OMP_WAIT_POLICY=passive), I see CPU utilization decrease gradually as threads complete processing, though it still does not reach GCC’s low CPU utilization for waiting threads.

# NVC
./primes1.nvc 1e10 -p >/dev/null
Seconds: 21.616

LD_PRELOAD=/home/mario/miniconda3/envs/mandel/lib/libiomp5.so \
./primes1.nvc 1e10 -p >/dev/null
Seconds: 14.971

# GCC (low CPU utilization, low power consumption)
./primes1.gcc 1e10 -p >/dev/null
Seconds: 10.197

I found an example on the web that involves no I/O.

program Console3
   use omp_lib
   implicit none
   integer i
   !$OMP PARALLEL
   !$OMP MASTER
   do i = 1, 4
      !$OMP TASK FIRSTPRIVATE(i)
      print *, 'Hello World', omp_get_thread_num()
      !$OMP END TASK
   end do
   !$OMP TASKWAIT
   pause !Note CPU usage is high while we wait for the user to press enter
   !$OMP END MASTER
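   !Non-master threads skip the MASTER block and wait at this barrier while the master is paused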
   !$OMP BARRIER
   !$OMP END PARALLEL
end program

The CPU utilization reaches 0% using GCC, 6300% for NVIDIA OpenMP, and 2100% using Intel OpenMP.

# GFORTRAN
OMP_WAIT_POLICY=passive ./test.gnu     0% CPU, 127 watts
OMP_WAIT_POLICY=active  ./test.gnu  6300% CPU, 350 watts

# NVFORTRAN
OMP_WAIT_POLICY=passive MP_SPIN=0 \
./test.nv                           6300% CPU, 276 watts

OMP_WAIT_POLICY=passive ./test.nv   6300% CPU, 276 watts
OMP_WAIT_POLICY=active  ./test.nv   6300% CPU, 276 watts

# NVFORTRAN PRELOAD libiomp5

OMP_WAIT_POLICY=passive \
LD_PRELOAD=/home/mario/.local/lib/libiomp5.so \
./test.nv                           2100% CPU, 326 watts

OMP_WAIT_POLICY=passive KMP_BLOCKTIME=0 \
LD_PRELOAD=/home/mario/.local/lib/libiomp5.so \
./test.nv                           2100% CPU, 326 watts

OMP_WAIT_POLICY=active \
LD_PRELOAD=/home/mario/.local/lib/libiomp5.so \
./test.nv                           6300% CPU, 364 watts

Why does the passive policy have no effect with NVIDIA OpenMP? The GNU compilers work as one would expect, ensuring minimal power consumption for waiting threads. The test system idles at around 120 watts, with just one browser window (this forum page) running in the background.

Again, OMP_WAIT_POLICY only affects the behavior of threads in between parallel regions. This is a single region.

As for the GNU behavior, I’m not sure but will ask our OpenMP engineers if they have any idea on what they might be doing.

Is NVIDIA OpenMP accessing constant or read-only memory serially rather than allowing multiple readers simultaneously? I fixed the primes1.c demonstration; I had missed adding firstprivate(unset_bit). That reduced the time for primes1.nvc 1e10 -p >/dev/null to 14 seconds, from 21 seconds previously.

GCC and clang were not impacted, possibly because they access the constant array via a shared lock, allowing multiple readers. The variable unset_bit is defined in ../src/bits.h. Well, that explains why nvc was taking noticeably longer when printing primes: a later chunk that completes early must wait for prior chunks to output.
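
A minimal sketch of the kind of change described above (hypothetical code with assumed table values, not the actual primes1.c or bits.h): listing the small lookup table in firstprivate gives every thread its own copy, so the loop body never touches the shared file-scope array.

#include <stdio.h>

/* assumed layout: unset_bit[i] clears bit i of a byte, i.e. ~(1 << i) */
static unsigned char unset_bit[8] = { 0xfe, 0xfd, 0xfb, 0xf7,
                                      0xef, 0xdf, 0xbf, 0x7f };

int main(void) {
    unsigned char sieve[64];
    /* firstprivate(unset_bit): each thread starts with its own copy of the
       table instead of reading the shared global inside the hot loop */
    #pragma omp parallel for firstprivate(unset_bit)
    for (int i = 0; i < 64; i++) {
        sieve[i] = 0xff;
        sieve[i] &= unset_bit[i % 8];   /* clear one bit, standing in for sieving */
    }
    printf("%02x %02x\n", sieve[0], sieve[7]);
    return 0;
}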

Below, I captured strace -f output out of curiosity.

OMP_WAIT_POLICY=passive strace -f ./primes1.clang 1e9 -p >/dev/null 2>/tmp/oclang
grep " = 0$" /tmp/oclang | cut -c12- | sort | uniq -c | sort -rn | head

OMP_WAIT_POLICY=passive strace -f ./primes1.nvc 1e9 -p >/dev/null 2>/tmp/onvc
grep " = 0$" /tmp/onvc | cut -c12- | sort | uniq -c | sort -rn | head

OMP_WAIT_POLICY=passive strace -f ./primes1.gcc 1e9 -p >/dev/null 2>/tmp/ogcc
grep " = 0$" /tmp/ogcc | cut -c12- | sort | uniq -c | sort -rn | head
clang  870446  <... sched_yield resumed>)  = 0
          646  <... futex resumed>)        = 0     
          403  sched_yield()               = 0     

nvc    223547  <... sched_yield resumed>)  = 0
          361  sched_yield()               = 0     

gcc       352  <... futex resumed>)        = 0

OMP_WAIT_POLICY=passive ./primes1.clang 1e9 -p >/dev/null
Seconds: 1.288

OMP_WAIT_POLICY=passive ./primes1.nvc 1e9 -p >/dev/null
Seconds: 1.444

OMP_WAIT_POLICY=passive ./primes1.gcc 1e9 -p >/dev/null
Seconds: 1.045

How cool it will be when passive no longer consumes extra power for waiting threads. There are use cases for this; primes1.c and primes3.c are examples. They do chunking and move along until exhausting the segments or input. When printing primes, later chunks that complete early must wait for prior chunks to output their primes. It works well and fast using gcc. Why not nvc?