OpenMP excessive power consumption for waiting threads

The OpenMP power consumption test uses the -p argument to primes1 or primes3, which involves ordered output, i.e. one thread writing output at a time while the other threads wait their turn. I expect the waiting threads to be idle or to consume little CPU. That is not the case: I see the full 6400% CPU utilization (AMD Threadripper 3970X, 64 logical CPU threads) just for printing prime numbers to /dev/null. Nothing like GNU GCC, which consumes just 173% for the same test.
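
For reference, here is a minimal sketch of the ordered-output pattern the -p option exercises (hypothetical code, not the actual primes1.c or primes3.c source):

#include <stdio.h>

int main(void) {
    long nchunks = 64;
    /* each thread sieves one chunk at a time; the ordered block forces the
       chunks to be printed in ascending order, so a thread that finishes a
       later chunk early must wait for earlier chunks to print */
    #pragma omp parallel for ordered schedule(static, 1)
    for (long chunk = 0; chunk < nchunks; chunk++) {
        /* ... sieve primes for this chunk into a local buffer ... */
        #pragma omp ordered
        {
            /* ... write this chunk's primes to stdout ... */
            printf("chunk %ld\n", chunk);
        }
    }
    return 0;
}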

Prime Demos

gcc -o primes1.gcc -O3 -fopenmp -I../src primes1.c -lm
clang -o primes1.clang -O3 -fopenmp -I../src primes1.c -lm
nvc -o primes1.nvc -O3 -mp=multicore -I../src primes1.c -lm

gcc -o primes3.gcc -O3 -fopenmp -I../src primes3.c -L/usr/local/lib64 -lprimesieve -lm
clang -o primes3.clang -O3 -fopenmp -I../src primes3.c -L/usr/local/lib64 -lprimesieve -lm
nvc -o primes3.nvc -O3 -mp=multicore -I../src primes3.c -L/usr/local/lib64 -lprimesieve -lm

OpenMP Ordered Power Consumption Test

Threadripper 3970X idle (browser NV forums page)  120 watts

./primes1.gcc   1e10 -p >/dev/null   10.173 secs, 201 watts
./primes1.clang 1e10 -p >/dev/null   12.729 secs, 288 watts
./primes1.nvc   1e10 -p >/dev/null   21.346 secs, 322 watts

./primes3.gcc   1e10 -p >/dev/null    7.092 secs, 181 watts
./primes3.clang 1e10 -p >/dev/null    8.876 secs, 274 watts
./primes3.nvc   1e10 -p >/dev/null   11.080 secs, 361 watts

OpenMP Performance Test

Threadripper 3970X idle (browser NV forums page)  120 watts

./primes1.gcc   1e12                 16.168 secs, 399 watts
./primes1.clang 1e12                 16.274 secs, 395 watts
./primes1.nvc   1e12                 14.780 secs, 393 watts

./primes3.gcc   1e12                  5.762 secs, 437 watts
./primes3.clang 1e12                  6.277 secs, 434 watts
./primes3.nvc   1e12                  5.755 secs, 442 watts

Am I wrong to wish for the waiting threads to be idle until signaled that the mutex lock is available? The GNU GCC compiler passes, consuming much less power for the ordered block.

Needless to say, I’m not in favor of mutex spin-loops if that is the reason for the high power consumption with NVIDIA OpenMP ordered. What about thousands or millions of compute nodes (also cloud) using the NVIDIA HPC compilers, running ordered or exclusive blocks? Does that mean cloud customers pay for extra power consumption simply because threads are waiting their turn?

I’m hoping the NVIDIA compiler engineers can resolve this.

I first witnessed the power consumption issue using Codon and will be submitting an issue ticket for LLVM.

I created an issue ticket for LLVM.

Hi Mario,

Yes, OMP_WAIT_POLICY is set to “active” by default, meaning threads will enter a spin lock when waiting. Setting OMP_WAIT_POLICY to “passive” will have them sleep when waiting.

Another environment variable to consider is MP_SPIN, which sets the number of times each thread checks the mutex lock before sleeping. This is similar to GNU’s GOMP_SPINCOUNT. Our default is 1,000,000, but you can lower it by setting MP_SPIN.

Note that setting OMP_WAIT_POLICY to passive or MP_SPIN to a small value can have a detrimental impact on performance, as the cost of waking the threads can be high. The exact impact will depend on how often the threads need to be woken.

Hope this helps,
Mat

Using nvc, I was unable to get the waiting threads to minimize power consumption the way GNU gcc’s implementation does.

OMP_WAIT_POLICY – How waiting threads are handled in GNU GCC

" Description:

Specifies whether waiting threads should be active or passive. If the value is PASSIVE, waiting threads should not consume CPU power while waiting; while the value is ACTIVE specifies that they should. If undefined, threads wait actively for a short time before waiting passively. "

The GNU GCC implementation works well and behaves as described.

                        ./primes1.gcc 1e10 -p >/dev/null   172% CPU Utilization
OMP_WAIT_POLICY=passive ./primes1.gcc 1e10 -p >/dev/null   133%
OMP_WAIT_POLICY=active  ./primes1.gcc 1e10 -p >/dev/null  6400%

This was a misunderstanding on my part of your original question. OMP_WAIT_POLICY has to do with what happens to the threads between OpenMP regions. Here all the compute and prints occur within a single parallel region, so setting it to passive wouldn’t have any effect. (I did confirm that OMP_WAIT_POLICY works as expected between parallel regions.)
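
A minimal sketch of that distinction (hypothetical code, not the primes demos): in between_regions the team idles between two parallel regions, which is the situation OMP_WAIT_POLICY governs per the explanation above; in inside_one_region the threads wait at the ordered construct inside a single region while another thread prints.

#include <stdio.h>
#include <unistd.h>

void between_regions(void) {
    #pragma omp parallel
    { /* ... first region ... */ }

    sleep(5);   /* the team idles here, in between the two regions */

    #pragma omp parallel
    { /* ... second region ... */ }
}

void inside_one_region(void) {
    #pragma omp parallel for ordered schedule(static, 1)
    for (int i = 0; i < 64; i++) {
        #pragma omp ordered
        printf("%d\n", i);   /* threads wait their turn inside the region */
    }
}

int main(void) {
    between_regions();
    inside_one_region();
    return 0;
}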

When I run both nvc and gcc, I don’t see a difference in the CPU%:

With nvc:
% env OMP_NUM_THREADS=2 numactl -C 0-1 time nvc.out 1e10 -p > /dev/null
Seconds: 13.513
26.97user 0.01system 0:13.51elapsed 199%CPU (0avgtext+0avgdata 11344maxresident)k
0inputs+0outputs (0major+1933minor)pagefaults 0swaps

With gcc:
% env OMP_NUM_THREADS=2 numactl -C 0-1 time gcc.out 1e10 -p > /dev/null
Seconds: 9.931
19.80user 0.02system 0:09.93elapsed 199%CPU (0avgtext+0avgdata 2980maxresident)k
0inputs+0outputs (0major+1541minor)pagefaults 0swaps

Though I do see the % drop when setting “passive” with gcc:

% env OMP_NUM_THREADS=2 numactl -C 0-1 time gcc.out 1e10 -p > /dev/null
Seconds: 10.314
13.71user 0.02system 0:10.31elapsed 133%CPU (0avgtext+0avgdata 2992maxresident)k

My guess is that they are only using one thread to execute the ordered region, as opposed to nvomp, which has each thread wait its turn to print.

I’ll need to talk with our OpenMP engineers to see if this behavior would be something we’d want to replicate.

I’m also wondering if the power difference is more due to the longer run time caused by the I/O issue when redirecting output from an ordered region that you reported earlier, i.e. TPR#34995?

Each thread computes primes in parallel for a segment. With the -p argument, each thread outputs its primes in an orderly fashion.

Try running on a large box with OMP_WAIT_POLICY=passive and compare nvc vs. gcc. Take note of the total CPU utilization. This issue request is about the NVIDIA OpenMP implementation consuming unnecessary power compared to gcc. The same is true of clang. High CPU utilization equates to higher power consumption.

In other words, with OMP_WAIT_POLICY=passive each waiting thread still consumes close to 100% of a CPU. The issue is more noticeable when running more threads. Is this unexpected behavior for the passive policy?

Using Intel’s OpenMP library libiomp5 (default OMP_WAIT_POLICY=passive), I see CPU utilization decrease gradually as threads complete processing, though it still does not reach GCC’s low CPU utilization for waiting threads.

# NVC
./primes1.nvc 1e10 -p >/dev/null
Seconds: 21.616

LD_PRELOAD=/home/mario/miniconda3/envs/mandel/lib/libiomp5.so \
./primes1.nvc 1e10 -p >/dev/null
Seconds: 14.971

# GCC (low CPU utilization, low power consumption)
./primes1.gcc 1e10 -p >/dev/null
Seconds: 10.197

I found an example on the web that involves no I/O.

program Console3
   use omp_lib
   implicit none
   integer i
   !$OMP PARALLEL
   !$OMP MASTER
   do i = 1, 4
      !$OMP TASK FIRSTPRIVATE(i)
      print *, 'Hello World', omp_get_thread_num()
      !$OMP END TASK
   end do
   !$OMP TASKWAIT
   pause !Note CPU usage is high while we wait for the user to press enter
   !$OMP END MASTER
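   !Non-master threads skip the MASTER block and wait at this barrier while the master is paused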
   !$OMP BARRIER
   !$OMP END PARALLEL
end program

The CPU utilization reaches 0% using GCC, 6300% for NVIDIA OpenMP, and 2100% using Intel OpenMP.

# GFORTRAN
OMP_WAIT_POLICY=passive ./test.gnu     0% CPU, 127 watts
OMP_WAIT_POLICY=active  ./test.gnu  6300% CPU, 350 watts

# NVFORTRAN
OMP_WAIT_POLICY=passive MP_SPIN=0 \
./test.nv                           6300% CPU, 276 watts

OMP_WAIT_POLICY=passive ./test.nv   6300% CPU, 276 watts
OMP_WAIT_POLICY=active  ./test.nv   6300% CPU, 276 watts

# NVFORTRAN PRELOAD libiomp5

OMP_WAIT_POLICY=passive \
LD_PRELOAD=/home/mario/.local/lib/libiomp5.so \
./test.nv                           2100% CPU, 326 watts

OMP_WAIT_POLICY=passive KMP_BLOCKTIME=0 \
LD_PRELOAD=/home/mario/.local/lib/libiomp5.so \
./test.nv                           2100% CPU, 326 watts

OMP_WAIT_POLICY=active \
LD_PRELOAD=/home/mario/.local/lib/libiomp5.so \
./test.nv                           6300% CPU, 364 watts

Why does the passive policy have no effect with NVIDIA OpenMP? The GNU compilers work as one would expect, ensuring minimal power consumption for waiting threads. The test system idles at around 120 watts, with just one browser window (this forum page) running in the background.

Again, OMP_WAIT_POLICY only affects the behavior of threads in between parallel regions. This is a single region.

As for the GNU behavior, I’m not sure but will ask our OpenMP engineers if they have any idea on what they might be doing.

Is NVIDIA OpenMP accessing constant or read-only memory serially rather than allowing multiple readers simultaneously? I fixed the primes1.c demonstration; I had missed adding firstprivate(unset_bit). That reduced the time for primes1.nvc 1e10 -p >/dev/null to 14 seconds, from 21 seconds previously.

GCC and clang were not impacted, possibly because they access the constant array via a shared lock, allowing multiple readers. The variable unset_bit is defined in ../src/bits.h. Well, that explains why nvc was taking noticeably longer when printing primes: a later chunk that completes early must wait for prior chunks to output.
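
A minimal sketch of the kind of change described above (hypothetical code with assumed table values, not the actual primes1.c or bits.h): listing the small lookup table in firstprivate gives every thread its own copy, so the loop body never touches the shared file-scope array.

#include <stdio.h>

/* assumed layout: unset_bit[i] clears bit i of a byte, i.e. ~(1 << i) */
static unsigned char unset_bit[8] = { 0xfe, 0xfd, 0xfb, 0xf7,
                                      0xef, 0xdf, 0xbf, 0x7f };

int main(void) {
    unsigned char sieve[64];
    /* firstprivate(unset_bit): each thread starts with its own copy of the
       table instead of reading the shared global inside the hot loop */
    #pragma omp parallel for firstprivate(unset_bit)
    for (int i = 0; i < 64; i++) {
        sieve[i] = 0xff;
        sieve[i] &= unset_bit[i % 8];   /* clear one bit, standing in for sieving */
    }
    printf("%02x %02x\n", sieve[0], sieve[7]);
    return 0;
}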

Below, I captured strace -f output out of curiosity.

OMP_WAIT_POLICY=passive strace -f ./primes1.clang 1e9 -p >/dev/null 2>/tmp/oclang
grep " = 0$" /tmp/oclang | cut -c12- | sort | uniq -c | sort -rn | head

OMP_WAIT_POLICY=passive strace -f ./primes1.nvc 1e9 -p >/dev/null 2>/tmp/onvc
grep " = 0$" /tmp/onvc | cut -c12- | sort | uniq -c | sort -rn | head

OMP_WAIT_POLICY=passive strace -f ./primes1.gcc 1e9 -p >/dev/null 2>/tmp/ogcc
grep " = 0$" /tmp/ogcc | cut -c12- | sort | uniq -c | sort -rn | head
clang  870446  <... sched_yield resumed>)  = 0
          646  <... futex resumed>)        = 0     
          403  sched_yield()               = 0     

nvc    223547  <... sched_yield resumed>)  = 0
          361  sched_yield()               = 0     

gcc       352  <... futex resumed>)        = 0

OMP_WAIT_POLICY=passive ./primes1.clang 1e9 -p >/dev/null
Seconds: 1.288

OMP_WAIT_POLICY=passive ./primes1.nvc 1e9 -p >/dev/null
Seconds: 1.444

OMP_WAIT_POLICY=passive ./primes1.gcc 1e9 -p >/dev/null
Seconds: 1.045

How cool it will be when passive no longer consumes extra power for waiting threads. There are use cases for this; primes1.c and primes3.c are examples. They do chunking and move along until exhausting the segments or input. When printing primes, later chunks that complete early must wait for prior chunks to output their primes. It works well and fast using gcc. Why not nvc?