The OpenMP power consumption test uses the -p
argument to primes1
or primes3,
which involves ordered output: one thread writes output at a time while the other threads wait their turn. I expect the waiting threads to be idle, or at least to consume very little CPU. That is not the case: I see the full 6400% CPU utilization (AMD Threadripper 3970X, 64 logical CPU threads) just for printing prime numbers to /dev/null. Compare that with GNU GCC, which consumes only 173% for the same test.
gcc -o primes1.gcc -O3 -fopenmp -I../src primes1.c -lm
clang -o primes1.clang -O3 -fopenmp -I../src primes1.c -lm
nvc -o primes1.nvc -O3 -mp=multicore -I../src primes1.c -lm
gcc -o primes3.gcc -O3 -fopenmp -I../src primes3.c -L/usr/local/lib64 -lprimesieve -lm
clang -o primes3.clang -O3 -fopenmp -I../src primes3.c -L/usr/local/lib64 -lprimesieve -lm
nvc -o primes3.nvc -O3 -mp=multicore -I../src primes3.c -L/usr/local/lib64 -lprimesieve -lm
OpenMP Ordered Power Consumption Test
Threadripper 3970X idle (browser NV forums page) 120 watts
./primes1.gcc 1e10 -p >/dev/null 10.173 secs, 201 watts
./primes1.clang 1e10 -p >/dev/null 12.729 secs, 288 watts
./primes1.nvc 1e10 -p >/dev/null 21.346 secs, 322 watts
./primes3.gcc 1e10 -p >/dev/null 7.092 secs, 181 watts
./primes3.clang 1e10 -p >/dev/null 8.876 secs, 274 watts
./primes3.nvc 1e10 -p >/dev/null 11.080 secs, 361 watts
OpenMP Performance Test
Threadripper 3970X idle (browser NV forums page) 120 watts
./primes1.gcc 1e12 16.168 secs, 399 watts
./primes1.clang 1e12 16.274 secs, 395 watts
./primes1.nvc 1e12 14.780 secs, 393 watts
./primes3.gcc 1e12 5.762 secs, 437 watts
./primes3.clang 1e12 6.277 secs, 434 watts
./primes3.nvc 1e12 5.755 secs, 442 watts
Am I wrong to wish for the waiting threads to be idle until signaled that the lock is available? The GNU GCC build passes this test, consuming much less power in the ordered block.
Needless to say, I’m not in favor of mutex spin-loops, if that is the reason for the high power consumption in NVIDIA's OpenMP ordered. Consider thousands or millions of compute nodes (including cloud instances) built with the NVIDIA HPC compilers, running ordered or exclusive blocks. Does that mean cloud customers pay for extra power consumption simply because threads are waiting their turn?
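In case it helps others reproduce or mitigate this, the spin-vs-sleep behavior of waiting threads is usually tunable through runtime environment variables. OMP_WAIT_POLICY is standard OpenMP; KMP_BLOCKTIME is an LLVM/Intel runtime extension, and whether the nvc runtime honors either of these for the ordered construct is an assumption on my part, not something I have verified:

```shell
# Ask the OpenMP runtime to put waiting threads to sleep rather than spin.
# OMP_WAIT_POLICY=passive is part of the OpenMP spec; how aggressively a
# given runtime applies it to ordered regions is implementation-defined.
OMP_WAIT_POLICY=passive ./primes1.clang 1e10 -p >/dev/null

# LLVM/Intel runtimes also expose KMP_BLOCKTIME: the time a thread spins
# before sleeping. 0 means sleep immediately after finishing its work.
KMP_BLOCKTIME=0 ./primes1.clang 1e10 -p >/dev/null
```

If these knobs do lower the wattage for the clang and nvc builds, that would confirm the extra power is spent in the runtime's wait loops rather than in useful work.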
I’m hoping the NVIDIA compiler engineers can resolve this.
I first witnessed the power consumption issue using Codon, and will be submitting an issue ticket to LLVM as well.