How to calculate the total number of FOP and floating-point performance of special operations(exp sin sqrt)?

Sherry818 · December 25, 2016, 2:15pm

When measuring an algorithm, if there are division operations, how to calculate the total number of FOP and floating-point performance?

For example, n2 matrix multiplication, the calculation of n3 * 2flops (a multiplication, an addition), assuming that using the same data set n2, we change multiplication operations of the matrix multiplication into the division operations, how to calculate flops. Is it the same with the result of matrix multiplication?

njuffa · December 25, 2016, 5:48pm

There are different philosophies about this:

(1) Don’t count floating-point operations. It is a meaningless exercise, report application performance instead, in a manner relevant to the use case.

(2) Count basic arithmetic operations: a division or square root is one operation, just like multiplication and addition are one operation. Used, for example, by the Sandia Benchmark, according to this 1987 NASA memo ([url]https://ntrs.nasa.gov/archive/nasa/casi.ntrs.nasa.gov/19870008008.pdf[/url]).

(3) Use magic weights attached to each operation. E.g. Add, subtract, comparison, multiply: 1 flop; division or square root 4 flop; sine or exponential 8 flop. Used in this paper, for example: [url]Mobile Agents Based Load Balancing Method for Parallel Applications | IEEE Conference Publication | IEEE Xplore. Best I know these magic weights have been used since the 1970s, but I have never seen a justification for the numbers picked. The NASA memo already mentioned states that these weights are used by the LLL (Lawrence Livermore Loops) benchmark.

(4) Disassemble the binary code to actually count floating-point instructions, or use hardware profiling counters to do so. The question remains what qualifies as an operation, e.g. is a conversion instruction counted, is an FMA one or two floating-point operations?

I am firmly in camp (1). On one platform, division may be a single hardware instruction, on another platform if may map to lengthy emulation code including many individual floating-point instructions. With special function units in GPUs, this may extend beyond algebraic functions to transcendental ones. How should one account for that?

Note that benchmarks like HPC Linpack simply assume a certain number of floating-point operations for a given-size matrix (Linpack: 2/3 * n3 + 2 * n2), so the FLOPS reported are actually simply based on the inverse of the run time.

Sherry818 · December 26, 2016, 9:14am

Thanks for your detailed reply and the evidence with paper and website.

However, I still have some puzzles.

How to calculate the number of floating-point operations. Why is it (2/3 * n3 + 2 * n2) for a given-size matrix of LINPACK?
I check the implementation of exp and sqrt online. There are to many methods, such as Newton iteration method and dichotomy. How to implement exp and sqrt and how to calculate the number of floating-point operations of exp and sqrt in CPU and GPU? Do they also use the counting method in you mentioned (3) to calculate the number of floating-point operations of exp and sqrt for CPU, GPU and LINPACK?

Thank you so much!

njuffa · December 26, 2016, 10:29pm

[1] As far as I understand the number of floating-point operations in LINPACK is calculated as the asymptotic number of floating-point adds plus floating-point multiplies, assuming the system is solved by LU decomposition, where LU decomposition is derived from Gauss elimination. So these are “abstract” or “hypothetical” floating-point operation counts only, simply used as a scale factor for the execution time based on matrix size.

[2] As I stated, you can count the floating-point operations for more complex operations like sqrt or exp in three different ways. Which one you chose is completely up to you.

Precisely because the implementation of even fairly basic math operations (like division) can differ quite widely among platforms in the number of machine instructions that are sent to the FPU does “flop counting” not make sense, in my opinion. Frankly, I consider it junk science, and advise strongly not to use it. Focus on appropriate application-specific performance metrics instead, such as “simulated time per day of wall-clock processing” (in molecular dynamics), or “1Kx1K images processed per second” (for real-time feature recognition).

Topic		Replies	Views
actual FLOP count versus flops_dp metric CUDA Programming and Performance	4	961	August 21, 2024
Benchmarking a program What is the best option for finding the FLOP for a given thread? CUDA Programming and Performance	10	1211	August 21, 2010
instruction or operation CUDA Programming and Performance	16	3359	March 28, 2019
Counting FLOPS...again how much does each operation count? CUDA Programming and Performance	5	17050	December 14, 2010
Floating point operations by nvprof CUDA Programming and Performance	3	835	October 17, 2018
Floating Point Operations per SEC calculation CUDA Programming and Performance	4	735	September 24, 2021
FLOP count CUDA Programming and Performance	3	6663	December 10, 2008
Counting flops what's in and what's out? CUDA Programming and Performance	0	1785	June 9, 2012
Total FLOPS for the HGMMA instruction Nsight Compute	4	721	March 21, 2024
How to figure out a total number of double operations in cuFFT::cufftExecZ2Z on a device? Legacy PGI Compilers	2	657	February 1, 2022

How to calculate the total number of FOP and floating-point performance of special operations(exp sin sqrt)?

Related topics