How to calculate the total number of FOP and floating-point performance of special operations(exp sin sqrt)?

When measuring an algorithm, if there are division operations, how to calculate the total number of FOP and floating-point performance?

For example, n2 matrix multiplication, the calculation of n3 * 2flops (a multiplication, an addition), assuming that using the same data set n2, we change multiplication operations of the matrix multiplication into the division operations, how to calculate flops. Is it the same with the result of matrix multiplication?

There are different philosophies about this:

(1) Don’t count floating-point operations. It is a meaningless exercise, report application performance instead, in a manner relevant to the use case.

(2) Count basic arithmetic operations: a division or square root is one operation, just like multiplication and addition are one operation. Used, for example, by the Sandia Benchmark, according to this 1987 NASA memo ([url][/url]).

(3) Use magic weights attached to each operation. E.g. Add, subtract, comparison, multiply: 1 flop; division or square root 4 flop; sine or exponential 8 flop. Used in this paper, for example: [url]Mobile Agents Based Load Balancing Method for Parallel Applications | IEEE Conference Publication | IEEE Xplore. Best I know these magic weights have been used since the 1970s, but I have never seen a justification for the numbers picked. The NASA memo already mentioned states that these weights are used by the LLL (Lawrence Livermore Loops) benchmark.

(4) Disassemble the binary code to actually count floating-point instructions, or use hardware profiling counters to do so. The question remains what qualifies as an operation, e.g. is a conversion instruction counted, is an FMA one or two floating-point operations?

I am firmly in camp (1). On one platform, division may be a single hardware instruction, on another platform if may map to lengthy emulation code including many individual floating-point instructions. With special function units in GPUs, this may extend beyond algebraic functions to transcendental ones. How should one account for that?

Note that benchmarks like HPC Linpack simply assume a certain number of floating-point operations for a given-size matrix (Linpack: 2/3 * n3 + 2 * n2), so the FLOPS reported are actually simply based on the inverse of the run time.

Thanks for your detailed reply and the evidence with paper and website.

However, I still have some puzzles.

  1. How to calculate the number of floating-point operations. Why is it (2/3 * n3 + 2 * n2) for a given-size matrix of LINPACK?
  2. I check the implementation of exp and sqrt online. There are to many methods, such as Newton iteration method and dichotomy. How to implement exp and sqrt and how to calculate the number of floating-point operations of exp and sqrt in CPU and GPU? Do they also use the counting method in you mentioned (3) to calculate the number of floating-point operations of exp and sqrt for CPU, GPU and LINPACK?

Thank you so much!

[1] As far as I understand the number of floating-point operations in LINPACK is calculated as the asymptotic number of floating-point adds plus floating-point multiplies, assuming the system is solved by LU decomposition, where LU decomposition is derived from Gauss elimination. So these are “abstract” or “hypothetical” floating-point operation counts only, simply used as a scale factor for the execution time based on matrix size.

[2] As I stated, you can count the floating-point operations for more complex operations like sqrt or exp in three different ways. Which one you chose is completely up to you.

Precisely because the implementation of even fairly basic math operations (like division) can differ quite widely among platforms in the number of machine instructions that are sent to the FPU does “flop counting” not make sense, in my opinion. Frankly, I consider it junk science, and advise strongly not to use it. Focus on appropriate application-specific performance metrics instead, such as “simulated time per day of wall-clock processing” (in molecular dynamics), or “1Kx1K images processed per second” (for real-time feature recognition).