There are different philosophies about this:

(1) Don’t count floating-point operations. It is a meaningless exercise, report application performance instead, in a manner relevant to the use case.

(2) Count basic arithmetic operations: a division or square root counts as one operation, just like a multiplication or an addition. Used, for example, by the Sandia Benchmark, according to this 1987 NASA memo (https://ntrs.nasa.gov/archive/nasa/casi.ntrs.nasa.gov/19870008008.pdf).

(3) Use magic weights attached to each operation. E.g. add, subtract, compare, multiply: 1 flop; divide or square root: 4 flops; sine or exponential: 8 flops. Used in this paper, for example: http://ieeexplore.ieee.org/document/1593441/. As best I know, these magic weights have been used since the 1970s, but I have never seen a justification for the numbers picked. The NASA memo already mentioned states that these weights are used by the LLL (Lawrence Livermore Loops) benchmark.
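To make approach (3) concrete, here is a minimal sketch of weighted flop counting using the magic weights above. The weight table matches the numbers quoted; the operation tallies in the example are hypothetical.

```python
# Magic weights as quoted above: add/sub/cmp/mul = 1, div/sqrt = 4,
# sin/exp = 8. No justification for these numbers is known to me.
WEIGHTS = {"add": 1, "sub": 1, "cmp": 1, "mul": 1,
           "div": 4, "sqrt": 4, "sin": 8, "exp": 8}

def weighted_flops(op_counts):
    """Total weighted flop count for a dict of {op_name: count}."""
    return sum(WEIGHTS[op] * n for op, n in op_counts.items())

# Hypothetical kernel: 2 multiplies, 1 add, 1 divide per iteration,
# run for 1000 iterations.
print(weighted_flops({"mul": 2000, "add": 1000, "div": 1000}))  # 7000
```

Note how the single divide per iteration contributes more to the total (4000) than the two multiplies (2000), which is exactly the kind of arbitrary accounting that approach (1) objects to.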

(4) Disassemble the binary code to actually count floating-point instructions, or use hardware profiling counters to do so. The question remains what qualifies as an operation: e.g., is a conversion instruction counted, and is an FMA one or two floating-point operations?
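The FMA question alone changes the reported rate by a factor of two. A small sketch, with hypothetical instruction counts and timing:

```python
# The same kernel, timed once, reports a 2x different rate depending
# on whether an FMA is counted as one or two floating-point
# operations. The counts and time below are made up for illustration.
def gflops(fma_count, seconds, flops_per_fma):
    """Reported rate in GFLOP/s for a kernel consisting only of FMAs."""
    return fma_count * flops_per_fma / seconds / 1e9

fmas = 4_000_000_000   # FMA instructions retired (hypothetical)
t = 2.0                # run time in seconds (hypothetical)

print(gflops(fmas, t, flops_per_fma=1))  # 2.0
print(gflops(fmas, t, flops_per_fma=2))  # 4.0
```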

I am firmly in camp (1). On one platform, division may be a single hardware instruction; on another platform it may map to lengthy emulation code comprising many individual floating-point instructions. With special function units in GPUs, this may extend beyond algebraic functions to transcendental ones. How should one account for that?

Note that benchmarks like HPC Linpack simply assume a certain number of floating-point operations for a matrix of a given size (Linpack: 2/3 * n**3 + 2 * n**2), so the FLOPS reported are effectively just the inverse of the run time, scaled by a fixed constant.
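The Linpack convention above can be sketched in a few lines; the problem size and run time in the example are hypothetical.

```python
def linpack_flops(n, seconds):
    """FLOPS rate under the Linpack convention: a fixed nominal
    operation count for an n x n solve, divided by run time. For a
    given n, the rate is purely the inverse of the run time times
    a constant, regardless of what the hardware actually executed."""
    nominal_ops = (2.0 / 3.0) * n**3 + 2.0 * n**2
    return nominal_ops / seconds

# Hypothetical run: n = 1000 solved in 0.5 s.
print(linpack_flops(1000, 0.5))
```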