Counting FLOPs... again: how much does each operation count?


This question was posted here before, but there was never a clear answer to it.

Let’s assume there is some CUDA code and I want to count the FLOPs of its inner loop. I look at it at the PTX level, because some optimizations have already been applied by that stage.

Obviously add, sub and mul each count as 1 FLOP. But what about

    fma <- should it count as 1 or 2 FLOPs?

    rsqrt.approx <- looking at Section 5.4.1 of the CUDA Programming Guide, one might assume it takes about 8x longer, so should I count it as 8?

    div.approx <- same question as for rsqrt: 1 or 8?

There may be further optimization after the PTX stage, but I have no way to disassemble the final machine code. Counting at the CUDA source level is worthless, because many optimizations have already been applied by the PTX stage.

Oh, and one further question:

The integer address addition in statements like:

    ld.shared.f32 %f354, [%rd42+28];

is it handled by a memory management unit, or is it computed by the CUDA cores?

Only floating-point operations defined by the algorithm count toward FLOP/s. Integer and other instructions cannot be included when computing FLOP/s for your application.

Thanks for participating. Yet I still don’t know how to weight each floating-point operation.

Let’s count FMA as 2 operations.

Now look at “rsqrt”. If you count it as 1 operation, fine, you aren’t lying, because in the end you get a lower bound for your total FLOPs.

Now implement the “fast inverse square root” (which should really be called reciprocal square root) that was used in Quake 3. It is about as precise as rsqrt.approx and consists of 8 floating-point operations. The program takes about 10% longer, but it reports 20% more FLOPs.
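For reference, the classic Quake 3 routine looks roughly like this, with the floating-point operations counted in comments. The exact total depends on how many Newton iterations you run, which is presumably where the figure of 8 comes from; the bit-level trick itself is integer work and contributes no FLOPs:

```c
#include <stdint.h>
#include <string.h>

/* Quake 3 style fast inverse square root (sketch).
   Only the float arithmetic counts as FLOPs; the bit hack
   on the integer representation does not. */
float Q_rsqrt(float number)
{
    float x2 = number * 0.5f;          /* 1 FLOP (mul) */
    float y  = number;
    uint32_t i;
    memcpy(&i, &y, sizeof(i));         /* reinterpret bits: integer */
    i = 0x5f3759df - (i >> 1);         /* magic constant: integer */
    memcpy(&y, &i, sizeof(y));
    y = y * (1.5f - (x2 * y * y));     /* Newton step: 3 mul + 1 sub */
    y = y * (1.5f - (x2 * y * y));     /* second step: 4 more FLOPs */
    return y;
}
```

With one Newton iteration you get 5 FLOPs, with two you get 9; either way it is a handful of real floating-point instructions replacing a single rsqrt.approx.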

Isn’t it fair to count rsqrt as 8 FLOPs?

Once you start dealing with arbitrary functions, accurately counting FLOPs becomes very difficult. For example, sqrt is just a special case of a^x with x = 0.5, so if we say sqrt is 1 FLOP, that should also hold for arbitrary x. And how do we deal with the low-precision hardware versions of sin and cos vs. the more exact but slower software implementations? Do both count as 1?

How do we deal with really nasty functions like the Bessel or Gamma functions?

I think the only real answer is that FLOP counts are useful when you count instructions and compare against the theoretical maximum of the hardware you’re running on, and when comparing implementations of the EXACT same algorithm (and even that can get dicey due to accuracy issues). Comparing FLOP counts of completely different algorithms is basically futile.

Once you start asking the question “What counts as a FLOP?”, it’s time to step back and figure out what it is you actually want to know. Are you trying to estimate how fast a particular piece of code could potentially run?