Counting FLOPS... again: how much does each operation count?

devkec · December 9, 2010, 10:34am

Hello,

This question was posted here before, but there was never a clear answer to it.

Let’s assume I have a CUDA kernel and I want to count the FLOPS of its inner loop. I look at it at the PTX level, because some optimizations have already been applied there.

Obviously add, sub and mul each count as 1 FLOP. But what about:

- fma ← should it count as 1 or 2 FLOPS?

- rsqrt.approx ← looking at CUDA Programming Guide 5.4.1, one might assume that it takes 8 times longer, so should I count it as 8?

- div.approx ← same as rsqrt: 1 or 8?

There may be further optimization after the PTX stage, but I have no way to decompile the result. Looking at the CUDA source is pointless, because many optimizations have already been applied by the time it reaches PTX.

devkec · December 9, 2010, 12:46pm

Oh, and one further question:

The integer addition in statements like:

ld.shared.f32 	%f354, [%rd42+28];

Is it handled by a memory management unit or computed by the CUDA cores?

paulius · December 12, 2010, 5:45am

Only floating-point operations defined by the algorithm count towards Flops/s. Integer and other instructions cannot be included when computing Flops/s for your application.
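As a rough illustration of that rule, here is a minimal Python sketch that tallies only floating-point arithmetic in a PTX listing and skips loads, integer ops and everything else. The weight table, the `count_flops` helper and every sample line except the `ld.shared.f32` one are assumptions made up for this example; the weights are a counting convention, not something the hardware defines.

```python
# Assumed convention: FLOPs credited per PTX opcode (hypothetical table).
DEFAULT_WEIGHTS = {"add": 1, "sub": 1, "mul": 1, "fma": 2, "mad": 2,
                   "div": 1, "rcp": 1, "sqrt": 1, "rsqrt": 1}
FLOAT_TYPES = {"f32", "f64"}


def count_flops(ptx, weights=None):
    """Return the weighted count of floating-point arithmetic in a PTX string."""
    weights = DEFAULT_WEIGHTS if weights is None else weights
    total = 0
    for line in ptx.splitlines():
        tokens = line.split()
        # Drop a leading guard predicate such as "@%p1".
        if tokens and tokens[0].startswith("@"):
            tokens = tokens[1:]
        if not tokens or tokens[0].startswith("//"):
            continue
        # "fma.rn.f32" -> opcode "fma", modifiers ["rn", "f32"].
        opcode, *modifiers = tokens[0].split(".")
        # Count only listed opcodes that operate on a floating-point type.
        if opcode in weights and FLOAT_TYPES.intersection(modifiers):
            total += weights[opcode]
    return total


# Hypothetical inner-loop fragment; only the first line is from the thread.
SAMPLE = """
ld.shared.f32    %f354, [%rd42+28];
fma.rn.f32       %f1, %f2, %f3, %f4;
add.s32          %r1, %r2, 1;
rsqrt.approx.f32 %f5, %f1;
mul.f32          %f6, %f5, %f354;
"""

print(count_flops(SAMPLE))                                   # rsqrt weighted 1
print(count_flops(SAMPLE, {**DEFAULT_WEIGHTS, "rsqrt": 8}))  # rsqrt weighted 8
```

The load and the integer add contribute nothing under either table; only the weight chosen for rsqrt changes the total.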

devkec · December 13, 2010, 6:35pm

Thanks for the reply. I still don’t know how to weight each floating-point operation, though.

Let’s count FMA as 2 operations.

Now look at “rsqrt”. If you count it as 1 operation, fine: you aren’t lying, because in the end you have a lower bound for your total FLOPS.

Now implement the “fast inverse square root” (which should really be called a reciprocal square root) that was used in Quake 3. It is as precise as rsqrt.approx and uses 8 floating-point operations. The program takes about 10% longer, but it has 20% more FLOPS.

Isn’t it fair to count rsqrt as 8 FLOPS?
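For reference, a minimal Python sketch of the Quake 3 style bit trick with a single Newton-Raphson step. The function name `fast_rsqrt` is made up for this example, and the refinement runs in Python's double precision rather than float32, so it only illustrates the structure and the operation count. As written, the step uses four multiplies and one subtract; the count of 8 in the post presumably refers to a different variant, which this sketch does not try to reproduce.

```python
import math
import struct


def fast_rsqrt(x):
    """Approximate 1/sqrt(x) for positive x that fits in a float32."""
    # Reinterpret the float32 bit pattern of x as an unsigned 32-bit integer.
    i = struct.unpack("<I", struct.pack("<f", x))[0]
    # Integer shift and subtract give the initial guess (no FLOPs here).
    i = 0x5F3759DF - (i >> 1)
    y = struct.unpack("<f", struct.pack("<I", i))[0]
    # One Newton-Raphson step: 4 multiplies and 1 subtract.
    return y * (1.5 - 0.5 * x * y * y)


for x in (0.01, 1.0, 2.0, 100.0):
    rel_err = abs(fast_rsqrt(x) * math.sqrt(x) - 1.0)
    print(f"x = {x}: relative error {rel_err:.2e}")
```

With one step, the relative error stays below roughly 0.2%, so it is worth comparing that against the accuracy documented for rsqrt.approx before treating the two as equivalent. Note also that if runtime grows by 10% while the counted FLOPS grow by 20%, the FLOPS-per-second figure rises by 1.2 / 1.1, about 9%, even though the program got slower.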

eelsen · December 13, 2010, 6:54pm

Once you start dealing with arbitrary functions, accurately counting flops becomes very difficult. For example, sqrt is just a special case of a^x where x = 0.5, so if we say sqrt is 1 flop, then that should be the case for arbitrary x. And how should one deal with the low-precision hardware versions of sin and cos versus the more exact but slower software implementations? Do they both count as 1?

And how should one deal with really nasty functions like Bessel or Gamma functions?

I think the only real answer is that flops are useful when you count instructions and compare to the theoretical maximum of the hardware you’re running on and they’re also useful when comparing implementations of the EXACT same algorithm (and even that can get dicey due to accuracy issues). Comparing flop counts of completely different algorithms is basically futile.
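A small Python sketch of that comparison against the theoretical maximum. The device figures, runtime and FLOP counts below are hypothetical values chosen only for illustration, and the peak formula assumes each core retires one FMA (counted as 2 flops) per cycle.

```python
def peak_gflops(cores, clock_ghz, flops_per_core_per_cycle=2):
    """Theoretical peak, assuming one FMA (2 flops) per core per cycle."""
    return cores * clock_ghz * flops_per_core_per_cycle


def achieved_gflops(flop_count, seconds):
    """Achieved rate for a measured run, in GFLOP/s."""
    return flop_count / seconds / 1e9


# Hypothetical device and measurement.
PEAK = peak_gflops(cores=448, clock_ghz=1.15)
RUNTIME_S = 0.05

# The same run, counted under two different weighting conventions.
for label, flops in [("rsqrt counted as 1", 4.0e9),
                     ("rsqrt counted as 8", 11.0e9)]:
    rate = achieved_gflops(flops, RUNTIME_S)
    print(f"{label}: {rate:.1f} GFLOP/s, {rate / PEAK:.1%} of peak")
```

The runtime is identical in both rows; only the counting convention moves the fraction of peak, which is why the figure is only meaningful when the convention is fixed and the algorithm is the same.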

seibert · December 14, 2010, 1:47am

Once you start asking the question “What counts as a FLOP?”, it’s time to step back and figure out what it is you actually want to know. Are you trying to estimate how fast a particular piece of code could potentially run?
