Where do all the little FLOPS come from? Still don't understand the spec

Knaxkopp · February 19, 2007, 2:42pm

G80 has 128 fp32 ALUs at 1350MHz with MADD.

So it should crunch up to 256*1.35 GFLOPS (346 GFLOPS) if one feeds it right.
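
The arithmetic behind that figure, as a minimal sketch. It uses only the numbers quoted in this post and assumes every ALU retires one MADD per clock; the function name `peak_gflops` is made up for illustration.

```python
# Theoretical peak for G80 from the figures quoted in the post.
ALUS = 128           # fp32 ALUs
CLOCK_GHZ = 1.35     # 1350 MHz
FLOPS_PER_MADD = 2   # one multiply plus one add per MADD instruction


def peak_gflops(alus, clock_ghz, flops_per_cycle):
    """Peak GFLOPS, assuming each ALU issues one instruction per clock."""
    return alus * flops_per_cycle * clock_ghz


g80_peak = peak_gflops(ALUS, CLOCK_GHZ, FLOPS_PER_MADD)
print(f"{g80_peak:.1f} GFLOPS")  # prints "345.6 GFLOPS"
```

Rounded up, that is the 346 GFLOPS quoted above.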

But AFAIR the “GF 8800 GPU Technical Brief” document talks about 520 GFLOPS.
Is there some branch-unit or texture-unit ALU added and summed up?

Are these units usable by CUDA C code, or only available when one does
texture array access with some interpolation?

Sorry for all these questions about tech spec details. But I first have to convince
some people that CUDA/G80 is worth the time and effort to port some stuff to it.
Until now this has been regarded as “new toy stuff” by some people.

Greetings
Knax

Simon_Green · February 21, 2007, 3:28pm

You’re correct, the peak computation rate on G80 is about 346 GFLOPS.

The 520 GFLOPS number quoted in the technical brief includes some graphics-specific operations that are not directly accessible from CUDA (for example, framebuffer blending).

If you include texture lookups, which are available within CUDA, the actual number of floating point operations could be much higher.

mstock · February 22, 2007, 4:49pm

I mostly understand the difference between instructions, cycles, and floating-point operations, but the CUDA programming guide doesn’t tell the whole story. I would like to compare my application’s computing performance to the hardware optimum. I have an 8800GTX, so 346 GFLOPS is obviously optimal (for GPGPU), and GPUBench shows a maximum of 165 billion scalar instructions per second.

First, when counting FLOPS, I know that a MAD counts as 2 FLOPS within a single instruction. I also know that an exp is one instruction but must perform a small number of FLOPS. How many, though? On page 49 it states that most instructions take 2 cycles to issue, and that more complex instructions like floating-point reciprocal, exp, and sin take 16 cycles. Does that mean that I should count 8 or 16 FLOPS?

Secondly, do the single-precision versions of exp and sin require fewer cycles or count as fewer FLOPS? I see a 15% performance boost using those.

As it is, my app issues about 130 billion instructions per second, so I am about as close to optimum as I expect to get. Calculating cycles per second, though, I get something like 374 billion. What number can I compare that to? Counting memory speed times number of processors only gives 230 billion cycles per second (1.8G*128).
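
One way to frame the instruction-rate comparison, as a rough sketch only. The 130 and 165 billion figures come from this post; the theoretical ceiling is an assumption that each of the 128 ALUs issues one scalar instruction per 1.35 GHz clock, which is the same assumption behind the 346 GFLOPS peak.

```python
# Achieved vs. peak instruction rate, in billions of instructions per second.
achieved_ginstr = 130.0        # measured by the application (from the post)
gpubench_peak_ginstr = 165.0   # GPUBench maximum (from the post)

# Assumed ceiling: 128 ALUs, one scalar instruction per ALU per 1.35 GHz clock.
theoretical_peak_ginstr = 128 * 1.35

vs_gpubench = achieved_ginstr / gpubench_peak_ginstr
vs_theoretical = achieved_ginstr / theoretical_peak_ginstr

print(f"theoretical ceiling: {theoretical_peak_ginstr:.1f} Ginstr/s")
print(f"vs. GPUBench peak:   {vs_gpubench:.1%}")
print(f"vs. theoretical:     {vs_theoretical:.1%}")
```

Under these assumptions the application sits at roughly 79% of the GPUBench maximum and about 75% of the assumed ceiling.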

mstock · February 22, 2007, 4:58pm

There is more discussion of this topic at The Official NVIDIA Forums | NVIDIA, and I see that my counting of cycles is wrong. I am still curious about the FLOPS counts for the regular and single-precision transcendental function instructions.

eelsen · February 22, 2007, 9:29pm

Our practice when counting flops for publications, Folding@home numbers, etc. has been to count divisions, sqrts, and even transcendental functions as 1 flop. So we would count both regular and reduced-precision functions as the same number of flops (1). This makes it much easier to compare performance across different hardware and is also a much more “honest” representation of how much useful work is actually being done.

For example if you need sin(x), that’s really one operation, regardless of how the hardware actually comes up with that value. If ATI does it differently than NVIDIA, do you think the number of flops should change?

This leads to us counting a reciprocal sqrt as 2 flops (even though it's one operation) and a sqrt as 1 flop (even though it's two operations). But I think this makes sense; it's really a quirk of the hardware that it has to calculate the reciprocal sqrt and then take the reciprocal of that to give you just the sqrt.
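
A small sketch of that counting convention. The weights for div, sqrt, transcendentals, and rsqrt follow this post, and MAD = 2 follows the earlier post; the weights for plain add and mul, the `count_flops` helper, and the instruction mix are assumptions for illustration.

```python
# Flops charged per operation under the convention described above.
FLOP_WEIGHTS = {
    "add": 1,    # assumed: a plain add is one flop
    "mul": 1,    # assumed: a plain multiply is one flop
    "mad": 2,    # multiply-add: two flops in one instruction
    "div": 1,
    "sqrt": 1,   # one flop, even if the hardware needs two operations
    "sin": 1,    # transcendentals: one flop, full or reduced precision
    "exp": 1,
    "rsqrt": 2,  # counted as a sqrt plus a reciprocal
}


def count_flops(op_counts):
    """Sum the flops for a dict mapping operation name to occurrence count."""
    return sum(FLOP_WEIGHTS[op] * n for op, n in op_counts.items())


# Hypothetical per-element instruction mix of some kernel.
kernel_ops = {"mad": 10, "rsqrt": 1, "exp": 1}
flops_per_element = count_flops(kernel_ops)
print(flops_per_element)  # prints 23
```

The count depends only on the mathematical operations requested, not on how many cycles a particular chip spends on each.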

mstock · February 22, 2007, 9:46pm

Thanks! That’s something that I hadn’t seen written anywhere.

That method of counting may lead to flop inflation when a researcher decides to go with an analytic function instead of a transcendental one. I’ve seen far more clever schemes to boost reported performance numbers.

Knaxkopp · February 23, 2007, 9:58am

This whole MIPS and FLOPS stuff is IMHO rather misleading. Different
instructions need different numbers of cycles. It depends on your application
whether you need a lot of cheap MADs or more expensive operations. This is why
real-life problems are a better way to measure performance.

Also in practice the compiler is the second part measured beside the
processing unit, as most stuff is not hand optimized assembly code
but some portable Fortran or C/C++ code.

So perhaps we should first port Linpack to CUDA; then we can organize
a LAN party with lots of GF8800s, connect them, and try to get into
the TOP500 with this ad hoc supercomputer. Position 500 is held
by a cluster of 800 Xeons at 3.06 GHz with 2.7 (real) TFLOPS and 4.8
theoretical TFLOPS (makes 6 GFLOPS max per Xeon).

So with just fourteen 8800GTX you would get the same theoretical TFLOPS.
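
The back-of-the-envelope numbers can be checked with a short sketch. The cluster figures are the ones quoted in this post; the per-card peak of 345.6 GFLOPS (128 ALUs × 2 flops × 1.35 GHz) is the theoretical fp32 value from earlier in the thread, so this compares theoretical peaks only.

```python
import math

# Figures quoted in the post for the TOP500 entry at position 500.
cluster_peak_gflops = 4800.0   # 4.8 theoretical TFLOPS
xeon_count = 800

# Theoretical fp32 peak of one 8800GTX, from earlier in the thread.
g80_peak_gflops = 128 * 2 * 1.35

per_xeon_gflops = cluster_peak_gflops / xeon_count
cards_needed = math.ceil(cluster_peak_gflops / g80_peak_gflops)
combined_gflops = cards_needed * g80_peak_gflops

print(per_xeon_gflops)           # prints 6.0
print(cards_needed)              # prints 14
print(f"{combined_gflops:.1f}")  # prints 4838.4
```

Thirteen cards would fall just short (about 4.49 TFLOPS), so fourteen is the smallest count that matches the cluster's theoretical peak.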

But maybe they only count fp64 :">

Wouter_Wiggers · February 23, 2007, 10:01am

lol, actually that is a very cool idea to do :D

Knaxkopp · February 23, 2007, 10:40am

I read the TOP500 FAQ. Linpack for fp64 is used and they don’t like
configs that are only made to get into the list. It must be a system in
real productive use… and I think physics for a distributed game at a
LAN party does not count as real productive use :(
