Where do all the little FLOPS come from? Still don't understand the spec

G80 has 128 fp32 ALUs at 1350MHz with MADD.

So it should crunch up to 256*1.35 GFLOPS (346 GFLOPS) if one feeds it right.

But AFAIR the “GF 8800 GPU Technical Brief” document talks about 520 GFLOPS.
Is there some branch-unit or texture-unit ALU added and summed up?

Are these units usable from CUDA C code, or only available when one does
texture array access with some interpolation?

Sorry for all these questions about tech spec details. But first I have to convince
some people that CUDA/G80 is worth the time and effort to port some stuff to it.
Until now it has been regarded as “new toy stuff” by some people.

Greetings
Knax

You’re correct, the peak computation rate on G80 is about 346 GFLOPS.

The 520 GFLOPS number quoted in the technical brief includes some graphics-specific operations that are not directly accessible from CUDA (for example, framebuffer blending).

If you include texture lookups, which are available within CUDA, the actual number of floating point operations could be much higher.
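For anyone who wants to redo the peak arithmetic from this thread, here is a minimal sketch in plain C (host-side only, nothing CUDA-specific). The function name `peak_gflops` is made up for illustration; the inputs are the figures quoted above (128 ALUs, 1.35 GHz, and a MAD counted as 2 flops per cycle).

```c
/* Theoretical peak in GFLOPS:
 *   number of ALUs x clock in GHz x flops issued per ALU per cycle.
 * peak_gflops is a hypothetical helper, not part of any NVIDIA API. */
double peak_gflops(int alus, double clock_ghz, int flops_per_cycle)
{
    return alus * clock_ghz * flops_per_cycle;
}
```

Calling `peak_gflops(128, 1.35, 2)` gives 345.6, which is the roughly 346 GFLOPS figure discussed above. It is only an upper bound; real code gets there only if every ALU issues a MAD on every cycle.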

I mostly understand the difference between instructions, cycles, and floating-point operations, but the CUDA programming guide doesn’t tell the whole story. I would like to compare my application’s computing performance to the hardware optimum. I have an 8800GTX, so 346 GFLOPS is obviously optimal (for GPGPU), and GPUBench shows a maximum of 165 billion scalar instructions per second.

First, when counting FLOPS, I know that a MAD counts as 2 FLOPS within a single instruction. I also know that an exp is one instruction but must perform a small number of FLOPS. How many, though? On page 49 it states that most instructions take 2 cycles to issue, and that more complex instructions like floating-point reciprocal, exp, and sin take 16 cycles. Does that mean that I should count 8 or 16 FLOPS?

Secondly, do the single-precision versions of exp and sin require fewer cycles or count as fewer FLOPS? I see a 15% performance boost using those.

As it is, my app issues about 130 billion instructions per second, so I am about as close to optimum as I expect to get. Calculating cycles per second, though, I get something like 374 billion. What number can I compare that to? Counting memory speed times number of processors only gives 230 billion cycles per second (1.8G*128).

There is more discussion of this topic at http://forums.nvidia.com/index.php?showtopic=28511 and I see that my counting of cycles is wrong. I am still curious about the FLOPS counts for the regular and single-precision transcendental function instructions.

Our practice when counting flops for publications, folding@home numbers, etc. has been to count divisions, sqrts, and even transcendental functions as 1 flop. So we would count both regular and reduced precision functions as the same number of flops (1). This makes it much easier to compare performance across different hardware and also a much more “honest” representation of how much useful work is actually being done.

For example if you need sin(x), that’s really one operation, regardless of how the hardware actually comes up with that value. If ATI does it differently than NVIDIA, do you think the number of flops should change?

This leads to us counting a reciprocal sqrt as 2 flops (even though it’s one operation) and a sqrt as 1 flop (even though it’s two operations). But I think this makes sense; it’s really a quirk of the hardware that it has to calculate the reciprocal sqrt and then take the reciprocal of that to give you just the sqrt.
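To make the counting rule concrete, here is a rough sketch in plain C. The `OpCounts` struct and `count_flops` function are hypothetical names invented for this example; the weights are the ones described in this thread (MAD = 2 as mentioned in the earlier post, division, sqrt and transcendentals = 1 each, reciprocal sqrt = 2).

```c
/* Operation tallies for one kernel run. Hypothetical struct, for illustration
 * only; you would fill it in by hand from your own source code. */
typedef struct {
    long long add_mul;        /* plain adds and multiplies, 1 flop each      */
    long long mad;            /* multiply-adds, 2 flops each                  */
    long long div_sqrt;       /* divisions and sqrt, 1 flop each              */
    long long transcendental; /* sin, exp, ... regular or reduced precision,
                                 1 flop each                                  */
    long long rsqrt;          /* reciprocal sqrt, counted as 2 flops
                                 (a sqrt plus a reciprocal)                   */
} OpCounts;

/* Apply the "count what the math asks for" convention from the post above,
 * independent of how many cycles the hardware spends on each operation. */
long long count_flops(const OpCounts *c)
{
    return c->add_mul
         + 2 * c->mad
         + c->div_sqrt
         + c->transcendental
         + 2 * c->rsqrt;
}
```

Under this convention, swapping sin() for its reduced-precision version changes the runtime but not the flop total, which is the point of counting this way.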

Thanks! That’s something that I hadn’t seen written anywhere.

That method of counting may lead to flop inflation when a researcher decides to go with an analytic function instead of a transcendental one. I’ve seen far more clever schemes to boost reported performance numbers.

This whole MIPS and FLOPS stuff is, imho, misleading. Different
instructions need different numbers of cycles. It depends on your application
whether you need a lot of cheap MADs or more expensive operations. This is why
real-life problems are a better way to measure performance.

Also, in practice the compiler is the second thing being measured alongside the
processing unit, as most stuff is not hand-optimized assembly code
but some portable Fortran or C/C++ code.

So perhaps we should first port Linpack to CUDA, then we can organize
a LAN party with lots of GF8800s, connect them, and try to get into
the TOP500 with this ad hoc supercomputer :)) . Position 500 is held
by a cluster of 800 Xeons at 3.06GHz with 2.7 (real) TFLOPS and 4.8
theoretical TFLOPS (which makes 6 GFLOPS max per Xeon).

So with just fourteen 8800GTX you would get the same theoretical TFLOPS.

But maybe they only count fp64 :">
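The back-of-the-envelope numbers above can be checked with a small plain C sketch. `cards_needed` is a made-up helper name; the inputs are the thread's own figures (4.8 theoretical TFLOPS for the cluster, 345.6 GFLOPS peak per 8800GTX), and it only compares theoretical fp32 peaks, nothing measured.

```c
/* Smallest number of units whose combined peak reaches target_gflops.
 * Hypothetical helper for illustration; both arguments are in GFLOPS. */
int cards_needed(double target_gflops, double per_card_gflops)
{
    int n = (int)(target_gflops / per_card_gflops); /* round down first */
    if (n * per_card_gflops < target_gflops) {
        n++;                                        /* then round up if short */
    }
    return n;
}
```

With `cards_needed(4800.0, 345.6)` the ratio is about 13.9, so the function rounds up to fourteen cards, matching the estimate in the post.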

lol, actually that is a very cool idea to do :D

I read the TOP500 FAQ. Linpack for fp64 is used and they don’t like
configs that are only made to get into the list. It must be a system in
real productive use… and I think physics for a distributed game at a
LAN party does not count as real productive use :(