8800GTX:345GFlops or 518GFlops?

Hi. :))
I am wondering whether the 8800GTX ultimately has 345 GFlops or 518 GFlops when I use it via CUDA.
In other words :
345 = 1.35 (GHz) * 128 (SPs) * 2 (ops/issue)?
518 = 1.35 (GHz) * 128 (SPs) * 3 (ops/issue)?

Which is right?
In addition, what about the case where the MAD and MUL are performed at the same time? Does that always happen?

I have seen some people say 345 GFlops ( http://forums.nvidia.com/index.php?showtopic=36286 ) and others say 518 GFlops on various web sites. Of course, I have heard from NVIDIA that the G80’s SPs are capable of dual issue. I also read this article: http://forums.nvidia.com/index.php?showtopic=42387

If you know of an example application demonstrated to perform over 345 GFlops, please tell me. I would like to see it.

The ~300 GFLOPS number is the “correct” one if you only count the scalar processors. The ~500 GFLOPS number comes from including the arithmetic performed by the separate texture units (interpolation and so on) as well. As far as I know, the G80 can only complete (when the pipeline is full) one instruction per clock, but if that instruction is a fused MAD/MUL, then perhaps that counts as two floating point operations.

GFLOPS statistics are all nonsense anyway. :) You’re more likely to be limited by memory bandwidth or memory access patterns than FLOPS in CUDA.
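To put a rough number on that point, here is a back-of-the-envelope sketch (the 86.4 GB/s figure is the 8800 GTX’s advertised memory bandwidth; the peak used is the MAD-only one from this thread):

```python
# Rough roofline estimate: how many flops per byte a kernel must perform
# before raw ALU throughput, rather than memory bandwidth, is the limit.
peak_gflops = 345.6    # MAD-only peak discussed in this thread
bandwidth_gb_s = 86.4  # 8800 GTX memory bandwidth (384-bit GDDR3 at 900 MHz)

flops_per_byte = peak_gflops / bandwidth_gb_s
print(flops_per_byte)      # 4.0 flops per byte transferred
print(flops_per_byte * 4)  # 16.0 flops per 4-byte float read
```

So a kernel needs on the order of 16 arithmetic operations per float it streams from memory before the FLOPS number matters at all.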

Hi. Thank you, Seibert.

If a MAD and a MUL are issued at the same time, that counts as 3 floating-point operations (MUL + ADD + MUL), because a MAD is a MUL plus an ADD.

Is that right? :)

Yes. I understand that computational speed is limited by memory bandwidth more than by processor speed in most problems.

However, if we don’t know the peak performance correctly, we may be needlessly unhappy with a program’s performance even though it has already achieved almost all (over 80%) of peak. :))

For example, if my program achieves 320 GFlops against a peak of 345 GFlops, it has succeeded in bringing out the GPU’s best; but if the peak is 518 GFlops, it has not.
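That example in numbers (a small sketch of the same point, using the two candidate peaks from this thread):

```python
# The same measured 320 GFLOPS looks very different depending on
# which peak figure you compare it against.
achieved = 320.0
for peak in (345.6, 518.4):
    print(f"{achieved / peak:.0%} of a {peak} GFLOPS peak")
# -> 93% of a 345.6 GFLOPS peak
# -> 62% of a 518.4 GFLOPS peak
```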

I think it is just two:

x = a * b + c

Three operands there, but only two operations.

You forgot the ADD part in a MAD (Multiply and Add). It is:

x = [x + a*b] + c (MADD between brackets)

no, mad is

x = a*b+c

mac (multiply accumulate) on the other hand is

x = a*b+x

Both are two operations (hence multiply-add). There is no “x = [x + a*b] + c” instruction on G8x.

Yes. But if it can issue a MAD and a MUL operation (MAD/MUL) together, that makes three. That’s my point. (I made a mistake: the third operation is not another ADD but a MUL, and in my example you can’t do a double issue anyway, so wrong wrong, forget about it.) So, two floating-point operations in a MAD (MAC or whatever; it is a double floating-point operation) plus a multiplication makes 3 floating-point operations per clock cycle per core.

3 flops/clock * 1.35 GHz * 128 cores = 518.40 GFLOPS
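Both numbers fall out of the same formula (a quick sketch; the 1.35 GHz clock and 128 SPs are the figures quoted in this thread):

```python
# Peak GFLOPS = clock (GHz) * SP count * flops retired per SP per clock.
def peak_gflops(clock_ghz, sps, flops_per_clock):
    return clock_ghz * sps * flops_per_clock

mad_only = peak_gflops(1.35, 128, 2)      # MAD = MUL + ADD = 2 flops
mad_plus_mul = peak_gflops(1.35, 128, 3)  # MAD dual-issued with a MUL = 3 flops

print(mad_only)      # ~345.6
print(mad_plus_mul)  # ~518.4
```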

The problem, I think, is that the double issue is not easy to achieve. I read about it on this forum some time ago (in the early days of CUDA), and you can find some information about the missing MUL on Beyond3D. I don’t really know what the problem is. In some places they say it’s 518 GFLOPs and in others 345 GFLOPs, and the difference comes from counting the double (MAD/MUL) issue or not.

This does not quite match the explanation in the CUDA FAQ you link to (which describes Tesla, but it uses the same G80 chip):

Now, one could discuss how to parse the “plus” in that first sentence. Does it mean the units can do a multiply-add and a sin/cos, but not at the same time (which is what I suspect), or that cos( a*x + b ) takes only one clock cycle (unlikely, but it would be neat)?

Either way, I don’t see any mention of a multiply-add plus another add operation. Also, by “dual issue” are you referring to the pipeline? Each instruction actually takes 2 clock cycles to finish, but instructions are pipelined so that one can finish per clock cycle as long as the pipeline is full.

I didn’t link to the CUDA FAQ (that was another person); I linked to Beyond3D, and I don’t know how reliable that source of information is for these issues :?

cos( ax + b ) cannot take only one clock cycle if there is no instruction that actually computes cos( ax + b ) (and as far as I know, there isn’t one). So those extra GFLOPs can’t come from there (I think).

And yes, there is no mention of a multiply-add plus another add operation (that was my mistake), but you can read something about a double issue of a multiply-add plus another MUL. I can’t be sure how this works, because a double issue with a single retirement does not count as 3 FLOPs (you count retired instructions, not issued instructions). But if the MAD and the MUL are chained (and I’m speculating even more here), in code like this:

1) x = a*b + c
2) y = x*d
3) x’ = a’*b’ + c’
4) y’ = x’*d’
5) x’’ = a’’*b’’ + c’’
6) y’’ = x’’*d’’

2) and 3) could be double issued, as well as 4) and 5). And 1) would be chained to 2), as 3) is to 4) and 5) to 6). Chained operations are not retired (their results are forwarded), so you could effectively complete a MUL and a MAD in one clock cycle. In the pipeline you would see something like:

-------- clock
1)
-------- clock
2) 3) <- three flops; the result of 1) is chained to 2)
-------- clock
4) 5) <- three flops; the result of 3) is chained to 4)
-------- clock
6)
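The speculated schedule can be tallied explicitly (a toy model of the hypothetical issue pattern above, not a description of how G80 actually works):

```python
# Toy tally of the speculated schedule: each clock retires the listed
# instructions; a MAD counts as 2 flops and a MUL as 1.
FLOPS = {"MAD": 2, "MUL": 1}

# Clock-by-clock pattern for instructions 1)..6) from the example above.
schedule = [
    ["MAD"],         # 1)
    ["MUL", "MAD"],  # 2) chained to 1), dual-issued with 3)
    ["MUL", "MAD"],  # 4) chained to 3), dual-issued with 5)
    ["MUL"],         # 6)
]

per_clock = [sum(FLOPS[op] for op in clock) for clock in schedule]
print(per_clock)                       # [2, 3, 3, 1]
print(sum(per_clock) / len(schedule))  # 2.25 flops/clock on average here,
                                       # approaching 3 in steady state
```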

Some DSPs do this in order to perform a MAC operation in the same time as a MUL operation.

But as I said, I’m just speculating. The double issue is the only explanation I’ve found for the 518.4 GFLOPs. Sorry I can’t be of more help, but that’s everything I know, and it’s everything I can say without peeking into the G80 architecture O:)