The ~300 GFLOPS number is the “correct” one if you only count the scalar processors. The ~500 GFLOPS number comes from also including the arithmetic performed by the separate texture units (interpolation and so on). As far as I know, the G80 can only complete (when the pipeline is full) one instruction per clock, but if that instruction is a fused multiply-add (MAD), then perhaps that counts as two floating-point operations.
GFLOPS statistics are all nonsense anyway. :) You’re more likely to be limited by memory bandwidth or memory access patterns than FLOPS in CUDA.
Yes. But if it can issue a MAD and a MUL operation (MAD/MUL) at the same time, that makes three. That’s my point. (I made a mistake: the third operation is not another ADD but a MUL, and that way you can’t do a double issue anyway, so wrong, wrong, forget about my example.) So, two floating point operations in a MAD (MAC or whatever; it is a double floating point operation) plus a multiplication makes 3 floating point operations per clock cycle per core.
3 flops/cycle * 1.35 GHz * 128 cores = 518.4 GFLOPS
The problem, I think, is that the double issue is not easy to achieve. I read about it in this forum some time ago (in the early days of CUDA) and you can find some information about the missing MUL in Beyond3D. I don’t really know what the problem is. In some places they say it’s 518 GFLOPs and some other places they say 345 GFLOPs, and the difference arises from counting the double (MAD/MUL) issue or not.
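Both quoted figures fall out of the same arithmetic, just with a different flops-per-clock count. A quick sketch (the 1.35 GHz shader clock and 128 scalar processors are the published 8800 GTX numbers):

```python
# Peak GFLOPS for G80 under the two counting conventions discussed above.
SHADER_CLOCK_GHZ = 1.35   # 8800 GTX shader clock
NUM_SPS = 128             # scalar processors

def peak_gflops(flops_per_clock):
    return flops_per_clock * SHADER_CLOCK_GHZ * NUM_SPS

# MAD only: 2 flops (multiply + add) per SP per clock
print(round(peak_gflops(2), 1))  # 345.6
# MAD plus a dual-issued MUL: 3 flops per SP per clock
print(round(peak_gflops(3), 1))  # 518.4
```

So the 345 vs. 518 discrepancy really is just whether you believe the extra MUL can be issued every clock.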
This does not quite match the explanation in the CUDA FAQ you link to (which describes Tesla, but that uses the same G80 chip):
Now, one could debate how to parse the “plus” in that first sentence. Does it mean the units can do a multiply-add and sin/cos, but not at the same time (which is what I suspect), or that cos( a*x + b ) takes only one clock cycle (unlikely, but it would be neat)?
Either way, I don’t see any mention of a multiply-add plus another add operation. Also, by “dual issue” are you referring to the pipeline? Each instruction actually takes 2 clock cycles to finish, but they are pipelined to allow one to finish per clock cycle as long as the pipeline is full.
I didn’t link to the CUDA FAQ (that was another person), I linked to Beyond3D, and I don’t know how reliable that source is for these issues :?
cos( ax + b ) cannot take only one clock cycle if there is no instruction that actually does cos( ax + b ) (and as far as I know, there isn’t one). So those extra GFLOPs can’t come from there (I think).
And yes, there is no mention of a multiply-add plus another add operation (that was my mistake), but you can read something about a double issue of a multiply-add plus another MUL. I can’t be sure about how this works, because a double issue with a single retirement does not count as 3 flops (you count retired instructions, not issued instructions). But if the MAD and the MUL are chained (and I’m speculating even more here), in code like this:
1) x = a*b + c
2) y = x*d
3) x’ = a’*b’ + c’
4) y’ = x’*d’
5) x’’ = a’’*b’’ + c’’
6) y’’ = x’’*d’’
then 2) and 3) could be double issued, as well as 4) and 5). And 1) would be chained to 2), as well as 3) to 4) and 5) to 6). Chained operations are not retired (their results are forwarded), and you could effectively complete a MUL and a MAD in one clock cycle. In a pipeline you would see something like:
[pipeline diagram lost: each row completed three flops, with the result of 2 chained to 5 and the result of 4 chained to 6]
Some DSPs do this in order to perform a MAC operation in the same time as a MUL operation.
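Just to pin down the flop accounting in the chained sequence above (a toy count, under the speculative assumption that each MAD+MUL pair would fit in one cycle with forwarding):

```python
# Each pair  x = a*b + c ; y = x*d  performs 3 floating point
# operations: 2 in the MAD and 1 in the chained MUL. Under the
# speculated dual issue, each pair would complete in one cycle,
# giving the 3 flops/cycle/core figure.

def chained_pairs(triples, d=0.5):
    """triples: list of (a, b, c); returns the y values and the flop count."""
    flops = 0
    ys = []
    for a, b, c in triples:
        x = a * b + c   # MAD: 2 flops
        y = x * d       # chained MUL: 1 flop (input forwarded from the MAD)
        flops += 3
        ys.append(y)
    return ys, flops

ys, flops = chained_pairs([(1.0, 2.0, 3.0), (4.0, 5.0, 6.0)])
print(flops)  # 6 flops for 2 pairs -> 3 per (hypothetical) cycle
```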
But as I said, I’m just speculating. The double issue is the only explanation I’ve found for the 518.4 GFLOPs. Sorry I can’t be of more help, but that’s everything I know, and it’s everything I can say without peeking into the G80 architecture.