8800GTX:345GFlops or 518GFlops?

Hi. :))
I am wondering whether the 8800GTX ultimately has 345 GFlops or 518 GFlops when I use it via CUDA.
In other words :
345 = 1.35 (GHz) * 128 (SPs) * 2 (ops/issue)?
518 = 1.35 (GHz) * 128 (SPs) * 3 (ops/issue)?

Which is right?
In addition, what about the case where the MAD and MUL are performed at the same time? Does that always happen?

I have seen some people say 345 GFlops ( http://forums.nvidia.com/index.php?showtopic=36286 ) and others say 518 GFlops on various web sites. Of course, I have heard from NVIDIA that the G80’s SPs are capable of dual issue. I also read this article: http://forums.nvidia.com/index.php?showtopic=42387

If you know of an example application demonstrated to perform over 345 GFlops, please tell me. I would like to see it.

The ~300 GFLOPS number is the “correct” one if you only count the scalar processors. The ~500 GFLOPS number comes from including the arithmetic performed by the separate texture units (interpolation and so on) as well. As far as I know, the G80 can only complete (when the pipeline is full) one instruction per clock, but if that instruction is a fused MAD/MUL, then perhaps that counts as two floating point operations.

GFLOPS statistics are all nonsense anyway. :) You’re more likely to be limited by memory bandwidth or memory access patterns than FLOPS in CUDA.
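To put a rough number on that point, here is a back-of-the-envelope sketch (the 86.4 GB/s figure is the 8800 GTX’s advertised memory bandwidth; the peak used is the MAD-only one from this thread):

```python
# Rough roofline estimate: how many flops per byte a kernel must perform
# before raw ALU throughput, rather than memory bandwidth, is the limit.
peak_gflops = 345.6    # MAD-only peak discussed in this thread
bandwidth_gb_s = 86.4  # 8800 GTX memory bandwidth (384-bit GDDR3 at 900 MHz)

flops_per_byte = peak_gflops / bandwidth_gb_s
print(flops_per_byte)      # 4.0 flops per byte transferred
print(flops_per_byte * 4)  # 16.0 flops per 4-byte float read
```

So a kernel needs on the order of 16 arithmetic operations per float it streams from memory before the FLOPS number matters at all.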

Hi. Thank you, Seibert.

If a MAD and a MUL are issued at the same time, that counts as 3 floating-point operations (MUL + ADD + MUL), because a MAD is a MUL plus an ADD.

Is that right? :)

Yes. I understand that computational speed is limited by memory bandwidth more than by processor speed in most problems.

However, if we don’t know the peak performance correctly, we may be needlessly unhappy with a program’s performance even though it has already achieved almost all (over 80%) of peak. :))

For example, if my program achieves 320 GFlops against a peak of 345 GFlops, it has succeeded in bringing out the GPU’s best; but if the peak is 518 GFlops, it has not.
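That example in numbers (a small sketch of the same point, using the two candidate peaks from this thread):

```python
# The same measured 320 GFLOPS looks very different depending on
# which peak figure you compare it against.
achieved = 320.0
for peak in (345.6, 518.4):
    print(f"{achieved / peak:.0%} of a {peak} GFLOPS peak")
# -> 93% of a 345.6 GFLOPS peak
# -> 62% of a 518.4 GFLOPS peak
```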

I think it is just two:

x = a * b + c

Three operands there, but only two operations.

You forgot the ADD part in a MAD (Multiply and Add). It is:

x = [x + a*b] + c (MADD between brackets)

no, mad is

x = a*b+c

mac (multiply accumulate) on the other hand is

x = a*b+x

Both are two operations (hence multiply-add). There is no “x = [x + a*b] + c” instruction on G8x.

Yes. But if it can issue a MAD and a MUL operation (MAD/MUL) together, that makes three. That’s my point. (I made a mistake: the third operation is not another ADD but a MUL, and in my example you can’t do a double issue anyway, so wrong wrong, forget about it.) So, two floating-point operations in a MAD (MAC or whatever; it is a double floating-point operation) plus a multiplication makes 3 floating-point operations per clock cycle per core.

3 flops/clock * 1.35 GHz * 128 cores = 518.40 GFLOPS
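Both numbers fall out of the same formula (a quick sketch; the 1.35 GHz clock and 128 SPs are the figures quoted in this thread):

```python
# Peak GFLOPS = clock (GHz) * SP count * flops retired per SP per clock.
def peak_gflops(clock_ghz, sps, flops_per_clock):
    return clock_ghz * sps * flops_per_clock

mad_only = peak_gflops(1.35, 128, 2)      # MAD = MUL + ADD = 2 flops
mad_plus_mul = peak_gflops(1.35, 128, 3)  # MAD dual-issued with a MUL = 3 flops

print(mad_only)      # ~345.6
print(mad_plus_mul)  # ~518.4
```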

The problem, I think, is that the double issue is not easy to achieve. I read about it on this forum some time ago (in the early days of CUDA), and you can find some information about the missing MUL on Beyond3D. I don’t really know what the problem is. In some places they say it’s 518 GFLOPs and in others 345 GFLOPs, and the difference comes from counting the double (MAD/MUL) issue or not.

This does not quite match the explanation in the CUDA FAQ you link to (which describes Tesla, but it uses the same G80 chip):

Now, one could discuss how to parse the “plus” in that first sentence. Does it mean the units can do a multiply-add and a sin/cos, but not at the same time (which is what I suspect), or that cos( a*x + b ) takes only one clock cycle (unlikely, but it would be neat)?

Either way, I don’t see any mention of a multiply-add plus another add operation. Also, by “dual issue” are you referring to the pipeline? Each instruction actually takes 2 clock cycles to finish, but instructions are pipelined so that one can finish per clock cycle as long as the pipeline is full.

I didn’t link to the CUDA FAQ (that was another person); I linked to Beyond3D, and I don’t know how reliable that source of information is for these issues :?

cos( ax + b ) cannot take only one clock cycle if there is no instruction that actually computes cos( ax + b ) (and as far as I know, there isn’t one). So those extra GFLOPs can’t come from there (I think).

And yes, there is no mention of a multiply-add plus another add operation (that was my mistake), but you can read something about a double issue of a multiply-add plus another MUL. I can’t be sure how this works, because a double issue with a single retirement does not count as 3 FLOPs (you count retired instructions, not issued instructions). But if the MAD and the MUL are chained (and I’m speculating even more here), in code like this:

1) x = a*b + c
2) y = x*d
3) x’ = a’*b’ + c’
4) y’ = x’*d’
5) x’’ = a’’*b’’ + c’’
6) y’’ = x’’*d’’

2) and 3) could be double issued, as well as 4) and 5). And 1) would be chained to 2), as 3) is to 4) and 5) to 6). Chained operations are not retired (their results are forwarded), so you could effectively complete a MUL and a MAD in one clock cycle. In the pipeline you would see something like:

-------- clock
1)
-------- clock
2) 3) <- three flops; the result of 1) is chained to 2)
-------- clock
4) 5) <- three flops; the result of 3) is chained to 4)
-------- clock
6)
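The speculated schedule can be tallied explicitly (a toy model of the hypothetical issue pattern above, not a description of how G80 actually works):

```python
# Toy tally of the speculated schedule: each clock retires the listed
# instructions; a MAD counts as 2 flops and a MUL as 1.
FLOPS = {"MAD": 2, "MUL": 1}

# Clock-by-clock pattern for instructions 1)..6) from the example above.
schedule = [
    ["MAD"],         # 1)
    ["MUL", "MAD"],  # 2) chained to 1), dual-issued with 3)
    ["MUL", "MAD"],  # 4) chained to 3), dual-issued with 5)
    ["MUL"],         # 6)
]

per_clock = [sum(FLOPS[op] for op in clock) for clock in schedule]
print(per_clock)                       # [2, 3, 3, 1]
print(sum(per_clock) / len(schedule))  # 2.25 flops/clock on average here,
                                       # approaching 3 in steady state
```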

Some DSPs do this in order to perform a MAC operation in the same time as a MUL operation.

But as I said, I’m just speculating. The double issue is the only explanation I’ve found for the 518.4 GFLOPs. Sorry I can’t be of more help, but that’s everything I know, and it’s everything I can say without peeking into the G80 architecture O:)