gigaflops

sagrailo · July 22, 2008, 11:17am

I don’t understand exactly how the peak gigaflops numbers announced for various NVIDIA products are calculated. At first I thought it was simply the number of streaming processors (which is in turn the number of multiprocessors times 8 processors per multiprocessor) times the clock frequency, but the announced numbers are bigger. Then I realized they may be based on multiply-add instructions, which do 3 floating-point operations “at once”, and indeed multiplying the above by a factor of 3 sometimes gives approximately the announced value. However, while searching this forum for references on this topic, I found a claim that multiply-add takes 2 cycles, which would break this calculation. On the other hand, there are some numbers that just don’t match anything. See for example the C870 specification at http://www.nvidia.com/object/tesla_c870.html, where I could not arrive at either the 430 or the 512 gigaflops mentioned there in any way… So, any hints on this?

_Big_Mac · July 22, 2008, 12:39pm

AFAIK the “marketing” numbers include calculations made both by the streaming processors and by the texture processing units. We don’t have direct access to the TPUs’ processing power via CUDA except for texture filtering (when we bind a texture to a cudaArray), normalizing, etc. The programming guide says that using interpolation with texture memory is “free”. In reality there are of course calculations associated with this, but they are made “behind the scenes” by the TPU so they don’t bog down the streaming processors.

If you’re only counting SPs, AFAIK you can get max 2 flops per cycle (one MAD) and the formula is:
#SM * 8 (#SP per SM) * core frequency in GHz * 2 (flops per MAD)

The “marketing” numbers often multiply by 3 instead of 2 accounting for TPU flops.
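
As a rough illustration of the formula above, here is a minimal Python sketch. The function name `peak_gflops` and the example figures (16 SMs and a 1.35 GHz shader clock for an 8800 GTX) are assumptions brought in for illustration; they are not given in this post.

```python
def peak_gflops(num_sm, clock_ghz, flops_per_cycle=2, sp_per_sm=8):
    """Theoretical peak GFLOP/s: #SM * #SP per SM * clock (GHz) * flops per SP per cycle."""
    return num_sm * sp_per_sm * clock_ghz * flops_per_cycle


# Assumed 8800 GTX figures: 16 SMs, 1.35 GHz shader clock.
mad_only = peak_gflops(16, 1.35, flops_per_cycle=2)    # counting one MAD per cycle
with_extra = peak_gflops(16, 1.35, flops_per_cycle=3)  # the "marketing" factor of 3
print(f"{mad_only:.1f} vs {with_extra:.1f} GFLOP/s")
```

Under those assumptions the two factors give roughly 345.6 and 518.4 GFLOP/s, so the choice of 2 versus 3 flops per cycle accounts for the whole gap between the two kinds of quoted numbers.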

tmurray · July 22, 2008, 8:36pm

You can get three flops per cycle per SP: MAD+MUL. Hence, the numbers work out.

If you’re really curious about all that, read Rys Sommefeldt’s GT200 architecture overview, “Beyond3D - NVIDIA GT200 GPU and Architecture Analysis”: http://www.beyond3d.com/content/reviews/51

Or, if you can read French, read Damien Triolet’s piece at hardware.fr.

sagrailo · July 22, 2008, 8:54pm

Thanks for the pointers, interesting read. As a matter of fact, I got confused about gigaflops while working through the UIUC CUDA lectures, which I can also recommend for an even more detailed, purely HPC-oriented examination of the GPU architecture (albeit for the G80 series)… So I guess we can conclude that 3 * #SM * (#SP per SM) * freq_GHz is a good theoretical estimate of peak gigaflops, to compare the results of a specific kernel against.

MisterAnderson42 · July 22, 2008, 11:49pm

Don’t forget to count the GB/s of global memory bandwidth you use too. That usually limits the performance of a kernel before you hit the GFLOP limit.
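
One common way to make this point concrete is a roofline-style bound, which is not spelled out in this thread: attainable throughput is the smaller of the arithmetic peak and the memory bandwidth times the kernel's flops per byte. A minimal Python sketch follows; the function name and the example numbers (a 933 GFLOP/s peak, 140 GB/s of bandwidth, a kernel at 0.25 flop/byte) are illustrative assumptions, not measurements.

```python
def attainable_gflops(peak_gflops, bandwidth_gb_s, flops_per_byte):
    """Roofline-style bound: limited by either the arithmetic peak or memory traffic."""
    return min(peak_gflops, bandwidth_gb_s * flops_per_byte)


# Hypothetical kernel: one MAD (2 flops) per two 4-byte floats loaded,
# i.e. 2 flops per 8 bytes = 0.25 flop/byte.
bound = attainable_gflops(peak_gflops=933.0, bandwidth_gb_s=140.0, flops_per_byte=0.25)
print(f"{bound:.1f} GFLOP/s")
```

With these made-up inputs the bandwidth term is far below the arithmetic peak, which is the situation described above: the kernel runs out of GB/s long before it runs out of GFLOP/s.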

aakova · July 24, 2008, 7:31pm

I’m confused. Beyond3D suggests using a different clock rate for this calculation, something called the “hot clock” at 1.296 GHz, which results in the headline performance of 933 GFLOPS for the stock GTX 280.

If I’m not using operations that come from texture interpolation, is the max FP throughput I can achieve per cycle a MADD (2 FLOPS), or a MADD plus some other op like an ADD or MUL? Does the answer differ between the 8800 GTX and the GTX 280?

Is the “hot clock” real or a misunderstanding on the Beyond3D writer’s part?

E.D_Riedijk · July 24, 2008, 7:41pm

Well, from what I understood from the article on AnandTech (nice read), G80 was also able to perform a MADD and a MUL at the same time (3 FLOPS per clock), BUT the chances of this happening were really low because of some design mistake. With GT200, the chance of this happening is much, much higher (I don’t remember the numbers). So I think effectively you can get something like 2.9 FLOPS per clock, with 3 as the peak.

tmurray · July 24, 2008, 7:43pm

Hot clock is real and is the clock you should use. Each SP is capable of 3 Flops (MAD + MUL) per cycle. Texture operations are not counted towards peak arithmetic throughput.

3 Flop/cycle * 240 SPs * 1.296 GHz = 933 Gflop/s.

MisterAnderson42 · July 24, 2008, 8:01pm

Maybe no longer with the GTX 280, but for older hardware the “marketing number” of 500 GFLOP/s for the 8800 GTX did include the texture interpolation. The CUDA programming guide got the GFLOP/s correct at 340 for the 8800 GTX which just counts one MAD/clock/SP. This may be where some of the confusion is coming in.

tmurray · July 24, 2008, 9:31pm

Eh, yes and no. (Disclaimer: I used to work for Rys at Beyond3D.) The missing MUL wasn’t accessible in graphics in most driver revisions (according to Arun, there was exactly one release where it was enabled, but I think that was a leaked driver), but as far as I know it WAS accessible through CUDA. (Technically, that is; getting it to be used consistently was another matter. GTX 280 doesn’t have that problem.)

Rys · July 25, 2008, 4:07am

Tim’s right; we never saw the SFU MUL in general shading in graphics mode on a G8x or G9x chip, and the freak result on one driver with one chip was likely testing error. Compute mode is something else, but don’t rely on getting your trifecta of flops per clock per SP on anything but GT200 at this point, especially in graphics mode.

cho · July 25, 2008, 4:33am

my results:

FW 177.26 for Vista x64

D3D9

GeForce GTX 280
MAD_MUL_1D_Issue, 365.661957 B instr/s
=1.5235914875 B instr/s per SP
=1.176 instr per SP per cycle

GeForce 9800 GTX
MAD_MUL_1D_Issue, 191.648132 B instr/s
=1.49725103125 B instr/s per SP
=0.887 instr per SP per cycle
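
For readers who want to retrace the arithmetic in these results, here is a minimal Python sketch. The helper name is hypothetical, and the SP counts (240 and 128) and shader clocks (1.296 GHz and 1.688 GHz) are assumptions inferred from the ratios posted above rather than values stated in the post.

```python
def instr_per_sp_per_cycle(total_ginstr_s, num_sp, shader_clock_ghz):
    """Convert a whole-chip issue rate (billions of instr/s) to per-SP and per-cycle rates."""
    per_sp = total_ginstr_s / num_sp          # B instr/s per SP
    return per_sp, per_sp / shader_clock_ghz  # dividing by GHz gives instr per SP per cycle


# Assumed SP counts and shader clocks, chosen to be consistent with the posted ratios.
gtx280 = instr_per_sp_per_cycle(365.661957, num_sp=240, shader_clock_ghz=1.296)
g9800 = instr_per_sp_per_cycle(191.648132, num_sp=128, shader_clock_ghz=1.688)
print(f"GTX 280:  {gtx280[0]:.4f} B instr/s per SP, {gtx280[1]:.3f} instr/SP/cycle")
print(f"9800 GTX: {g9800[0]:.4f} B instr/s per SP, {g9800[1]:.3f} instr/SP/cycle")
```

A value above 1 instruction per SP per cycle means the MAD and the MUL are at least partly being issued together; a value below 1 means they are not.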

MisterAnderson42 · July 25, 2008, 1:08pm

I just want to keep the history books straight here, because the difference between the marketing GFLOPS and the MAD GFLOPS for the 8800 GTX was common knowledge on the forums back in the early days of CUDA (is anyone else even still around from that time?).

I know everyone seems to be obsessed with the MAD+MUL thing (I couldn’t care less… the calculations I perform hardly use any MADs at all, much less stack a MUL after every one), but it isn’t the answer to every GFLOPS-based question :)

So are you saying the FAQ is wrong?

And are you saying that Simon Green was wrong?

http://forums.nvidia.com/index.php?showtopic=28512&hl=

E.D_Riedijk · July 25, 2008, 5:08pm

Well, I started lurking around about 1.5 years ago. FWIW I also remember very well that I always understood the high GFLOPS number to be because of texture filtering & such.

aakova · July 27, 2008, 8:25pm

How is the hot clock frequency of a GTX 280 determined? Is it derived from the base or memory frequencies, or is it independent, and thus something one needs to find in the documentation for a given board?

[edit: This appears to be (variously) called the ALU clock or Shader clock.]

Rys · July 28, 2008, 8:40am

The hot clock is just a 2x multiplier of the scheduler frequency.
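
Read literally, that relation can be sketched in a few lines of Python. The 0.648 GHz scheduler clock is an assumption, obtained by halving the 1.296 GHz hot clock quoted earlier in the thread, and the function name is hypothetical.

```python
def hot_clock_ghz(scheduler_clock_ghz, multiplier=2):
    """Hot (shader/ALU) clock as a fixed multiple of the scheduler clock."""
    return scheduler_clock_ghz * multiplier


# Assumed scheduler clock: half of the 1.296 GHz hot clock quoted for the GTX 280.
hot = hot_clock_ghz(0.648)
peak = 3 * 240 * hot  # 3 flops/cycle * 240 SPs * hot clock in GHz
print(f"hot clock {hot:.3f} GHz, peak {peak:.0f} GFLOP/s")
```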

alex_dubinsky · September 11, 2008, 6:43am

To be precise, texture filtering itself actually adds up into the teraflops. A single aniso-16x trilinear fetch takes on the order of 100 ops, and fetches can be processed quickly by the TM units if the data’s in the cache. What was being measured was something else. I’d read (on Beyond3D, in fact) it was something like a multiply of the texture fetch result, probably used for advanced alpha blending. The main distinction between it and aniso filtering was presumably that aniso uses pre-determined coefficients while this multiply could use a programmed value. Hence, it wasn’t “special-purpose hardware.”
