I don’t understand exactly how the peak gigaflops numbers announced for various NVIDIA products are calculated. At first I thought it was simply the number of streaming processors (which is in turn the number of multiprocessors times 8 processors per multiprocessor) times the clock frequency, but the announced numbers are bigger. Then I realized they may be based on multiply-add instructions, which do 3 floating-point operations “at once”, and indeed multiplying the above by a factor of 3 sometimes gives approximately the announced value. However, while searching this forum for references on this topic, I found a claim that multiply-add takes 2 cycles, which would break this calculation. On the other hand, some numbers just don’t match anything: see for example the C870 specification at http://www.nvidia.com/object/tesla_c870.html, where I could not arrive at either the 430 or the 512 gigaflops mentioned there in any way… So, any hints on this?
AFAIK the “marketing” numbers include calculations made both by the streaming processors and by the texture processing unit. We don’t have direct access to the TPUs’ processing power via CUDA except for texture filtering (when we bind a texture to a cudaArray), normalizing, etc… The programming guide says that using interpolation with texture memory is “free”. In reality there are of course calculations associated with this, but they are made “behind the scenes” by the TPU so they don’t bog down the streaming processors.
If you’re only counting SPs, AFAIK you can get at most 2 flops per cycle per SP (one MAD), and the formula is:
#SM * 8 (#SP per SM) * shader clock frequency in GHz * 2 (flops per MAD, one MAD per cycle)
The “marketing” numbers often multiply by 3 instead of 2 to account for TPU flops.
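As a rough sketch of the formula above (not an official NVIDIA calculation), the snippet below computes the theoretical peak from the SM count, SPs per SM, clock and flops per cycle. The function name `peak_gflops` and the example figures (16 SMs at a 1.35 GHz shader clock, the configuration commonly quoted for the 8800 GTX) are assumptions for illustration only.

```python
def peak_gflops(num_sm, clock_ghz, flops_per_cycle=2, sp_per_sm=8):
    """Theoretical peak in GFLOP/s: #SM * #SP per SM * clock (GHz) * flops per cycle per SP."""
    return num_sm * sp_per_sm * clock_ghz * flops_per_cycle


# Assumed example: 16 SMs at a 1.35 GHz shader clock (commonly quoted for the 8800 GTX).
mad_only = peak_gflops(16, 1.35, flops_per_cycle=2)   # one MAD per cycle per SP
factor_3 = peak_gflops(16, 1.35, flops_per_cycle=3)   # multiplying by 3 instead of 2

print(f"MAD only: {mad_only:.1f} GFLOP/s, factor of 3: {factor_3:.1f} GFLOP/s")
```

Under those assumed figures the two results come out at roughly 346 and 518 GFLOP/s, which is why the same card can be quoted with two quite different peak numbers.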
You can get three flops per cycle per SP: MAD + MUL. Hence, the numbers work out.
If you’re really curious about all that, read Rys Sommefeldt’s GT200 architecture overview: http://www.beyond3d.com/content/reviews/51
Or, if you can read French, read Damien Triolet’s piece at hardware.fr.
Thanks for the pointers, interesting read. As a matter of fact, I got confused about gigaflops while working through the UIUC CUDA lectures, which I can also recommend for an even more detailed, purely HPC-oriented examination of the GPU architecture (albeit for the G80 series)… So I guess we can conclude that 3 * #SM * #SP per SM * freq_GHz is a good theoretical estimate of maximum gigaflops, to compare the results of a specific kernel against.
Don’t forget to count the GB/s of global memory bandwidth you use too. That usually limits the performance of a kernel before you hit the GFLOP limit.
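To illustrate the point about bandwidth, here is a minimal sketch that takes the lower of the compute limit and the memory limit as the bound for a kernel. The function name, the 141.7 GB/s bandwidth figure (assumed for a GTX 280-class card) and the example kernel are all assumptions, not measurements.

```python
def attainable_gflops(peak_gflops, bandwidth_gb_s, flops_per_byte):
    """Upper bound on kernel throughput: the lower of the compute and memory limits."""
    memory_bound = bandwidth_gb_s * flops_per_byte  # GFLOP/s the memory system can feed
    return min(peak_gflops, memory_bound)


# Assumed figures: 933 GFLOP/s peak and 141.7 GB/s global memory bandwidth.
# Hypothetical kernel: one MAD (2 flops) per 8 bytes of global memory traffic.
limit = attainable_gflops(933.0, 141.7, 2 / 8)
print(f"Bound: {limit:.1f} GFLOP/s")
```

With so few flops per byte, this hypothetical kernel is capped by memory traffic at a small fraction of the arithmetic peak.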
I’m confused. Beyond3D suggests using a different clock rate for this calculation, something called the “hot clock” at 1.296 GHz, which results in the headline performance of 933 GFLOPS for the stock GTX 280.
If I’m not using operations that come from texture interpolation, is the max FP throughput I can achieve per cycle a MADD (2 FLOPS), or a MADD plus some other op like an ADD or MUL? Does the answer differ between the 8800 GTX and the GTX 280?
Is the “hot clock” real or a misunderstanding on the Beyond3D writer’s part?
Well, from what I understood from the article on AnandTech (nice read), G80 was also able to perform a MADD and a MUL at the same time (3 FLOPS per clock), BUT the chances of this happening were really low because of some design mistake. With GT200, the chance of this happening is much, much higher (I don’t remember the numbers). So I think effectively you can have something like 2.9 FLOPS per clock, with 3 as the peak.
Hot clock is real and is the clock you should use. Each SP is capable of 3 Flops (MAD + MUL) per cycle. Texture operations are not counted towards peak arithmetic throughput.
3 Flop/cycle * 240 SPs * 1.296 GHz = 933 Gflop/s.
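A quick check of that arithmetic, as a sketch. The 240 SPs, 3 flops per cycle and 1.296 GHz hot clock come from the post above; the 0.602 GHz core clock is an assumed figure (the commonly listed GTX 280 core clock), included only to show that it does not reproduce the headline number.

```python
SP_COUNT = 240          # SPs on the GTX 280, as stated above
FLOPS_PER_CYCLE = 3     # MAD (2 flops) + MUL (1 flop) per SP per cycle


def peak_at(clock_ghz):
    """Peak GFLOP/s for the given clock in GHz."""
    return FLOPS_PER_CYCLE * SP_COUNT * clock_ghz


hot = peak_at(1.296)    # hot (shader) clock
core = peak_at(0.602)   # assumed core clock, for comparison only

print(f"hot clock:  {hot:.2f} GFLOP/s")
print(f"core clock: {core:.2f} GFLOP/s")
```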
Maybe no longer with the GTX 280, but for older hardware the “marketing number” of 500 GFLOP/s for the 8800 GTX did include the texture interpolation. The CUDA programming guide got the GFLOP/s correct at 340 for the 8800 GTX which just counts one MAD/clock/SP. This may be where some of the confusion is coming in.
Eh, yes and no. (Disclaimer: I used to work for Rys at Beyond3D.) The missing MUL wasn’t accessible in graphics in most driver revisions (according to Arun, there was exactly one release where it was enabled, but I think that was a leaked driver), but as far as I know it WAS accessible through CUDA. (technically; getting it to be used consistently, though, was another matter. GTX 280 doesn’t have that problem)
Tim’s right; we never saw the SFU MUL in general shading in graphics mode on a G8x or G9x chip, and the freak result on one driver with one chip was likely testing error. Compute mode is something else, but don’t rely on getting your trifecta of flops per clock per SP on anything but GT200 at this point, especially in graphics mode.
FW 177.26 for Vista x64
GeForce GTX 280
MAD_MUL_1D_Issue, 365.661957 B instr/s
=1.5235914875 B instr/s per SP
=1.176 instr per SP per cycle
GeForce 9800 GTX
MAD_MUL_1D_Issue, 191.648132 B instr/s
=1.49725103125 B instr/s per SP
=0.887 instr per SP per cycle
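The per-SP and per-cycle figures above can be reproduced with a short sketch like the following. The measured rates are the ones quoted above and the GTX 280 values (240 SPs, 1.296 GHz) appear elsewhere in the thread; the 9800 GTX values (128 SPs, 1.688 GHz shader clock) are assumptions chosen because they match the quoted results.

```python
def per_sp_per_cycle(giga_instr_per_s, num_sp, shader_clock_ghz):
    """Return (G instr/s per SP, instr per SP per shader clock cycle)."""
    per_sp = giga_instr_per_s / num_sp
    return per_sp, per_sp / shader_clock_ghz


# (measured G instr/s, SP count, shader clock in GHz)
cards = {
    "GeForce GTX 280": (365.661957, 240, 1.296),
    "GeForce 9800 GTX": (191.648132, 128, 1.688),  # SP count and clock are assumed
}

for name, (rate, sps, clock) in cards.items():
    per_sp, per_cycle = per_sp_per_cycle(rate, sps, clock)
    print(f"{name}: {per_sp:.4f} G instr/s per SP, {per_cycle:.3f} instr per SP per cycle")
```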
I just want to keep the history books straight here, because the difference between the marketing GFLOPS and the MAD GFLOPS for the 8800 GTX was common knowledge on the forums back in the early days of CUDA (is anyone else still around from that time, even?)
I know everyone seems to be obsessed with the MAD+MUL thing (I couldn’t care less… the calculations I perform hardly use any MADs at all, much less stack a MUL after every one), but it isn’t the answer to every GFLOPS-based question :)
So are you saying the FAQ is wrong?
And are you saying that Simon Green was wrong?
Well, I started lurking around about 1.5 years ago. FWIW I also remember very well that I always understood the high GFLOPS number to be because of texture filtering & such.
How is the hot clock frequency of a GTX 280 determined? Is it derived from the base or memory frequencies, or is it independent, and thus something one needs to find in the documentation for a given board?
[edit: This appears to be (variously) called the ALU clock or Shader clock.]
The hot clock is just a 2x multiplier of the scheduler frequency.
To be precise, texture filtering itself actually adds up into the teraflops. A single aniso-16x trilinear fetch takes on the order of 100 ops, and fetches can be processed quickly by the TM units if the data’s in the cache. What was being measured was something else. I’d read (on Beyond3D, in fact) it was something like a multiply of the texture fetch result, probably used for advanced alpha blending. The main distinction between it and aniso filtering was presumably that aniso uses pre-determined coefficients while this multiply could use a programmed value. Hence, it wasn’t “special-purpose hardware.”