GPU vs CPU theoretical single-precision peak performance

Hi guys,

Does anyone know where to find more recent data like that in Fig. 1.1 of the programming guide (I need a graph like that)? I know how to calculate the theoretical peak GFLOPS of a GPU, but I don’t know how to calculate it for CPUs.



For a CPU, the convention is usually FLOP/s = number of cores * core frequency * FLOP per core per cycle. You will need to consult the specifications and documentation for whatever CPU you are interested in to get the necessary data. For the Intel Conroes I work with quite a bit, we use FLOP/s = 2 cores * 3.0 GHz * 4 FLOP per core per cycle = 24 GFLOP/s per CPU.
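
For anyone who wants to plug in their own numbers, here is a minimal Python sketch of that formula. The function name `peak_gflops` and its parameter names are made up for illustration; the inputs are the Conroe figures quoted above.

```python
def peak_gflops(cores: int, clock_ghz: float, flops_per_core_per_cycle: int) -> float:
    """Theoretical peak in GFLOP/s: cores * clock (GHz) * FLOP per core per cycle."""
    return cores * clock_ghz * flops_per_core_per_cycle


# Dual-core Conroe at 3.0 GHz with 4 FLOP per core per cycle (figures from the post).
conroe_dp = peak_gflops(cores=2, clock_ghz=3.0, flops_per_core_per_cycle=4)
print(f"{conroe_dp:.1f} GFLOP/s")  # 24.0 GFLOP/s
```

The per-cycle FLOP count is the one input that has to come from the CPU documentation; the other two are on any spec sheet.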

Many thanks avidday!


Single or double precision? Because for single precision you should raise your peak estimate to 48 GFLOPS.

Sorry, I should have mentioned that this figure is for double precision (i.e. HPLinpack) reporting.

So for single precision, the number of FLOPs per cycle per core on Conroe is 8, right? (Sorry for the stupid question.) Is this number what is usually called Instructions Per Cycle (IPC)?
Many thanks jma and avidday!

A Core2 will do one 4-element-wide MADD per clock (which counts as 8 FLOPs) for each core in the CPU.

An Atom will do only half of that. Multiply by number of cores and GHz.
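
As a rough sketch of that multiplication in Python: the per-clock single-precision counts (8 for Core2, half of that for Atom) are the ones given above, while the helper name, the table name, and the Atom's core count and 1.6 GHz clock are assumptions picked purely for illustration.

```python
# Single-precision FLOP per core per clock, as stated in this thread.
SP_FLOPS_PER_CLOCK = {"Core2": 8, "Atom": 4}


def sp_peak_gflops(cpu: str, cores: int, clock_ghz: float) -> float:
    """Theoretical single-precision peak in GFLOP/s for one CPU."""
    return SP_FLOPS_PER_CLOCK[cpu] * cores * clock_ghz


# Dual-core 3.0 GHz Core2: the 48 GFLOPS single-precision estimate mentioned earlier.
core2_sp = sp_peak_gflops("Core2", cores=2, clock_ghz=3.0)

# ASSUMPTION: a single-core Atom at 1.6 GHz, chosen only as an example.
atom_sp = sp_peak_gflops("Atom", cores=1, clock_ghz=1.6)

print(f"Core2: {core2_sp:.1f} GFLOP/s, Atom: {atom_sp:.1f} GFLOP/s")
```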

And peak performance is just that: a peak, and especially not a real-world guesstimate. Most of the time you’ll instead be restricted by bandwidth, latencies and/or permutations.

IPC would also include whatever else can be done in parallel, like loads and stores and the evaluation of loops and branches. For the Core2, this is about 4 (where the MADD again counts as two, because it really is one MUL followed by one ADD), or if you insist, 6 (peak). The Atom will do 2 instructions/clock.

For the LINPACK benchmark using Intel MKL, I get about 21 double precision Gflop/s from a theoretical 24 Gflop/s using both cores of an E6850 with DDR2-800 CL5 RAM. For a Q9550 I get around 40 Gflop/s out of a theoretical 45.3 Gflop/s with the same DDR2-800 CL5 memory (the Core2 FSB doesn’t scale so well on quad core).
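
To put those numbers as a fraction of peak, a quick Python sketch follows. The `efficiency` helper and the `runs` table are illustrative names only; the measured and theoretical values are the approximate ones quoted above, so the percentages are only as precise as those roundings.

```python
def efficiency(measured_gflops: float, peak_gflops: float) -> float:
    """Fraction of theoretical peak actually achieved."""
    return measured_gflops / peak_gflops


# (measured, theoretical) double-precision Gflop/s from the LINPACK runs above.
runs = {"E6850": (21.0, 24.0), "Q9550": (40.0, 45.3)}

for name, (measured, peak) in runs.items():
    print(f"{name}: {efficiency(measured, peak):.1%} of peak")
# E6850: 87.5% of peak
# Q9550: 88.3% of peak
```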

IIRC, Intel Linpack will use single precision wherever they can get away with it without getting caught - which should be fair and within the rules.

It has been ages since I did any serious LINPACK runs on our iron, but as I remember it, I only got about 5% lower performance using a LINPACK build with GotoBLAS compiled with icc/ifc on a single E6850. So if they are “cheating” they aren’t getting much out of doing so.

Clear as day, many thanks!


Actually that would not be within the rules.