GPU vs CPU theoretical single-precision peak performance

CaLu · November 23, 2009, 1:27pm

Hi guys,

does anyone know where to find more recent data like that in Fig. 1.1 of the programming guide (I need a graph like that)? I know how to calculate the theoretical peak GFLOPS of a GPU, but I don’t know how to calculate it for CPUs.

Cheers,

luca

avidday · November 23, 2009, 1:36pm

For a CPU, the convention is usually FLOP/s = number of cores * core frequency * FLOP per core per cycle. You will need to consult the specifications and documentation for whatever CPU you are interested in to get the necessary data. For the Intel Conroes I work with quite a bit (2 cores, 3.0 GHz, 4 FLOP per core per cycle), we use FLOP/s = 2 * 3.0G * 4 = 24 GFLOP/s per CPU.
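
As a rough sketch, the formula above could be written in Python like this. The function name `peak_gflops` and its parameter names are made up for illustration; the numbers are the Conroe figures quoted above.

```python
def peak_gflops(cores, clock_ghz, flop_per_core_per_cycle):
    """Theoretical peak in GFLOP/s: cores * clock (GHz) * FLOP per core per cycle."""
    return cores * clock_ghz * flop_per_core_per_cycle

# Dual-core Conroe at 3.0 GHz, 4 FLOP per core per cycle
conroe_peak = peak_gflops(cores=2, clock_ghz=3.0, flop_per_core_per_cycle=4)
print(conroe_peak)  # 24.0
```

The FLOP per core per cycle value is the part that has to come from the vendor documentation for the CPU in question.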

CaLu · November 23, 2009, 2:09pm

Many thanks avidday!
cheers,

luca

jma · November 23, 2009, 6:16pm

Single or double precision? Because for single precision you should raise your peak estimate to 48 GFLOPS.

avidday · November 23, 2009, 6:22pm

Sorry, I should have mentioned that this is for double precision (i.e. HPLinpack) reporting.

CaLu · November 23, 2009, 9:37pm

So for single precision the number of Conroe FLOPs per core per cycle is 8, right? (Sorry for the stupid question.) Is this number usually called Instructions Per Cycle (IPC)?
Many thanks jma and avidday!

jma · November 23, 2009, 9:59pm

A Core2 will do one 4 element wide MADD - counts as 8 FLOPS - per clock for each core in the CPU.

An Atom will do only half of that. Multiply by number of cores and GHz.
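
A minimal sketch of that single-precision arithmetic in Python, using the per-clock figures from this post (8 for Core2, half of that for Atom). The table and function names are invented for the example, and the 1.6 GHz single-core Atom is only a hypothetical configuration, not a part mentioned in this thread.

```python
# Single-precision FLOP per core per clock, as stated in the post
SP_FLOP_PER_CLOCK = {"Core2": 8, "Atom": 4}

def sp_peak_gflops(arch, cores, clock_ghz):
    """Single-precision peak: FLOP per core per clock * cores * GHz."""
    return SP_FLOP_PER_CLOCK[arch] * cores * clock_ghz

# Dual-core Core2 at 3.0 GHz (the E6850-class part discussed in this thread)
print(sp_peak_gflops("Core2", cores=2, clock_ghz=3.0))  # 48.0

# Hypothetical single-core Atom at 1.6 GHz (assumed clock and core count)
print(sp_peak_gflops("Atom", cores=1, clock_ghz=1.6))
```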

And peak performance is just that: a peak, and especially not a real-world guesstimate. Most of the time you’ll instead be restricted by bandwidth, latencies and/or permutations.

IPC would include what else can be done in parallel, like loads and stores and evaluation of loops and branches. For the Core2, this is about 4 - where the MADD again counts as two because it really is one MUL followed by one ADD - or if you insist: 6 (peak). The Atom will do 2 instructions/clock.

avidday · November 23, 2009, 10:12pm

For the LINPACK benchmark using Intel MKL, I get about 21 double precision Gflop/s from a theoretical 24 Gflop/s using both cores of an E6850 with DDR2-800 CL5 RAM. For a Q9550 I get around 40 Gflop/s out of a theoretical 45.3 Gflop/s with the same DDR2-800 CL5 memory (the Core2 FSB doesn’t scale so well on quad core).
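
To put those numbers side by side, here is a small Python sketch that turns the measured and theoretical figures above into a fraction of peak. The helper name `efficiency` is just for illustration; the inputs are the approximate values quoted in this post.

```python
def efficiency(measured_gflops, peak_gflops):
    """Fraction of theoretical peak actually achieved."""
    return measured_gflops / peak_gflops

# (CPU, measured LINPACK Gflop/s, theoretical double-precision peak Gflop/s)
results = [("E6850", 21.0, 24.0), ("Q9550", 40.0, 45.3)]

for name, measured, peak in results:
    print(f"{name}: {efficiency(measured, peak):.1%} of peak")
```

Both parts land a little under 90% of their theoretical double-precision peak.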

jma · November 23, 2009, 10:26pm

IIRC, Intel Linpack will use single precision wherever they can get away with it without getting caught - which should be fair and within the rules.

avidday · November 23, 2009, 10:30pm

It has been ages since I did any serious LINPACK runs on our iron, but as I remember it, I only got about 5% lower performance using a LINPACK build with GotoBLAS compiled with icc/ifc on a single E6850. So if they are “cheating” they aren’t getting much out of doing so.

CaLu · November 23, 2009, 10:31pm

Clear as day, many thanks!
Cheers,

luca

theMarix · November 26, 2009, 4:30pm

Actually that would not be within the rules.