Comparing CPU and GPU Theoretical GFLOPS

Arakageeta · May 21, 2014, 4:35pm

Does anyone know how NVIDIA computes the CPU GFLOPS for the GFLOPS graph at the start of Section 1 in the CUDA Programming Guide? Ivy Bridge perf is shown to be around 650 GFLOPS. However, looking at Intel documentation for E5-2600 series CPUs, GFLOPs is listed < 200 (http://download.intel.com/support/processors/xeon/sb/xeon_E5-2600.pdf).

njuffa · May 21, 2014, 5:37pm

From what I can tell, the most powerful IvyBridge processors have 15 cores running at 2.8 GHz (Xeon E7-2890 v2) where each core can deliver 16 SP or 8 DP results per cycle. If you multiply out those numbers, you get a theoretical throughput of 672 GFLOPS SP, or 336 GFLOPS DP, which seems entirely consistent with the graphs in the Programming Guide.

Robert_Crovella · May 21, 2014, 5:39pm

The 650 number in that graph is a single-precision number. (The corresponding DP number is half as big.)

Ivy Bridge (and Sandy Bridge) support 256-bit AVX and can sustain 16 SP or 8 DP ops per clock per core:

[url]http://www.nas.nasa.gov/hecc/support/kb/Ivy-Bridge-Processors_445.html[/url]

A high-end Ivy Bridge can have up to 12 cores (EP) or even 15 cores (EX), at clock frequencies up to 2.8GHz (EX) or 3.0GHz (EP):

[url]http://en.wikipedia.org/wiki/Ivy_Bridge_(microarchitecture)[/url]

Therefore an Ivy Bridge EX processor with 15 cores, 16 SP flops/cycle, and 2.8GHz would hit a theoretical 672 GFlops/sec (SP).
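
For anyone who wants to check the arithmetic, here is a minimal sketch of the same calculation. The function name `peak_gflops` is made up for illustration; the inputs are the Xeon E7-2890 v2 figures quoted in this thread.

```python
def peak_gflops(cores: int, flops_per_cycle: int, clock_ghz: float) -> float:
    """Theoretical peak = cores x flops per cycle per core x clock in GHz."""
    return cores * flops_per_cycle * clock_ghz


# Figures quoted in this thread for the Xeon E7-2890 v2 (Ivy Bridge EX).
sp = peak_gflops(cores=15, flops_per_cycle=16, clock_ghz=2.8)
dp = peak_gflops(cores=15, flops_per_cycle=8, clock_ghz=2.8)

print(f"SP: {sp:.0f} GFLOPS, DP: {dp:.0f} GFLOPS")
```

This is a peak number only: it assumes every core issues full-width AVX work on every cycle, which real code rarely manages.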

little_jimmy · May 21, 2014, 6:09pm

And how is this measured on the gpu side?

Surely the serial-implementation versus parallel-implementation paradigm would bias the results in favour of the cpu?
To accurately test, you must let the cpu run serial, and the gpu parallel, not so?

Arakageeta · May 21, 2014, 6:09pm

Thanks. I wonder why Intel reports significantly lower numbers in their export compliance documentation.

little_jimmy · May 21, 2014, 6:30pm

Perhaps pay attention to precision (single versus double), instruction (add, multiply or multiply and add) and SIMD (used, not used) stated in the different measures
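
A rough sketch of how those three factors change the per-core figure. The helper `flops_per_cycle` is hypothetical, and `ops_per_lane=2` is an assumption: it treats the 16 SP / 8 DP per-clock numbers quoted above as one add plus one multiply per 256-bit AVX lane per cycle.

```python
def flops_per_cycle(vector_bits: int, element_bits: int, ops_per_lane: int) -> int:
    """Per-core flops per cycle for a given SIMD width, precision and op mix."""
    lanes = vector_bits // element_bits  # how many values fit in one register
    return lanes * ops_per_lane


# ops_per_lane=2 is an assumption: one add and one multiply issued per cycle.
avx_sp = flops_per_cycle(vector_bits=256, element_bits=32, ops_per_lane=2)
avx_dp = flops_per_cycle(vector_bits=256, element_bits=64, ops_per_lane=2)
# Counting only adds (or only multiplies) halves the figure.
avx_dp_add_only = flops_per_cycle(vector_bits=256, element_bits=64, ops_per_lane=1)
# Without SIMD, a single 64-bit value is processed per operation.
scalar_dp = flops_per_cycle(vector_bits=64, element_bits=64, ops_per_lane=2)

print(avx_sp, avx_dp, avx_dp_add_only, scalar_dp)
```

Under these assumptions the same core can be rated anywhere from 2 to 16 flops per cycle, so two published numbers only compare if they state the same precision, op mix and SIMD usage.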

Robert_Crovella · May 21, 2014, 7:02pm

GPU theoretical flops calculation is similar conceptually. It will vary by GPU just as the CPU calculation varies by CPU architecture and model.

To use K40m as an example:

[url]http://www.nvidia.com/content/PDF/kepler/Tesla-K40-PCIe-Passive-Board-Spec-BD-06902-001_v05.pdf[/url]

there are 15 SMs (2880/192), each with 64 DP ALUs that are capable of retiring one DP FMA instruction per cycle (== 2 DP Flops per cycle).

15 x 64 x 2 x 745 MHz = 1.43 TFlops/sec

which is the stated perf.

The change for SP is that there are 192 SP ALUs per SM instead of 64 in the DP case, which yields exactly a tripling (of the DP perf) for SP perf: 3*1.43 = 4.29 TF SP
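
The K40m numbers can be reproduced with a short sketch like the one below. The function name `gpu_peak_tflops` is invented for illustration; the SM count, unit counts, clock and the 2-flops-per-FMA convention all come from the post above.

```python
def gpu_peak_tflops(sms: int, units_per_sm: int, clock_mhz: float,
                    flops_per_unit_per_cycle: int = 2) -> float:
    """Peak TFLOPS = SMs x units per SM x flops per unit per cycle x clock.

    The default of 2 counts one FMA as two floating-point operations.
    """
    flops_per_second = sms * units_per_sm * flops_per_unit_per_cycle * clock_mhz * 1e6
    return flops_per_second / 1e12


sms = 2880 // 192  # 2880 cores at 192 per SM gives 15 SMs

dp = gpu_peak_tflops(sms, units_per_sm=64, clock_mhz=745)
sp = gpu_peak_tflops(sms, units_per_sm=192, clock_mhz=745)

print(f"DP: {dp:.2f} TFLOPS, SP: {sp:.2f} TFLOPS")
```

The SP figure is three times the DP figure only because 192 / 64 = 3 on this part; other GPUs have different SP:DP unit ratios.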

little_jimmy · May 21, 2014, 7:13pm

txbob:

gpu architecture diagrams hardly show ALUs, just cores, with FPUs and I think “int units”; so when you refer to ALUs, I should think of cores, right?

Also, are these units generally SIMT, or plain single instruction, do you know?

Robert_Crovella · May 21, 2014, 7:34pm

A GPU SM is a collection of various kinds of functional units managed by scheduler(s), roughly speaking. My sloppy terminology was not intended to be perfect from a semantic point of view, but merely to communicate a concept in the calculation of peak theoretical performance. The term “core” in my opinion most closely translates to what I called a “SP ALU”. I base this on page 8 of the GK110 white paper, which gives a reasonably good picture of the major functional units and organization, as well as (at least one definition of) NVIDIA terminology used to reference them.

txbob: GK110 whitepaper p8:
SP ALU → Core
DP ALU → DP Unit

In the absence of a grand poobah who declares truth, I think it’s reasonable to use “Core” and “DP Unit”, not my sloppy usage of the term “ALU”. If you have other terms you’d like to use, such as FPUs, have at it.

Robert_Crovella · May 21, 2014, 7:46pm

SIMT (single instruction, multiple thread) refers to the idea that the scheduler, while processing what is effectively a single instruction stream, will marshal multiple functional units (in some fashion) to perform the activities of “multiple threads” “simultaneously”. To use a specific example: 32 “Cores” (i.e. SP ALUs in txbob-speak) might be scheduled together to process a warp’s activity associated with an SP FMA instruction coming from the (single) instruction stream.

For a better description, refer to the programming guide:

[url]http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#simt-architecture[/url]

little_jimmy · May 22, 2014, 4:52am

I take no issue with your terminology
And you have provided concrete references, thanks

SIMT is closely associated with/ related to SIMD
So, I take it that the SM ALUs or FPUs are generally SIMT/ SIMD, or not?
Data sheets would normally stipulate x number of cuda cores per SM, for arithmetic instructions
So, can one view the execution of arithmetic instructions by such cores from a SIMD viewpoint, or not (as one can view the execution of arithmetic by the cpu FPU as SIMD, in certain cases(when used))?
In the above, perhaps SIMT/ SIMD should merely be taken to mean single instruction performed on multiple input/ data/ register sets

tera · May 22, 2014, 8:38am

Each single “CUDA core” (or FPU in non-Nvidia terminology) by itself is not SIMT or SIMD.
SIMT is the fact that a single instruction commands 32 (i.e. the warpsize) “CUDA cores” to perform the same operation.
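
As a toy illustration only (plain Python, not how the hardware or CUDA is implemented), one way to picture this is a single FMA "instruction" applied across 32 lanes, each lane holding its own operands. The names `WARP_SIZE` and `issue_fma` are made up for this sketch.

```python
WARP_SIZE = 32  # number of threads in a warp, as stated above


def issue_fma(a: list, b: list, c: list) -> list:
    """Model one FMA instruction: every lane computes a*b+c on its own data.

    The Python loop runs lane by lane; on the GPU the lanes are handled by
    separate cores under the control of the same instruction.
    """
    assert len(a) == len(b) == len(c) == WARP_SIZE
    return [a[lane] * b[lane] + c[lane] for lane in range(WARP_SIZE)]


# Each lane gets different inputs, but there is only one instruction.
a = [float(lane) for lane in range(WARP_SIZE)]
b = [2.0] * WARP_SIZE
c = [1.0] * WARP_SIZE

result = issue_fma(a, b, c)
print(result[:4])
```

Each individual core in this picture does one scalar FMA; the "single instruction, multiple thread" part is the fact that one call drives all 32 of them.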

Arakageeta · May 23, 2014, 11:47pm

Intel responded to my request for more information on their export compliance documentation:
“GFLOPs is chip-wide/combined performance… All GFLOPS, CTP and APP calculations contained herein were based on specifications taken from Intel datasheets…”

Intel appears to have a more conservative method for determining theoretical peak GFLOPS. For example, they report that their E5-2690 has a peak performance of 185.6 GFLOPS. However, in order to make fair apples-to-apples comparison among Intel CPUs and NVIDIA GPUs, the more optimistic method described by njuffa and txbob seems necessary (672 GFLOPS for the E5-2690).

njuffa · May 24, 2014, 12:51am

My computation of 336/672 GFLOPS in #2 above was for the Xeon E7-2890 v2, which is an IvyBridge EX-class server CPU. The E5-2690 is a SandyBridge EP-class server CPU with 8 cores running at 2.9 GHz, where each core produces 8/16 results per cycle. Thus theoretical GFLOPS for this CPU are 186/371, and this would appear to match the number you quote from Intel documentation.
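
A quick sketch that checks these E5-2690 figures against the 185.6 GFLOPS quoted from the Intel document. The variable names are illustrative; the core count, clock and per-cycle results are the ones given above.

```python
# Xeon E5-2690 (Sandy Bridge EP) figures from this post.
cores = 8
clock_ghz = 2.9
dp_results_per_cycle = 8
sp_results_per_cycle = 16

dp_gflops = cores * dp_results_per_cycle * clock_ghz
sp_gflops = cores * sp_results_per_cycle * clock_ghz

intel_quoted_gflops = 185.6  # number quoted earlier in the thread

print(f"DP: {dp_gflops:.1f} GFLOPS, SP: {sp_gflops:.1f} GFLOPS")
print("Intel figure matches DP peak:", abs(dp_gflops - intel_quoted_gflops) < 0.05)
```

So the Intel number lines up with the double-precision peak for this part, while the 650-ish figure in the Programming Guide graph is a single-precision peak for a different, larger CPU.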

When making these comparisons, please look closely at exact processor designations including possible v2 and v3 suffixes.

Arakageeta · May 24, 2014, 3:34am

Cripes! As if Intel didn’t make their naming scheme difficult enough! Thanks for pointing this out.
