Does anyone know how NVIDIA computes the CPU GFLOPS for the GFLOPS graph at the start of Section 1 in the CUDA Programming Guide? Ivy Bridge performance is shown to be around 650 GFLOPS. However, Intel's documentation for the E5-2600 series CPUs lists GFLOPS figures below 200 (http://download.intel.com/support/processors/xeon/sb/xeon_E5-2600.pdf).
From what I can tell, the most powerful IvyBridge processors have 15 cores running at 2.8 GHz (Xeon E7-2890 v2) where each core can deliver 16 SP or 8 DP results per cycle. If you multiply out those numbers, you get a theoretical throughput of 672 GFLOPS SP, or 336 GFLOPS DP, which seems entirely consistent with the graphs in the Programming Guide.
The 650 number in that graph is a single-precision number. (The corresponding DP number is half as big.)
Ivy Bridge (and Sandy Bridge) support 256-bit AVX and can sustain 16 SP or 8 DP ops per clock per core.
A high-end Ivy Bridge can have up to 12 cores (EP) or even 15 cores (EX), at clock frequencies up to 2.8 GHz (EX) or 3.0 GHz (EP).
Therefore an Ivy Bridge EX processor with 15 cores, 16 SP flops/cycle, and 2.8 GHz would hit a theoretical 672 GFLOPS (SP).
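The arithmetic above can be written out as a small Python sketch. The helper names are made up for illustration. The factor of 2 per SIMD lane is an assumption that one 256-bit AVX add and one 256-bit AVX multiply can issue in the same cycle, which is one way to arrive at the 16 SP / 8 DP per-cycle figures quoted above.

```python
# Sketch of the theoretical peak calculation; helper names are illustrative only.

def flops_per_cycle(vector_bits, element_bits, ops_per_lane=2):
    """Per-core flops per cycle = SIMD lanes * operations issued per lane.

    ops_per_lane=2 is an assumption: one vector add plus one vector
    multiply issued in the same cycle (no FMA).
    """
    lanes = vector_bits // element_bits
    return lanes * ops_per_lane


def peak_gflops(cores, clock_mhz, per_cycle):
    """Chip-wide theoretical peak in GFLOPS (clock given in MHz)."""
    return cores * clock_mhz * per_cycle / 1000


sp_per_cycle = flops_per_cycle(256, 32)  # 8 lanes * 2
dp_per_cycle = flops_per_cycle(256, 64)  # 4 lanes * 2

# 15-core Ivy Bridge EX at 2.8 GHz, as in the post above
print(f"SP: {peak_gflops(15, 2800, sp_per_cycle):.0f} GFLOPS")
print(f"DP: {peak_gflops(15, 2800, dp_per_cycle):.0f} GFLOPS")
```

This is a peak number only: it assumes every core issues full-width vector adds and multiplies every cycle, which real code rarely sustains.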
And how is this measured on the GPU side?
Surely the serial-implementation versus parallel-implementation paradigm would bias the results in favour of the CPU?
To test accurately, you must let the CPU run serial and the GPU parallel, is that not so?
Thanks. I wonder why Intel reports significantly lower numbers in their export compliance documentation.
Perhaps pay attention to the precision (single versus double), the instruction counted (add, multiply, or multiply-and-add), and whether SIMD is used or not, as stated for each of the different measures.
GPU theoretical flops calculation is similar conceptually. It will vary by GPU just as the CPU calculation varies by CPU architecture and model.
To use K40m as an example:
there are 15 SMs (2880/192), each with 64 DP ALUs that are capable of retiring one DP FMA instruction per cycle (== 2 DP Flops per cycle).
15 SMs x 64 DP ALUs x 2 flops/cycle x 745 MHz = 1.43 TFLOPS
which is the stated perf.
The change for SP is that there are 192 SP ALUs per SM instead of 64 in the DP case, which yields exactly a tripling (of the DP perf) for SP perf: 3*1.43 = 4.29 TF SP
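As a rough sketch, the same K40m calculation in Python looks like this. The function and variable names are illustrative; the unit counts and the 745 MHz clock are the figures quoted above.

```python
# Theoretical peak for the K40m example; names are illustrative only.

def gpu_peak_tflops(sm_count, units_per_sm, clock_mhz, flops_per_unit=2):
    """Peak TFLOPS, counting one FMA per unit per cycle as 2 flops."""
    return sm_count * units_per_sm * flops_per_unit * clock_mhz / 1e6


cuda_cores = 2880
sm_count = cuda_cores // 192  # 192 SP units per SM -> 15 SMs

dp_tflops = gpu_peak_tflops(sm_count, 64, 745)   # 64 DP units per SM
sp_tflops = gpu_peak_tflops(sm_count, 192, 745)  # 192 SP units per SM

print(f"DP: {dp_tflops:.2f} TFLOPS")
print(f"SP: {sp_tflops:.2f} TFLOPS")
print(f"SP/DP ratio: {sp_tflops / dp_tflops:.0f}")
```

The 3:1 SP to DP ratio falls straight out of the 192 versus 64 unit counts per SM.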
GPU architecture diagrams hardly show ALUs, just cores, with FPUs and I think “int units”; so when you refer to ALUs, I should think of cores, right?
Also, are these units generally SIMT, or plain single instruction, do you know?
A GPU SM is a collection of various kinds of functional units managed by scheduler(s), roughly speaking. My sloppy terminology was not intended to be perfect from a semantic point of view, but merely to communicate a concept in the calculation of peak theoretical performance. The term “core” in my opinion most closely translates to what I called a “SP ALU”. I base this on page 8 of the GK110 white paper, which gives a reasonably good picture of the major functional units and organization, as well as (at least one definition of) NVIDIA terminology used to reference them:
txbob: GK110 whitepaper p8:
SP ALU -> Core
DP ALU -> DP Unit
In the absence of a grand poobah who declares truth, I think it’s reasonable to use “Core” and “DP Unit”, not my sloppy usage of the term “ALU”. If you have other terms you’d like to use, such as FPUs, have at it.
SIMT (single instruction multiple thread) refers to the idea that the scheduler, while processing what is effectively a single instruction stream, will marshal multiple functional units (in some fashion) to perform the activities of “multiple threads” “simultaneously”. To use a specific example: 32 “Cores” (i.e. SP ALUs in txbob-speak) might be scheduled together to process a warp’s activity associated with an SP FMA instruction coming from the (single) instruction stream.
For a better description, refer to the programming guide.
I take no issue with your terminology.
And you have provided concrete references, thanks.
SIMT is closely related to SIMD.
So, I take it that the SM ALUs or FPUs are generally SIMT/SIMD, or not?
Data sheets would normally stipulate x number of CUDA cores per SM, for arithmetic instructions.
So, can one view the execution of arithmetic instructions by such cores from a SIMD viewpoint, or not (in the same way that one can view arithmetic on the CPU FPU as SIMD, in the cases where SIMD is used)?
In the above, perhaps SIMT/SIMD should merely be taken to mean a single instruction performed on multiple input/data/register sets.
Each single “CUDA core” (or FPU in non-Nvidia terminology) by itself is not SIMT or SIMD.
SIMT is the fact that a single instruction commands 32 (i.e. the warpsize) “CUDA cores” to perform the same operation.
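A toy Python model may make the distinction concrete. It is only a sketch of the concept, not of how the hardware is implemented, and the names (`issue_to_warp`, `fma`) are invented for illustration. Each lane stands for one “CUDA core” doing an ordinary scalar operation; the SIMT part is that one issued instruction drives all 32 lanes, each with its own registers.

```python
# Toy model of SIMT: one instruction, 32 lanes, per-lane operands.
# Conceptual illustration only; all names here are hypothetical.

WARP_SIZE = 32  # threads per warp


def fma(a, b, c):
    """What a single lane does: a plain scalar multiply-add."""
    return a * b + c


def issue_to_warp(instruction, *operand_registers):
    """Apply one instruction across every lane of a warp.

    Each operand is a list of WARP_SIZE values, one per thread.
    """
    assert all(len(reg) == WARP_SIZE for reg in operand_registers)
    return [instruction(*lane_ops) for lane_ops in zip(*operand_registers)]


a = [float(i) for i in range(WARP_SIZE)]  # differs per thread
b = [2.0] * WARP_SIZE
c = [1.0] * WARP_SIZE

result = issue_to_warp(fma, a, b, c)  # lane i holds 2*i + 1

# One warp-wide FMA counts as 32 lanes * 2 flops
flops_per_warp_instruction = WARP_SIZE * 2
print(result[:4], flops_per_warp_instruction)
```

Seen this way, the per-SM core count in a data sheet says how many such lanes exist, while the warp size says how many of them a single instruction commands at once.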
Intel responded to my request for more information on their export compliance documentation:
“GFLOPs is chip-wide/combined performance… All GFLOPS, CTP and APP calculations contained herein were based on specifications taken from Intel datasheets…”
Intel appears to have a more conservative method for determining theoretical peak GFLOPS. For example, they report that their E5-2690 has a peak performance of 185.6 GFLOPS. However, in order to make fair apples-to-apples comparison among Intel CPUs and NVIDIA GPUs, the more optimistic method described by njuffa and txbob seems necessary (672 GFLOPS for the E5-2690).
My computation of 336/672 GFLOPS (DP/SP) in #2 above was for the Xeon E7-2890 v2, which is an IvyBridge EX-class server CPU. The E5-2690 is a SandyBridge EP-class server CPU with 8 cores running at 2.9 GHz, where each core produces 8/16 (DP/SP) results per cycle. Thus theoretical GFLOPS for this CPU are 186/371 (DP/SP), and the DP figure matches the 185.6 GFLOPS you quote from Intel documentation.
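Putting the two parts side by side shows where the mix-up came from. This is a minimal sketch using only the core counts, clocks, and per-cycle figures given in this thread; the dictionary layout and function name are illustrative.

```python
# Side-by-side peak figures for the two CPUs discussed in this thread.
# Layout and names are illustrative; the inputs are the thread's numbers.

DP_PER_CYCLE = 8    # DP results per cycle per core
SP_PER_CYCLE = 16   # SP results per cycle per core

CPUS = {
    "Xeon E5-2690 (Sandy Bridge EP)": (8, 2900),    # (cores, clock in MHz)
    "Xeon E7-2890 v2 (Ivy Bridge EX)": (15, 2800),
}


def peak_gflops(cores, clock_mhz, per_cycle):
    """Chip-wide theoretical peak in GFLOPS."""
    return cores * clock_mhz * per_cycle / 1000


table = {
    name: (peak_gflops(cores, mhz, DP_PER_CYCLE),
           peak_gflops(cores, mhz, SP_PER_CYCLE))
    for name, (cores, mhz) in CPUS.items()
}

for name, (dp, sp) in table.items():
    print(f"{name}: {dp:.1f} DP / {sp:.1f} SP GFLOPS")
```

With these inputs, Intel's 185.6 GFLOPS for the E5-2690 corresponds to the DP column, so the same counting method reproduces both sets of numbers.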
When making these comparisons, please look closely at exact processor designations including possible v2 and v3 suffixes.
Cripes! As if Intel didn’t make their naming scheme difficult enough! Thanks for pointing this out.