GPU vs. CPU graphs in documentation

Can someone from NVIDIA clarify figures 1 and 2 in the doc:

http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#compute-capabilities

These figures reference Intel Sandy Bridge (the latest Intel architecture), but Sandy Bridge
chips are available with many different core counts.

Further, Intel's Hyper-Threading allows two threads to run per core.

So do those graphs show CUDA cards versus a single Intel core running a single thread, or versus the maximum core count available on a Sandy Bridge chip with Hyper-Threading fully in use?

Also,

  • are there graphs for single precision performance on the Tesla cards?
  • is texture interpolation included in the GPU numbers, or is it handled by separate
    hardware (non-CUDA cores)?

The CUDA numbers in Figure 1 assume maximum throughput for the multiply-add instruction (2 floating-point operations per instruction) at the highest possible clock (GeForce cards have variable clock rates during processing, while Tesla cards have a fixed clock rate). The single precision rates for the Tesla cards are generally a little below those of GeForce cards because the Tesla cards are clocked more conservatively. You can compute the single precision numbers for Tesla yourself by multiplying the number of CUDA cores by the core clock rate, then by 2 (one multiply-add per core per clock). The texture interpolation hardware is separate from the CUDA cores and operates at less than single precision, so it is not included in the theoretical GFLOPS.
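
To make that arithmetic concrete, here is a minimal sketch (the formula is just cores × clock × 2 as described above; the example values are the published specs for a Tesla K20, which may not match whichever boards the figure actually plots):

    # Back-of-the-envelope peak single-precision GFLOPS for a GPU:
    # one multiply-add per CUDA core per clock = 2 floating-point operations.
    def gpu_peak_sp_gflops(cuda_cores, clock_ghz):
        return cuda_cores * clock_ghz * 2

    # Example: published Tesla K20 specs (2496 CUDA cores at ~0.706 GHz);
    # substitute the core count and clock of the card you care about.
    print(gpu_peak_sp_gflops(2496, 0.706))  # ~3524 GFLOPS, i.e. ~3.5 TFLOPS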

As for the Sandy Bridge numbers, I don’t know how those are calculated since I have not reproduced them myself. Given the single vs. double precision difference, it does look like they assume the use of SIMD instructions, which is sensible. If the Sandy Bridge performance is computed assuming the maximum throughput of the SIMD units on all the cores, then that result would be independent of the hyperthreading question, since Hyper-Threading shares each core's execution units between two threads rather than adding more of them.
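
For what it's worth, here is a sketch of how such a peak could be derived under those assumptions, with AVX on Sandy Bridge providing 8 single-precision (or 4 double-precision) lanes per 256-bit register and one vector add plus one vector multiply issued per cycle per core; the core count and clock below are purely illustrative, not necessarily the ones behind the figure:

    # Hypothetical CPU peak, assuming full SIMD throughput on every core.
    # issue_per_cycle=2 models one vector add + one vector multiply per cycle.
    def cpu_peak_gflops(cores, clock_ghz, simd_lanes, issue_per_cycle=2):
        return cores * clock_ghz * simd_lanes * issue_per_cycle

    # Illustrative 4-core, 3.3 GHz part: 8 SP lanes vs. 4 DP lanes
    # reproduces a 2x gap between single and double precision.
    print(cpu_peak_gflops(4, 3.3, simd_lanes=8))  # single precision: ~211 GFLOPS
    print(cpu_peak_gflops(4, 3.3, simd_lanes=4))  # double precision: ~106 GFLOPS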

Maybe someone who can reproduce that number can explain how it was derived…