Kepler K20X Boasts 1.3 TFLOPS, but Let's Compute This Manually

Some background:
It is stated that the K20X has 2688 cores. It is also shown that there are 15 streaming multiprocessors (SMX), and that each SMX has 192 single-precision cores. However, multiplying 192 x 15 does not yield 2688 cores. Am I missing something here?

Processor
Flops: 1.3 TFLOPS
Cores: 2688
Clock: 732 MHz

Streaming Multiprocessors (SMX)
15 SMXs
192 SP CUDA Cores*, 64 DP units per SMX
4 Warp Schedulers per SMX
32 threads/Warp
4 Warp Schedules/cycle; 2 Dispatches/schedule
Instr issue: 4 x 2 = 8 instr / cycle (http://tinyurl.com/c3drc6d page 8)

MY CALCULATION of Computation Density (CD):
CD = f x N x #instr per cycle
CD = freq x (# of cores per SMX x # of SMX) x #instr per cycle
CD_sp = 0.732 GHz x (192 x 15 ) x (8 x 32 bus bandwidth / 32 bit precision) = 8432 GOPS ???
CD_dp = 0.732 GHz x (64 x 15 ) x (8 x 32 bus bandwidth / 64 bit precision) = 2811 GOPS ???

The results obtained don't tally with Nvidia’s reported FLOPS. How did they get the numbers? Help needed!

I cannot tell what official document the K20X clock frequency number is from. I will assume for the sake of argument that the number stated above is the correct number.

Then 2688 cores * 732e6 cycles/s * 1 single-precision FMA/core/cycle * 2 floating-point operations/FMA = 3.93523e12 floating-point operations / sec = 3.94 TFLOPS single precision. Double precision TFLOPS would be one third of that, that is, 1.31 TFLOPS double precision. These numbers match up exceedingly well with the performance numbers (1.31 TFLOPS DP, 3.95 TFLOPS SP) stated in the following document:

http://www.nvidia.com/content/tesla/pdf/Tesla-KSeries-Overview-LR.pdf

K20X
14 SMs
192 SP cores/SM
64 DP cores/SM
2688 CUDA cores = 14 SMs * 192 CUDA cores/SM
2 FLOPS/instruction assuming FMA

0.732 GHz * 14 SM * 192 FP32 FMAs/cycle * 2 ops/FMA = 3.94 SP TFLOPS
0.732 GHz * 14 SM * 64 FP64 DFMAs/cycle * 2 ops/DFMA = 1.31 DP TFLOPS
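
As a quick sanity check, the arithmetic above can be scripted. This is only a minimal sketch: it assumes every unit retires one FMA per cycle (a theoretical peak, not a measured rate), and the function name `peak_gflops` is made up for illustration.

```python
def peak_gflops(clock_ghz, sm_count, units_per_sm, flops_per_instr=2):
    """Theoretical peak in GFLOPS.

    Assumes each unit issues one instruction per cycle and that the
    instruction is an FMA, counted as 2 floating-point operations.
    """
    return clock_ghz * sm_count * units_per_sm * flops_per_instr


# K20X figures quoted in this thread: 732 MHz, 14 SMs,
# 192 SP cores and 64 DP units per SM.
sp = peak_gflops(0.732, 14, 192)
dp = peak_gflops(0.732, 14, 64)

print(f"SP: {sp / 1000:.2f} TFLOPS")
print(f"DP: {dp / 1000:.2f} TFLOPS")
```

The DP/SP ratio comes out at exactly 1/3 because it is just 64/192, the ratio of DP units to SP cores per SM.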

2688 is specified in the Tesla K20X GPU Accelerator Board Specification: http://www.nvidia.com/content/PDF/kepler/Tesla-K20X-BD-06397-001-v05.pdf

I can’t find any NVIDIA document that specifically states the number of SMXs in a K20X, but will note that 14 * 192 = 2688, and the latter number is stated in the document I cited above. In addition, page 6 of the following document seems to suggest that the correct number for K20X is 14 SMX:

http://docs.nvidia.com/cuda/samples/6_Advanced/simpleHyperQ/doc/HyperQ.pdf

The Quadro K6000 is listed as having 2880 cores, which is 15 * 192:

http://www.nvidia.com/content/PDF/data-sheet/6606_NV_DS_QuadroK6000_JUL13_NV_A4_LR.pdf
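
The same reasoning (published core count divided by 192 cores per SMX) can be written down as a small check. This is just a sketch; the helper name `smx_count` is hypothetical, and the 192 cores/SMX figure is the Kepler number quoted earlier in this thread.

```python
CORES_PER_SMX = 192  # SP CUDA cores per Kepler SMX, as quoted in this thread


def smx_count(total_cores, cores_per_smx=CORES_PER_SMX):
    """Infer the number of enabled SMXs from a published core count.

    Raises ValueError if the core count is not a whole multiple of
    cores_per_smx, which would suggest a typo in one of the numbers.
    """
    count, remainder = divmod(total_cores, cores_per_smx)
    if remainder:
        raise ValueError(
            f"{total_cores} is not a multiple of {cores_per_smx}"
        )
    return count


print("K20X SMXs: ", smx_count(2688))
print("K6000 SMXs:", smx_count(2880))
```

A core count that is not a multiple of 192 makes the helper raise, which is a cheap way to catch a mistyped spec number.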

Ugh, the K6000 looks like a beast. It hurts, having just bought a GTX Titan, lol.

I’m putting together a CUDA demo system for my custom application, and am now trying to decide, on a $/FLOPS basis, between:

  1. Single Tesla K20c
  2. Dual Tesla K20c
  3. Single Quadro K6000

I cannot find the FLOPS rating for the K6000 (or its processor clock rate) anywhere from a vendor or Nvidia. I only see it in reviews like Tom’s HW or Anandtech, etc., so I don’t know if that’s what the buyer will get.
How can I find a hard FLOPS figure for the K6000 card so I can make a choice? The 12 GB of memory is more than I need.
Thanks for any suggestion.
John