# Kepler K20X Boasts 1.3 TFLOPS, but Let's Compute This Manually

Some background:
It is stated that the K20X has 2688 cores. It is also shown that there are 15 streaming multiprocessors (SMX), and that each SMX has 192 single-precision cores. However, multiplying 192 x 15 does not yield 2688 cores. Am I missing something here?

Processor
Flops: 1.3 TFLOPS
Cores: 2688
Clock: 732 MHz

Streaming Multiprocessors (SMX)
15 SMXs
192 SP CUDA Cores*, 64 DP units per SMX
4 Warp Schedulers per SMX
4 Warp Schedules/cycle; 2 Dispatches/schedule
Instr issue: 4 x 2 = 8 instr / cycle (http://tinyurl.com/c3drc6d page 8)

MY CALCULATION of Computation Density (CD):
CD = f x N x #instr per cycle
CD = freq x (# of cores per SMX x # of SMX) x #instr per cycle
CD_sp = 0.732 GHz x (192 x 15 ) x (8 x 32 bus bandwidth / 32 bit precision) = 8432 GOPS ???
CD_dp = 0.732 GHz x (64 x 15 ) x (8 x 32 bus bandwidth / 64 bit precision) = 2811 GOPS ???

The results obtained don't tally with NVIDIA's reported FLOPS. How did they get the numbers? Help needed!

I cannot tell what official document the K20X clock frequency number is from. I will assume for the sake of argument that the number stated above is the correct number.

Then 2688 cores * 732e6 cycles/s * 1 single-precision FMA/core/cycle * 2 floating-point operations/FMA = 3.935232e12 floating-point operations/s = 3.94 TFLOPS single precision. Double-precision throughput would be one third of that, that is, 1.31 TFLOPS double precision. These numbers match up exceedingly well with the performance numbers (1.31 TFLOPS DP, 3.95 TFLOPS SP) stated in the following document:

http://www.nvidia.com/content/tesla/pdf/Tesla-KSeries-Overview-LR.pdf

K20X
14 SMs
192 SP cores/SM
64 DP cores/SM
2688 CUDA cores = 14 SMs * 192 CUDA cores/SM
2 FLOPS/instruction assuming FMA

0.732 GHz * 14 SM * 192 FP32 FMAs/cycle * 2 ops/FMA = 3.94 SP TFLOPS
0.732 GHz * 14 SM * 64 FP64 DFMAs/cycle * 2 ops/DFMA = 1.31 DP TFLOPS
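
For anyone who wants to rerun the arithmetic, here is a minimal Python sketch of the two lines above. The function name `peak_flops` and the constant names are made up for illustration, and it assumes the 732 MHz clock and the 14 SMX count quoted in this thread are correct. It gives a theoretical peak only (every unit retiring one FMA per cycle), not what a real kernel will achieve.

```python
def peak_flops(clock_hz, smx_count, units_per_smx, flops_per_fma=2):
    """Theoretical peak FLOPS, assuming each unit retires one FMA per cycle.

    An FMA (fused multiply-add) counts as 2 floating-point operations.
    """
    return clock_hz * smx_count * units_per_smx * flops_per_fma


# Figures quoted in the thread (assumed correct, not independently verified).
CLOCK_HZ = 732_000_000   # 732 MHz
SMX_COUNT = 14           # K20X
SP_UNITS_PER_SMX = 192   # single-precision CUDA cores per SMX
DP_UNITS_PER_SMX = 64    # double-precision units per SMX

sp = peak_flops(CLOCK_HZ, SMX_COUNT, SP_UNITS_PER_SMX)
dp = peak_flops(CLOCK_HZ, SMX_COUNT, DP_UNITS_PER_SMX)

print(f"SP peak: {sp / 1e12:.3f} TFLOPS")
print(f"DP peak: {dp / 1e12:.3f} TFLOPS")
```

Because each SMX has 64 DP units against 192 SP cores, the DP peak comes out at exactly one third of the SP peak.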

2688 is specified in the Tesla K20X GPU Accelerator Board Specification: http://www.nvidia.com/content/PDF/kepler/Tesla-K20X-BD-06397-001-v05.pdf

I can’t find any NVIDIA document that specifically states the number of SMXs in a K20X, but will note that 14 * 192 = 2688, and the latter number is stated in the document I cited above. In addition, page 6 of the following document seems to suggest that the correct number for K20X is 14 SMX:

The Quadro K6000 is listed as having 2880 cores, which is 15 * 192:
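
As a quick sanity check on the core counts, the number of active SMXs can be backed out from the total core count, assuming 192 single-precision cores per SMX as stated earlier in the thread. This is only a sketch; the helper name `smx_count` is hypothetical.

```python
CORES_PER_SMX = 192  # single-precision CUDA cores per Kepler SMX (from the thread)


def smx_count(total_cores, cores_per_smx=CORES_PER_SMX):
    """Return the SMX count implied by a total core count.

    Raises ValueError if the total is not a whole multiple of cores_per_smx,
    which usually means the core count was misquoted.
    """
    count, remainder = divmod(total_cores, cores_per_smx)
    if remainder:
        raise ValueError(
            f"{total_cores} is not a multiple of {cores_per_smx}"
        )
    return count


k20x_smx = smx_count(2688)    # Tesla K20X
k6000_smx = smx_count(2880)   # Quadro K6000

print(f"K20X:  {k20x_smx} SMX")
print(f"K6000: {k6000_smx} SMX")
```

A total that is not a whole multiple of 192 raises an error here, which is a cheap way to catch a mistyped core count.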