K20m theoretical peak single-precision FLOP/s reported by Visual Profiler seems to be wrong

Dear All,
I want to test the throughput of the FFMA instruction on a K20m GPU, so I need to compute its theoretical peak single-precision FLOP/s.

My K20m GPU has 13 SMs, each SM has 192 cores, so in total there are 2496 cores.
The clock frequency reported by deviceQuery is 706 MHz.
By the formula:
2 * #cores * frequency = 2 * 2496 * 706 = 3524352 Mflops = 2 * 2496 * 706 / 1024 / 1024 Tflops ≈ 3.361 Tflops
Comment: the factor 2 counts the floating-point multiply and the floating-point add of each FFMA.
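
The arithmetic above can be checked with a short script. This is only a sketch: the function name `peak_sp_mflops` is made up for illustration, the inputs are the deviceQuery values quoted in this post, and it assumes every core retires one FFMA per cycle.

```python
def peak_sp_mflops(num_sms, cores_per_sm, clock_mhz):
    """Theoretical peak single-precision rate in MFLOP/s.

    Assumption: each core completes one FFMA per cycle,
    and one FFMA is counted as 2 flops (multiply + add).
    """
    return 2 * num_sms * cores_per_sm * clock_mhz


# Values reported by deviceQuery for the Tesla K20m in this post.
mflops = peak_sp_mflops(num_sms=13, cores_per_sm=192, clock_mhz=706)

print(mflops)                           # 3524352
print(round(mflops / 1000 / 1000, 3))   # 3.524 (divide by 1000 twice)
print(round(mflops / 1024 / 1024, 3))   # 3.361 (divide by 1024 twice)
```

The two divisors give results that differ by about 5%, which is the gap in question.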

However, when I use the Visual Profiler, it reports 3.522 TeraFLOP/s.
My driver is CUDA 7.0; my OS is Red Hat Enterprise Linux Server release 6.3 (Santiago).

I do not know how to paste pictures, so I will just list a few items from the Visual Profiler here.

Single Precision FLOP/s 3.522 TeraFLOP/s
Double Precision FLOP/s 1.174 TeraFLOP/s
Multiprocessors 13
Clock Rate 705.5 MHz

So it seems that the Visual Profiler uses 1000 as the divisor when computing TFLOP/s. Is this a bug?
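
One way to test that guess is to recompute the profiler's figures from the 705.5 MHz clock it lists, using a power-of-ten divisor. This is a rough sketch, not how the profiler actually computes them; the 1:3 double-to-single ratio is an assumption inferred from the two listed numbers.

```python
CORES = 13 * 192        # 2496 CUDA cores, from deviceQuery
CLOCK_HZ = 705.5e6      # clock rate as listed by the profiler

# 2 flops (multiply + add) per FFMA, one FFMA per core per cycle.
sp_flops = 2 * CORES * CLOCK_HZ

# Assumption: double-precision peak is one third of single precision,
# which is the ratio between the two figures listed above.
dp_flops = sp_flops / 3

sp_tflops = round(sp_flops / 10**12, 3)
dp_tflops = round(dp_flops / 10**12, 3)

print(sp_tflops)   # 3.522
print(dp_tflops)   # 1.174
```

Both values match the listed figures when the divisor is 10^12.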

Here is output of deviceQuery
[zxx@ga87 deviceQuery]$ ./deviceQuery
Device 0: “Tesla K20m”
CUDA Driver Version / Runtime Version 7.0 / 7.0
CUDA Capability Major/Minor version number: 3.5
Total amount of global memory: 4800 MBytes (5032706048 bytes)
(13) Multiprocessors, (192) CUDA Cores/MP: 2496 CUDA Cores
GPU Max Clock rate: 706 MHz (0.71 GHz)
Memory Clock rate: 2600 Mhz
Memory Bus Width: 320-bit
L2 Cache Size: 1310720 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
Maximum Layered 1D Texture Size, (num) layers 1D=(16384), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(16384, 16384), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 2 copy engine(s)
Run time limit on kernels: No
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Enabled
Device supports Unified Addressing (UVA): Yes
Device PCI Domain ID / Bus ID / location ID: 0 / 132 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 7.0, CUDA Runtime Version = 7.0, NumDevs = 1, Device0 = Tesla K20m
Result = PASS


Why are you dividing by 1024? The SI standardized decimal prefixes are all powers of ten: “M” = 10^6, “G” = 10^9, “T” = 10^12, “P” = 10^15.

For measuring memory capacity, a different set of binary prefixes has been standardized: “Mi” = 2^20, “Gi” = 2^30, “Ti” = 2^40, “Pi” = 2^50. See also: https://en.wikipedia.org/wiki/Binary_prefix
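
To illustrate the two prefix families, here is a small sketch that scales a raw value by either kind of prefix. The helper `scale` and the two lookup tables are hypothetical names for this example; the inputs are the global memory size from the deviceQuery output above and the single-precision rate implied by a 705.5 MHz clock.

```python
# SI (decimal) prefixes are powers of ten; binary prefixes are powers of two.
SI = {"k": 10**3, "M": 10**6, "G": 10**9, "T": 10**12, "P": 10**15}
BINARY = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40, "Pi": 2**50}


def scale(value, prefix):
    """Express value in units of the given SI or binary prefix."""
    table = BINARY if prefix.endswith("i") else SI
    return value / table[prefix]


global_mem_bytes = 5032706048   # from the deviceQuery output
sp_flops = 2 * 2496 * 705.5e6   # flops per second at 705.5 MHz

print(scale(global_mem_bytes, "Mi"))          # 4799.5625 (MiB)
print(round(scale(global_mem_bytes, "M"), 1)) # 5032.7 (MB)
print(round(scale(sp_flops, "T"), 3))         # 3.522 (TFLOP/s)
```

The "4800 MBytes" printed by deviceQuery is consistent with the binary value rounded to a whole number, while the FLOP/s figure uses the decimal prefix.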

Sorry, I was taught in class that M, G, and T are base 1024.
I was not aware that TFLOP/s uses the SI standard.
Thanks njuffa for clarifying it.