K20m theoretical peak single-precision FLOP/s from Visual Profiler seems to be wrong

Dear All,
I want to test the throughput of the FFMA instruction on a K20m GPU, so I need to compute its theoretical peak single-precision FLOP/s.

My K20m GPU has 13 SMs with 192 cores each, so there are 2496 cores in total.
The clock frequency reported by deviceQuery is 706 MHz.
By the formula (the factor of 2 counts one multiply and one add per FFMA):
2 × #cores × frequency = 2 × 2496 × 706 = 3,524,352 MFLOP/s = 3,524,352 / 1024 / 1024 ≈ 3.361 TFLOP/s
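The arithmetic above can be checked with a short script (a sketch; the figure of 192 CUDA cores per SM is specific to compute capability 3.5):

```python
# Theoretical peak single-precision FLOP/s for a Tesla K20m
# (compute capability 3.5: 192 CUDA cores per SM).
sms = 13
cores_per_sm = 192
clock_mhz = 706  # GPU clock reported by deviceQuery

# Factor of 2: one multiply plus one add per FFMA instruction.
mflops = 2 * sms * cores_per_sm * clock_mhz
print(mflops)                # 3524352 MFLOP/s
print(mflops / 1e6)          # 3.524352 TFLOP/s (SI: divide by 1000^2)
print(mflops / 1024 / 1024)  # ~3.361 TFLOP/s (binary: divide by 1024^2)
```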

However, Visual Profiler reports 3.522 TFLOP/s.
My CUDA version is 7.0; my OS is Red Hat Enterprise Linux Server release 6.3 (Santiago).

I do not know how to paste pictures, so I will just list a few items from Visual Profiler here.

Maximums
single precision FLOP/s 3.522TeraFLOP/s
Double Precision FLOP/s 1.174 TeraFLOP/s
Multiprocessor
Multiprocessors 13
Clock Rate 705.5 MHz

So it seems that Visual Profiler divides by 1000 when computing TFLOP/s. Is that a bug?

Here is the output of deviceQuery:
[zxx@ga87 deviceQuery]$ ./deviceQuery
Device 0: "Tesla K20m"
CUDA Driver Version / Runtime Version 7.0 / 7.0
CUDA Capability Major/Minor version number: 3.5
Total amount of global memory: 4800 MBytes (5032706048 bytes)
(13) Multiprocessors, (192) CUDA Cores/MP: 2496 CUDA Cores
GPU Max Clock rate: 706 MHz (0.71 GHz)
Memory Clock rate: 2600 Mhz
Memory Bus Width: 320-bit
L2 Cache Size: 1310720 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
Maximum Layered 1D Texture Size, (num) layers 1D=(16384), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(16384, 16384), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 2 copy engine(s)
Run time limit on kernels: No
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Enabled
Device supports Unified Addressing (UVA): Yes
Device PCI Domain ID / Bus ID / location ID: 0 / 132 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 7.0, CUDA Runtime Version = 7.0, NumDevs = 1, Device0 = Tesla K20m
Result = PASS

Thanks,
Xiuxia

Why are you dividing by 1024? The SI standardized decimal prefixes are all powers of ten: "M" = 10^6, "G" = 10^9, "T" = 10^12, "P" = 10^15.

For measuring memory capacity, a different set of binary prefixes has been standardized: "Mi" = 2^20, "Gi" = 2^30, "Ti" = 2^40, "Pi" = 2^50. See also: https://en.wikipedia.org/wiki/Binary_prefix
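As a quick sanity check, plugging the profiler's own reported clock of 705.5 MHz into the formula with decimal prefixes reproduces both of its figures (a sketch; the 1/3 DP-to-SP ratio assumed here reflects GK110's 64 DP units per 192-core SMX):

```python
# Reproduce Visual Profiler's numbers using decimal (SI) prefixes
# and the 705.5 MHz clock it reports.
cores = 2496
clock_mhz = 705.5

# Single precision: 2 FLOPs per FFMA per core per cycle.
sp = 2 * cores * clock_mhz / 1e6
print(round(sp, 3))  # 3.522 TFLOP/s

# Double precision: GK110 has 64 DP units per SMX, 1/3 of the SP cores.
dp = 2 * (cores // 3) * clock_mhz / 1e6
print(round(dp, 3))  # 1.174 TFLOP/s
```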

Sorry, I was taught in class that M, G, and T are base-1024 prefixes.
I was not aware that TFLOP/s uses the SI standard.
Thanks njuffa for clarifying it.