Which program can best measure the single- and double-precision floating-point speed of a GPU?

Which program can best measure the single-precision and double-precision floating-point speed of a GPU?

Thanks a lot!!

Which GPU can best measure the performance of a program?

Our cards are Tesla C1060

There is no “speed of a GPU”. Some algorithms can achieve nearly theoretical performance and some cannot; for others it may be possible, but nobody yet knows how to implement them on a GPU in a way that does.
You can use the matrix multiplication routines from CUBLAS to check single and double precision performance, but those numbers will really only be relevant to matrix multiplication.
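To turn a timed matrix multiplication into a GFLOP/s figure, a minimal sketch like the following can help. It assumes the usual convention of counting 2·m·n·k floating-point operations for C = A·B (one multiply and one add per inner-product term). The function name `gemm_gflops` is hypothetical, and the elapsed time would have to come from your own timing of the CUBLAS call (for example `cublasSgemm` for single precision or `cublasDgemm` for double precision).

```python
def gemm_gflops(m, n, k, seconds):
    """Return GFLOP/s for C = A (m x k) * B (k x n) that took `seconds`.

    Assumes the conventional count of 2*m*n*k floating-point operations:
    one multiply and one add for each term of each inner product.
    """
    if seconds <= 0:
        raise ValueError("elapsed time must be positive")
    flops = 2.0 * m * n * k
    return flops / seconds / 1e9
```

Comparing the single and double precision results for the same matrix size gives a ratio that applies to matrix multiplication on that card, not to the card in general.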

The theoretical maximum single precision GFLOPS for a C1060 is 933 and the theoretical maximum double precision GFLOPS is 78. I am not aware of any program or benchmark that can output these values. Seibert’s pdfs code from this thread outputs 622.1 (single precision) “BogoGFLOPS” for the Tesla C1060:

[root@bdgpu-n16 pdfs]# nvcc -arch=sm_13 pdfs.cu -o pdfs_13

[root@bdgpu-n16 pdfs]# ./pdfs_13

Device name: Tesla C1060

BogoGFLOPS: 622.1

Single precision: time = 322.980 ms, efficiency metric = 11.74

Double precision: time = 1642.388 ms, efficiency metric = 2.31

Atomic abuse: time = 3.987 ms, events/sec = 38525.4, events/sec/bogoGFLOP = 61.93

The SDK nbody benchmark gives 394 GFLOP/s:

[root@bdgpu-n16 pdfs]# /usr/local/cuda_sdk/C/bin/linux/release/nbody -benchmark

Run "nbody -benchmark [-n=<numBodies>]" to measure perfomance.

30720 bodies, total time for 100 iterations: 4790.068 ms

= 19.702 billion interactions per second

= 394.031 GFLOP/s at 20 flops per interaction
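The figures in that output can be reproduced by hand. The sketch below assumes the benchmark counts all-pairs interactions (numBodies squared per iteration) and uses the 20 flops per interaction stated in the output. The function name `nbody_gflops` is made up for illustration.

```python
def nbody_gflops(num_bodies, iterations, total_ms, flops_per_interaction=20):
    """Return (interactions per second, GFLOP/s) for an all-pairs n-body run."""
    # Assumption: every body interacts with every body in each iteration.
    interactions = num_bodies * num_bodies * iterations
    interactions_per_sec = interactions / (total_ms / 1000.0)
    gflops = interactions_per_sec * flops_per_interaction / 1e9
    return interactions_per_sec, gflops


# Numbers taken from the nbody -benchmark output above.
ips, gflops = nbody_gflops(30720, 100, 4790.068)
print(f"{ips / 1e9:.3f} billion interactions per second")
print(f"{gflops:.3f} GFLOP/s at 20 flops per interaction")
```

Note that the GFLOP/s value depends entirely on the assumed flop count per interaction, so it is a derived number rather than a direct measurement.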

I would also be interested in having the equivalent of an HPL metric for GPUs. I know avidday has worked on this.

Here is an old benchmark that got very near theoretical peak FMAD throughput back in the day: http://forums.nvidia.com/index.php?showtop…mp;#entry250179
I haven’t tested it on recent architectures.

edit: pasted correct link

Note that I called them “BogoGFLOPS” because I computed those values by doing 2 * [shader clock rate] * [# of stream processors]. It’s not a benchmark, just some math to normalize results from different cards. I don’t believe the dual-issue MUL is that useful in the GT200, which is why I use a factor of 2 there and not 3.
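As a rough sketch of that arithmetic: the clock rate (1.296 GHz) and stream processor count (240) below are assumptions taken from the published C1060 specifications, not values read from the pdfs code, and the function name `peak_gflops` is hypothetical.

```python
def peak_gflops(shader_clock_ghz, stream_processors, flops_per_clock=2):
    """Peak GFLOPS = flops per clock per processor * clock (GHz) * processors."""
    return flops_per_clock * shader_clock_ghz * stream_processors


# Assumed C1060 specifications (not taken from the thread).
C1060_SHADER_CLOCK_GHZ = 1.296
C1060_STREAM_PROCESSORS = 240

# Factor 2: one MAD (multiply-add) per clock, the "BogoGFLOPS" value.
bogo = peak_gflops(C1060_SHADER_CLOCK_GHZ, C1060_STREAM_PROCESSORS, 2)
# Factor 3: MAD plus the dual-issued MUL, the headline figure.
headline = peak_gflops(C1060_SHADER_CLOCK_GHZ, C1060_STREAM_PROCESSORS, 3)

print(f"MAD only:  {bogo:.1f} GFLOPS")
print(f"MAD + MUL: {headline:.1f} GFLOPS")
```

With those assumed specifications, the factor of 2 gives the 622.1 reported by the program and the factor of 3 gives the 933 quoted as the theoretical maximum.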

Yeah, that was a marketing trick back then, and now it is a bit of a marketing problem. According to an NVIDIA person I spoke to a while ago, SP performance only went from 933 to roughly 1200 GFLOPS, while for non-benchmark algorithms the achievable peak went from 622 to 1200, or almost a doubling :)

Denis, what do you mean by “non-benchmark algorithms”?

This is something I wondered about, and I guess I’ll have to wait for a C2050 to test it: if Tesla is supposed to have 933 GFLOPS and Fermi should deliver ~2x the performance, how can it be that the specs say 1.03 TFLOPS?

Marketing stuff?

thanks

eyal

The C1060 delivers 933 benchmark GFLOPS. You can write a benchmark program that gets very close to this number by issuing a MAD followed by a MUL, many times over.

The C1060 delivers 622 GFLOPS if you do just MADs, so that is the best you can do in a normal program. I don’t think many people have ever hit the MAD+MUL bonus in real life.

S2050 delivers almost twice those 622 GFLOPS (they don’t speak about the dual-issuing of a MAD and MUL anymore).

So the GFLOPS number of the S2050 is a lot more achievable in real life than the one from the C1060. In practice it is almost a doubling; in marketing speak it is only a mild gain.
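The gap between the two views can be put into numbers using only figures quoted in this thread: 933 and 622 GFLOPS for the C1060, and the 1.03 TFLOPS spec mentioned for Fermi. This is plain arithmetic on quoted peak values, not a measurement, and the variable names are made up for illustration.

```python
# Peak figures quoted in this thread, in GFLOPS.
C1060_HEADLINE = 933.0   # MAD + dual-issued MUL
C1060_MAD_ONLY = 622.0   # MAD only, what a normal program can approach
FERMI_SPEC = 1030.0      # the 1.03 TFLOPS figure from the Fermi specs

headline_gain = FERMI_SPEC / C1060_HEADLINE
mad_only_gain = FERMI_SPEC / C1060_MAD_ONLY

print(f"Spec sheet to spec sheet: {headline_gain:.2f}x")
print(f"Against MAD-only peak:    {mad_only_gain:.2f}x")
```

Using the roughly 1200 GFLOPS figure mentioned earlier in the thread instead of 1030 pushes the second ratio closer to 2.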