I assume the reason that number is not multiplied by 8 is that ALL (current?) NVIDIA GPUs have 8 cores per processor (saving a multiplication instruction).

Anyway, did I calculate the following correctly?

9800GT’s FLOP speed is 14 multiprocessors * 8 cores/processor * 1.5GHz per core = 175GFLOPS.

Also, is there any way to get the relative FLOP speed of a CPU in this context? After much searching, I got the impression that even if there were a similar way to calculate how many MFLOPS/GFLOPS a CPU is capable of, it wouldn’t be an accurate comparison to that of a GPU. Is this correct? Or is there a way to get the value after all?

All CUDA devices can finish one multiply-add instruction (a * b + c) per clock cycle per streaming processor, and a multiply-add counts as 2 floating-point operations. That gives a maximum of 2 * [# of multiprocessors] * 8 streaming processors/multiprocessor * [shader clock rate] FLOPS. (In your example, I would get 2 * 14 * 8 * 1.5 = 336 GFLOPS.) In addition, the GTX 200 series cards can dual-issue a multiply concurrently with the multiply-add, giving a 50% boost to the theoretical max. (The older cards technically could have done this as well, but hardware limitations meant that they didn’t take advantage of the dual-issue option very often.)
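To make the arithmetic above easy to check, here is a minimal Python sketch of that formula. The function name `peak_gflops` and its parameter names are made up for illustration; the defaults (8 streaming processors per multiprocessor, 2 FLOPs per clock for a multiply-add) come from the post above, and the clock is taken in GHz so the result comes out in GFLOPS.

```python
def peak_gflops(multiprocessors, shader_clock_ghz,
                cores_per_mp=8, flops_per_clock=2):
    """Theoretical peak in GFLOPS (hypothetical helper, not a CUDA API).

    flops_per_clock=2 counts one multiply-add per clock per streaming
    processor; flops_per_clock=3 adds the dual-issued multiply.
    """
    return flops_per_clock * multiprocessors * cores_per_mp * shader_clock_ghz


# 9800 GT figures from the question: 14 multiprocessors at 1.5 GHz.
mad_only = peak_gflops(14, 1.5)

# Same inputs with the dual-issued multiply counted, purely to show
# the 50% boost as arithmetic.
dual_issue = peak_gflops(14, 1.5, flops_per_clock=3)

print(f"MAD only:   {mad_only:.0f} GFLOPS")
print(f"Dual issue: {dual_issue:.0f} GFLOPS")
```

Both numbers are upper bounds on paper, not something a real kernel should be expected to hit.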

CPUs are complex beasts, and computing their theoretical FLOPS rate is hard (and the result is just as ridiculously optimistic as in the GPU case). I’ll let someone else answer this… :)

Yes, there is no fundamental difference between the GeForce and Tesla GPUs. The C870-series Tesla had the same GPU as the 8800 GTX, and the Tesla C1060 has pretty much the same GPU as the GTX 285. The only differences seem to be the amount of RAM (more), clock rates (slightly lower), QA (more), and price (higher).

You can reach 2/3 of peak performance easily if you run some (probably artificial) calculations entirely in registers. We achieved 599 GFlop/s on a GTX 280 and 479 GFlop/s on a GTX 260 on such examples. That is about 4% less than 2/3 of the theoretical speed.
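As a rough sanity check of the "about 4% less than 2/3" figure, here is a small Python sketch for the GTX 280 number. The 599 GFlop/s measurement is from the post above; the hardware figures (30 multiprocessors, 1.296 GHz shader clock) are my assumption about a stock GTX 280 and are not stated in this thread, so substitute your own card's values.

```python
# Assumed stock GTX 280 specs (not from this thread): 30 multiprocessors,
# 8 streaming processors each, 1.296 GHz shader clock.
MULTIPROCESSORS = 30
CORES_PER_MP = 8
SHADER_CLOCK_GHZ = 1.296

# Theoretical peak counting multiply-add plus the dual-issued multiply
# (3 FLOPs per clock per streaming processor).
theoretical_peak = 3 * MULTIPROCESSORS * CORES_PER_MP * SHADER_CLOCK_GHZ

# 2/3 of peak is what multiply-add alone can deliver.
two_thirds_of_peak = theoretical_peak * 2 / 3

measured = 599.0  # GFlop/s reported above for the GTX 280

# Relative shortfall of the measurement against the 2/3 target.
shortfall = 1 - measured / two_thirds_of_peak

print(f"theoretical peak: {theoretical_peak:.1f} GFLOPS")
print(f"2/3 of peak:      {two_thirds_of_peak:.1f} GFLOPS")
print(f"shortfall:        {shortfall:.1%}")
```

Under those assumed specs the shortfall lands a little under 4%, which is consistent with the figure quoted above.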

In most real applications with massive BLAS-3 calls it is hardly possible to reach 250 GFlop/s on a GTX 260. We got it only for one special tensor preconditioner for 3D Laplace on a tensor grid:

The 8800/9800 can supposedly do MADD + MUL in one clock, but it rarely if ever happens, and a MADD is all you can count on. So for compute 1.0 and 1.1 hardware, the 3 out front should be a 2 in this calculation. If you want the full history, search the forums with Google. It has been discussed many times in gory detail.

The highest GFLOP value achievable by a CUDA program is about 70% of the theoretical value. At least, I got ~700 GFLOPS on my GTX 285 and 222 GFLOPS on an 8800 GTS 640. These results were reached with a synthetic MAD test sequence. See http://cuda-z.sourceforge.net/ for details.