what is the double-precision flops rating of the gtx580?

I haven’t seen an official figure, but I would expect it to be [1.544 GHz] * [512 CUDA Cores] * [2 double precision floating point operations/8 clock cycles] = 198 GFLOPS.

Usual disclaimers apply: Assumes a pure sequence of fused multiply-add operations with no stalls due to memory bandwidth or register dependencies.
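For anyone following along, here is that arithmetic as a tiny script (the numbers are the ones quoted above; `peak_gflops` is just an illustrative helper, not any real API):

```python
def peak_gflops(clock_ghz, cores, flops_per_core_per_clock):
    """Theoretical peak only: ignores memory bandwidth and dependency stalls."""
    return clock_ghz * cores * flops_per_core_per_clock

# GTX 580: 1.544 GHz shader clock, 512 CUDA cores,
# assumed 2 DP FLOPs (one FMA) every 8 clocks per core.
dp = peak_gflops(1.544, 512, 2.0 / 8.0)
print(round(dp))  # 198
```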

Haven’t seen official figures either, which is why I was asking. But compare your calculation to the Tesla 2090, which (I assume) uses the physically identical GPU, the GF110, and is rated at 600+ DP GFLOPS. The newest ATI consumer cards also manage 600+ DP GFLOPS, and if I remember correctly, the GTX 480 (GF100) has 500+ DP GFLOPS. How can the GTX 580 only have ~198?

Peak DP performance of the GTX 480 is 1344/8 = 168 GFLOPS.

So the GTX 480 only achieves 166 GFLOPS on DGEMM, where a C2050 can reach 313 GFLOPS.

Reference: http://www.vizworld.com/2010/04/geforce-gtx-480-18supthsup-double-precision-performance/

All consumer GeForce Fermi cards are limited to 1/4 the double precision throughput of the equivalent Tesla or Quadro card. Seibert’s calculation is correct, and yours is, unfortunately, wrong. The ratio of single precision to double precision performance is 8:1 on consumer GF100 and GF110 cards.

Table 5.1 of the programming guide says that CC 2.0 devices have a throughput of 16 double-precision operations per clock cycle per multiprocessor. So I’d expect 1.544 * (512/32) * 16 = 396 GFLOPS. Am I wrong?

It’s a shame that NVIDIA’s cards have 3 times lower double precision FLOPS than ATI’s. Especially for folks (like me) who want to get the most out of participating in distributed computing projects that rely on double precision calculations.

Two issues:

  • The table says that the throughput is 16 multiply-add (or just multiply, or just add) instructions per clock cycle per multiprocessor. It is traditional in FLOPS-counting on CUDA devices to compute throughput using the multiply-add instruction, which counts as two floating point operations per instruction. That raises your estimate by a factor of 2.

  • The GeForce compute capability 2.0 devices are capped somehow to 1/4 the double precision speed of their Tesla and Quadro equivalents, as aviday mentioned. (It is unfortunate that NVIDIA chose not to document this GeForce/Tesla discrepancy in the compute capability 2.0 throughput table of the programming guide.) That drops you back down by a factor of 4, giving the number I proposed initially. Basically, GeForce compute capability 2.0 and 2.1 devices have the same double precision throughput, whereas Tesla compute capability 2.0 has the throughput listed in the manual. If a Tesla compute capability 2.1 chip existed, its column would probably look like the CC 2.0 column.
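Putting the two corrections together as a quick sketch (the ×2 and ÷4 factors are the ones just described; the uncapped figure assumes, hypothetically, a Tesla-class part running at the GTX 580’s 1.544 GHz clock):

```python
# Reconcile the Table 5.1 estimate with the 198 GFLOPS figure:
# start from 16 DP instructions/clock/SM (CC 2.0), count each FMA
# as two FLOPs, then apply the 1/4 GeForce cap described above.
clock_ghz = 1.544          # GTX 580 shader clock
sms = 512 // 32            # 16 multiprocessors
dp_instr_per_clk = 16      # programming guide, CC 2.0

uncapped_gflops = clock_ghz * sms * dp_instr_per_clk * 2  # FMA = 2 FLOPs
geforce_gflops = uncapped_gflops / 4                      # consumer cap

print(round(uncapped_gflops), round(geforce_gflops))  # 791 198
```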

The GeForce limitations were discussed a lot when Fermi first came out. Here is Sumit Gupta’s list of differences between Tesla and GeForce versions of the Fermi chip:


(I would be thrilled to discover that the programming guide number applies to all CC 2.0 devices thanks to some driver update since that time, but I doubt that.)

OK, I tried to measure FLOPS directly on a 580 and I got 1354 GFLOPS in single precision and 198 GFLOPS in double precision (counting each FMA as two operations). The programming guide is misleading.

Let me see what I can get out of a 6970…

edit: after some tweaking and vectorization, 6970 reaches 1970 GFLOPS and 636 GFLOPS single/double precision.

I wonder why the ATI cards don’t perform better in games, since their GFLOP numbers are so much better.

Edit: I know each of AMD’s stream cores has 4 processing elements (fp, or general-purpose? I don’t know). So a sustained ILP depth of 4 is required for games to utilize the cores fully. Maybe this could be a reason. I was also guessing that certain limitations of AMD’s 8-byte VLIW encoding may hurt the stream cores’ multi-issue capability.

Their GFLOP numbers are much better in double precision. Nobody uses double precision in games (up until very recently ATI didn’t even have double precision).

In single precision, the difference between 1970 vs 1354 GFLOPS can be cancelled out by architectural & compiler deficiencies. For example, getting full performance on ATI requires full utilization of all 4 processing units in each VLIW. Unless you consciously vectorize your code, or tweak it going back and forth between your code and the output assembly, you could easily end up with the utilization factor of 0.75, and that eats most of your theoretical advantage in peak FLOPS.
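As a rough sanity check of that utilization argument (using only the measured numbers quoted in this thread; the 0.75 utilization factor is the assumption):

```python
amd_peak_sp = 1970   # 6970 measured SP peak, GFLOPS (from above)
nv_peak_sp = 1354    # GTX 580 measured SP peak, GFLOPS (from above)

# If only 3 of the 4 VLIW slots are busy on average:
utilization = 0.75
effective_amd = amd_peak_sp * utilization

print(effective_amd)               # 1477.5
print(effective_amd / nv_peak_sp)  # ~1.09: most of the gap is gone
```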

Most scientific applications are memory bandwidth limited, not FLOP limited. So the performance reduction due to using doubles is usually only factor of 2 (due to twice the bytes being loaded/stored).
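A sketch of why that works out to 2× (the 192 GB/s figure is my assumption for a GTX 580-class card; the model treats the kernel as purely streaming, one load and one store per element):

```python
def stream_time_s(n, bytes_per_elem, bandwidth_gbs, accesses_per_elem=2):
    """Runtime of a bandwidth-bound kernel: bytes moved / bandwidth."""
    return n * bytes_per_elem * accesses_per_elem / (bandwidth_gbs * 1e9)

n = 10_000_000
t_float = stream_time_s(n, 4, 192.0)   # 4-byte floats
t_double = stream_time_s(n, 8, 192.0)  # 8-byte doubles

print(t_double / t_float)  # 2.0 -- doubles cost 2x, not 8x
```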

I tried to benchmark the instruction throughput for single precision floating point and 32 bit integer operations on my GTX580. I’ve got some very weird results:

-32 bit integer multiply-add: 814 GOPS

-single precision multiply-add: 1422 GFLOPS

The multiply-add instruction was counted of course as two operations.

Next I tried just the multiply instruction with the following results:

-32 bit integer multiply: 407 GOPS

-single precision multiply: 813 GFLOPS

and just the add instruction with the results

-32 bit integer add: 407 GOPS

-single precision add: 812 GFLOPS

As you can see, the throughput of 32-bit integer instructions is somehow roughly half that of the 32-bit floating point instructions (and thereby half the throughput reported in 5.4.1 of the Programming Guide), regardless of which operation is chosen for benchmarking.

Any idea on this?

Yes, I did the exact same test recently. If you read the programming guide (page ~98, I think) on instruction throughput, you’ll see that 32-bit integer multiply-add throughput is only 16 instead of 32 (i.e. it effectively takes two instruction slots to perform an integer multiply-add).

EDIT: Has anyone tested how fast the mul24 instruction is on the Fermi architecture?

Take a look at this forum topic: throughput of integer add. Integer addition throughput is also discussed there, as it appears to be half of floating point addition throughput. Unfortunately it turns out to be a bug in nvcc 4.0, so you can try nvcc 3.2 or wait until it is fixed (a bug report has already been sent in).


I know this is an old thread, but bear with me.

Is there any official statement that the GeForce cards of CC 2.0 and 2.1 have the same fp64 throughput? Or, equivalently, that they are capped to 1/4 of the throughput of their Tesla and Quadro equivalents?

I am not disputing that this is the case, but has there been official recognition given somewhere? (other than the ambiguous “(*) Throughput is lower for GeForce GPUs” in the programming guide).