What is the double-precision FLOPS rating of the GTX 580?

thanks
T

I haven’t seen an official figure, but I would expect it to be [1.544 GHz] * [512 CUDA Cores] * [2 double precision floating point operations/8 clock cycles] = 198 GFLOPS.

Usual disclaimers apply: Assumes a pure sequence of fused multiply-add operations with no stalls due to memory bandwidth or register dependencies.
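In case anyone wants to plug in the numbers for their own card, here is a minimal host-side sketch of that estimate (the cores-per-SM values and the divide-by-8 ratio are my assumptions for Fermi GeForce parts; neither is reported by the CUDA API):

```cpp
// Sketch: estimate peak GFLOPS from device properties on a Fermi GeForce.
// ASSUMPTION: 1/8 single-to-double ratio for consumer GF100/GF110 cards.
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    // CUDA cores per multiprocessor: 32 on CC 2.0, 48 on CC 2.1 (Fermi only).
    int coresPerSM = (prop.major == 2 && prop.minor == 1) ? 48 : 32;
    double clockGHz = prop.clockRate / 1.0e6;   // clockRate is reported in kHz

    // 2 flops per FMA per core per clock.
    double spPeak = clockGHz * prop.multiProcessorCount * coresPerSM * 2.0;
    // Assumed GeForce Fermi ratio; Tesla/Quadro Fermi would be spPeak / 2.
    double dpPeak = spPeak / 8.0;

    printf("%s: ~%.0f SP GFLOPS, ~%.0f DP GFLOPS (assuming the GeForce 1/8 ratio)\n",
           prop.name, spPeak, dpPeak);
    return 0;
}
```

For a GTX 580 this prints roughly 1581 SP / 198 DP GFLOPS, matching the hand calculation above.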

I haven’t seen official figures either, which is why I was asking. But compare your calculation to the Tesla M2090, which (I assume) uses physically the same GPU, the GF110, and is rated at 600+ DP GFLOPS. The newest ATI consumer cards also manage 600+ DP GFLOPS, and if I remember correctly the GTX 480 (GF100) also has 500+ DP GFLOPS. How can the GTX 580 only have 190?

Peak DP performance of the GTX 480 is its single-precision peak divided by 8: 1344/8 = 168 GFLOPS.

So the GTX 480 only achieves 166 GFLOPS on DGEMM, where a C2050 can reach 313 GFLOPS.

Reference: http://www.vizworld.com/2010/04/geforce-gtx-480-18supthsup-double-precision-performance/

All consumer GeForce Fermi cards are limited to 1/4 the double precision throughput of the equivalent Tesla or Quadro card. Seibert’s calculation is correct, and yours is, unfortunately, wrong. The ratio of single precision to double precision performance is 8:1 in consumer GF100 and GF110 cards.

Table 5.1 of the programming guide claims that CC 2.0 devices have a throughput of 16 double-precision operations per clock cycle per multiprocessor. So I’d expect 1.544 * (512/32) * 16 = 396 GFLOPS. Am I wrong?

It’s a shame that NVIDIA’s cards have 3 times lower double precision FLOPS than ATI’s. Especially for folks like me who want to get the most out of participating in distributed computing projects that rely on double precision calculations.

Two issues:

  • The table says that the throughput is 16 multiply-add (or just multiply or just add) instructions per clock cycle per multiprocessor. It is traditional in FLOPS-counting on CUDA devices to compute throughput using the multiply-add instruction, which counts as two floating point operations per instruction. That raises your estimate by a factor of 2.

  • The GeForce compute capability 2.0 devices are capped to 1/4 the double precision speed of their Tesla and Quadro equivalents, as aviday mentioned. (It is unfortunate that NVIDIA chose not to document this difference in compute capability 2.0 throughput in the programming guide.) That drops you back down by a factor of 4: 396 * 2 / 4 = 198 GFLOPS, the number I proposed initially. Basically, GeForce compute capability 2.0 and 2.1 devices have the same double precision throughput, whereas Tesla compute capability 2.0 has the throughput listed in the manual. If a Tesla compute capability 2.1 chip existed, then that column would probably look like the CC 2.0 column.

The GeForce limitations were discussed a lot when Fermi first came out. Here is Sumit Gupta’s list of differences between Tesla and GeForce versions of the Fermi chip:

http://forums.nvidia.com/index.php?showtopic=165055

(I would be thrilled to discover that the programming guide number applies to all CC 2.0 devices thanks to some driver update since that time, but I doubt that.)

OK, I tried to measure FLOPS directly on a 580 and I got 1354 GFLOPS in single precision and 198 GFLOPS in double precision (counting each FMA as two operations). The programming guide is misleading.
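Roughly, the benchmark kernel is just a long chain of dependent FMAs per thread (a simplified sketch, not my exact code; the launch and iteration counts are illustrative):

```cpp
// Rough sketch of a double-precision FMA throughput microbenchmark.
// Each thread runs a dependent chain of fused multiply-adds; the result is
// written out so the compiler cannot discard the work.
#include <cstdio>
#include <cuda_runtime.h>

#define ITERS 4096

__global__ void fma_dp(double *out, double b, double c)
{
    double a = threadIdx.x * 1e-9;
    #pragma unroll 64
    for (int i = 0; i < ITERS; ++i)
        a = a * b + c;                        // 1 DFMA = 2 double-precision flops
    out[blockIdx.x * blockDim.x + threadIdx.x] = a;
}

int main()
{
    const int blocks = 1024, threads = 256;
    double *d_out;
    cudaMalloc(&d_out, blocks * threads * sizeof(double));

    cudaEvent_t start, stop;
    cudaEventCreate(&start); cudaEventCreate(&stop);

    fma_dp<<<blocks, threads>>>(d_out, 0.999, 1e-7);   // warm-up
    cudaEventRecord(start);
    fma_dp<<<blocks, threads>>>(d_out, 0.999, 1e-7);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms;
    cudaEventElapsedTime(&ms, start, stop);
    double flops = 2.0 * ITERS * blocks * threads;     // 2 flops per FMA
    printf("~%.0f DP GFLOPS\n", flops / (ms * 1e6));
    cudaFree(d_out);
    return 0;
}
```

The single-precision number comes from the same kind of loop with floats, counting each FFMA as two operations.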

Let me see what I can get out of a 6970…

Edit: after some tweaking and vectorization, the 6970 reaches 1970 GFLOPS single precision and 636 GFLOPS double precision.

I wonder why the ATI cards don’t perform better in games, since their GFLOP numbers are so much better.

Edit: I know each of AMD’s stream cores has 4 fp (or general? I don’t know) processing elements, so a sustained ILP depth of 4 is required for games to utilize the cores fully. Maybe this could be a reason. I was also guessing that certain limitations of AMD’s 8-byte VLIW may hurt the stream cores’ multi-issue capability.

Their GFLOP numbers are much better in double precision. Nobody uses double precision in games (up until very recently ATI didn’t even have double precision).

In single precision, the difference between 1970 and 1354 GFLOPS can be cancelled out by architectural and compiler deficiencies. For example, getting full performance on ATI requires full utilization of all 4 processing units in each VLIW. Unless you consciously vectorize your code, or tweak it going back and forth between your code and the output assembly, you could easily end up with a utilization factor of 0.75 (0.75 * 1970 ≈ 1480 GFLOPS), and that eats most of your theoretical advantage in peak FLOPS.

Most scientific applications are memory bandwidth limited, not FLOP limited. So the performance reduction from using doubles is usually only a factor of 2 (due to twice as many bytes being loaded/stored).
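For example, a simple streaming update never gets anywhere near the FLOP limit; a rough illustration (not code from this thread, just a standard DAXPY-style kernel):

```cpp
// Illustration of a bandwidth-bound kernel: a DAXPY-style update does 2 flops
// per element but moves 24 bytes (two 8-byte loads, one 8-byte store) in double
// precision, or 12 bytes in single precision.
#include <cuda_runtime.h>

__global__ void daxpy(int n, double a, const double *x, double *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = a * x[i] + y[i];   // 2 flops vs. 24 bytes of traffic per element
}

int main()
{
    const int n = 1 << 24;
    double *x, *y;
    cudaMalloc(&x, n * sizeof(double));
    cudaMalloc(&y, n * sizeof(double));
    cudaMemset(x, 0, n * sizeof(double));
    cudaMemset(y, 0, n * sizeof(double));

    daxpy<<<(n + 255) / 256, 256>>>(n, 2.0, x, y);
    cudaDeviceSynchronize();

    cudaFree(x);
    cudaFree(y);
    return 0;
}
```

At roughly 190 GB/s of memory bandwidth, the double version tops out around (2/24) * 190 ≈ 16 GFLOPS, far below the 198 GFLOPS DP peak, so the only real cost of the doubles is moving twice as many bytes.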

I tried to benchmark the instruction throughput for single precision floating point and 32-bit integer operations on my GTX 580. I got some very weird results:

-32 bit integer multiply-add: 814 GFLOPS

-single precision multiply-add: 1422 GFLOPS

The multiply-add instruction was, of course, counted as two operations.

Next I tried just the multiply instruction with the following results:

-32 bit integer multiply: 407 GFLOPS

-single precision multiply: 813 GFLOPS

and just the add instruction, with the following results:

-32 bit integer add: 407 GFLOPS

-single precision add: 812 GFLOPS

As you can see, somehow the throughput of 32-bit integer instructions is roughly half that of the 32-bit floating point instructions (and thereby half of the throughput reported in 5.4.1 of the Programming Guide), regardless of which operation is chosen for benchmarking.

Any ideas on this?
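For reference, the inner loops I’m timing boil down to something like this (a stripped-down sketch rather than my exact benchmark; launch and iteration counts are illustrative):

```cpp
// Sketch of the integer-vs-float multiply-add comparison. Each kernel runs a
// dependent chain of multiply-adds; each MAD/FMA is counted as two operations.
#include <cstdio>
#include <cuda_runtime.h>

#define ITERS 4096

__global__ void imad_chain(int *out, int b, int c)
{
    int a = threadIdx.x;
    #pragma unroll 64
    for (int i = 0; i < ITERS; ++i)
        a = a * b + c;                                   // 32-bit integer multiply-add
    out[blockIdx.x * blockDim.x + threadIdx.x] = a;
}

__global__ void fmad_chain(float *out, float b, float c)
{
    float a = threadIdx.x * 1e-9f;
    #pragma unroll 64
    for (int i = 0; i < ITERS; ++i)
        a = a * b + c;                                   // single-precision FMA
    out[blockIdx.x * blockDim.x + threadIdx.x] = a;
}

int main()
{
    const int blocks = 1024, threads = 256;
    const double ops = 2.0 * ITERS * blocks * threads;   // 2 ops per MAD/FMA
    int *ibuf;  float *fbuf;
    cudaMalloc(&ibuf, blocks * threads * sizeof(int));
    cudaMalloc(&fbuf, blocks * threads * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start); cudaEventCreate(&stop);
    float ms;

    imad_chain<<<blocks, threads>>>(ibuf, 3, 7);          // warm-up
    cudaEventRecord(start);
    imad_chain<<<blocks, threads>>>(ibuf, 3, 7);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&ms, start, stop);
    printf("integer MAD: ~%.0f Gops/s\n", ops / (ms * 1e6));

    fmad_chain<<<blocks, threads>>>(fbuf, 0.999f, 1e-7f); // warm-up
    cudaEventRecord(start);
    fmad_chain<<<blocks, threads>>>(fbuf, 0.999f, 1e-7f);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&ms, start, stop);
    printf("float FMA:   ~%.0f Gflops/s\n", ops / (ms * 1e6));

    cudaFree(ibuf); cudaFree(fbuf);
    return 0;
}
```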

Yes, I did the exact same test recently. If you read the programming guide (page ~98, I think) on instruction throughput, you’ll see that integer multiply-add throughput is only 16 instead of 32 (i.e. it takes 2 instructions to perform an integer multiply-add).

EDIT: Has anyone tested how fast mul24 is on the Fermi architecture?

Take a look at this forum topic: throughput of integer add. Integer addition throughput is also discussed there, as it appears to be half of floating point addition throughput. Unfortunately it turns out to be a bug in nvcc 4.0, so you can try nvcc 3.2 or wait until it is fixed (a bug report has already been sent in).

@seibert:

I know this is an old thread, but bear with me.

Is there any official statement that the GeForce cards of CC 2.0 and 2.1 have the same fp64 throughput? Or, equivalently, that they are capped to 1/4 of the throughput of their Tesla and Quadro equivalents?

I am not disputing that this is the case, but has there been official recognition given somewhere, other than the ambiguous “(*) Throughput is lower for GeForce GPUs” note in the programming guide?