Double precision GFlops of Kepler

Hi everyone,

Does anyone know the double-precision performance, in GFLOPS, of the GTX680 and GT640M?

Many thanks!

Table 5-1 of the Programming Guide shows it is 8/192 (or 1/24) of the single-precision throughput, so about 128 GFLOP/s for the GTX680 and “up to” 20 GFLOP/s for the GT640M.

Do you know what the theoretical double-precision performance per watt is, compared to the 470 or 480 series?

Maybe the documented double-precision throughput divided by the documented power consumption? It shouldn’t be too difficult to look up those values.

0.88 GFLOPS/W for the GTX580
0.66 GFLOPS/W for the GTX680

Yeah, it’s quite crippled. What I find quite strange, though, is that the throughput of integer operations has also been crippled: the 32-bit integer add runs at only 87.5% of full rate, shift/compare at 1/12 (!), and mul/mad at 1/24 (!!!). This will hurt kernels that are heavy on integer arithmetic.

I guess I’ll start replacing integer multiplications by 2 or 3 with additions…

I’m wondering where the 8/192 came from; which version of the Programming Guide is that? All I can find is version 4.1 of the C Programming Guide, and I don’t see that ratio in its table. Aside from that, the GeForce-GTX-680-Whitepaper-FINAL.pdf gives these GFLOPS figures on page 6: GT200 (Tesla) 1063, GF110 (Fermi) 1581, GK104 (Kepler) 3090, which are wildly higher than what you calculated. Granted, it’s a whitepaper, but it should be in the right ballpark.

Using the same chart, and to answer another question above, I calculate 6.48 GFLOPS/W for Fermi and 22.3 for Kepler.

I guess you missed the part about double-precision GFLOPS; the whitepaper figures are single precision…

Does anyone have any idea whether the DP or the integer performance will get any better in the future? As things stand, the GTX680 is 1.5x slower than the GTX580 on DP ops and integer multiplications, and 6x slower on integer shifts/compares.

Oh, and has that ‘8’ in table 5-1 been verified to apply to the GTX680? The table shows 16 double-precision ops per multiprocessor for compute capability 2.0, and, as I recall, the 16 only applied to Tesla/Quadro cards; regular gaming cards were crippled to 4 double-precision ops per multiprocessor.

NVIDIA has a long history of not discussing unannounced products, so you aren’t going to get an answer here. Your guess is as good as anyone else’s.

Sure, but only if 100% of your instructions are of those types. Real-world apps have a mix of instructions. I wouldn’t overanalyze and claim that the 680 is bad for compute until we have a number of real-world CUDA application benchmarks (which we are sorely lacking…). I don’t have a 680 yet, or I would be posting such benchmarks myself.

I’m doing a lot of testing now and it’s very, very bad; much worse than it needs to be. I’m using sm_30 with the latest toolkit. I think they’re trying to push all the compute folks onto extremely expensive, more profitable cards. If true: AMD, here I come!