GTX 280, CUDA and Double Precision

Boxed_Cylon · June 17, 2008, 1:00am

I see that newegg.com now has the GTX 280 for sale (ca. $650):
Desktop Graphics Cards | Video Cards - Newegg.com: http://www.newegg.com/Product/ProductList....tk=&srchInDesc=
(link fixed 6/17/08)

Can someone please comment on the double precision support in CUDA for these cards? If I bought one of these cards, would double precision be available out of the box with CUDA 2.0, or would I have to wait a few months for that support to arrive?

Thx.,
B.-C.

netllama · June 17, 2008, 1:07am

Full support will be available in the upcoming release (this month).

Ailleur · June 17, 2008, 2:16am

I have read (AnandTech and Ars Technica) that only one double-precision unit is available per SM.

I guess it is worth noting before jumping on the buy button.

SPWorley · June 17, 2008, 2:29am

Important to understand, indeed. In my view, the DP design decision hit an excellent sweet spot. If all FPUs were double precision, they’d take more die space, giving you fewer of them… so fewer FP FLOPS. Yet double precision can be useful, in fact sometimes critical. Having BOTH lifts a lot of algorithm restrictions, especially by making certain crucial core computations at least possible on the GPU.

An example: in raytracing, ray directions are totally fine in single precision. WORLD coordinate positions pretty much have to be doubles. But within a small model region (say a model voxel), single precision is fine for relative ray/triangle positions. So you might do the world transformation in double precision, and then switch to single precision for your voxel traversal and intersections. This technique is used even in commercial CPU tracing algorithms… and now can be used on GPUs as well.
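A minimal C sketch of that double-to-single hand-off, under assumptions of my own: the vec3d and vec3f types and the world_to_local name are hypothetical, and a real tracer would apply a full transform rather than a plain offset. The point is only that the subtraction happens in double, so the small voxel-local result survives the narrowing to float.

```c
/* Hypothetical types for illustration; not taken from any particular renderer. */
typedef struct { double x, y, z; } vec3d;  /* world-space position */
typedef struct { float  x, y, z; } vec3f;  /* voxel-local position */

/* Subtract the voxel origin in double precision, then narrow to float.
   The difference is small in magnitude, so float can represent it well. */
static vec3f world_to_local(vec3d p, vec3d voxel_origin)
{
    vec3f r;
    r.x = (float)(p.x - voxel_origin.x);
    r.y = (float)(p.y - voxel_origin.y);
    r.z = (float)(p.z - voxel_origin.z);
    return r;
}
```

For a point at 1.0e7 + 0.125 and a voxel origin at 1.0e7, this keeps the 0.125 offset. Converting both values to float first would lose it, because floats near 1.0e7 are spaced 1.0 apart.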

I’m looking forward to the updated 2.0 CUDA guide with all the details of the new architecture. Number of registers seems to have been doubled, which is ALSO awesome… as important as double support I feel.

I wonder if 32-bit integer multiplies are now one-clock; the previous docs implied that they’d be going from 4 clocks down to 1.

koon · June 17, 2008, 6:59am

Thanks. I have now ordered one from Newegg :)

I agree with SPWorley; the single/double mix on GT200 looks nice.

Another example: in some numerical methods, running in single precision for a while and switching to double precision for the last steps before convergence works fine.
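One way to picture the single-then-double idea, as a sketch only: Newton's method for a square root is my choice of example here, not something named in the post, and the function name and iteration counts are arbitrary assumptions. The input is assumed to be positive and within normal float range.

```c
/* Square root of a > 0 by Newton's method: x <- (x + a/x) / 2.
   Most iterations run in float; only the last two run in double.
   Illustrative only: the iteration counts are arbitrary choices. */
static double mixed_precision_sqrt(double a)
{
    float xf = (float)a;                 /* single-precision phase */
    for (int i = 0; i < 30; i++)
        xf = 0.5f * (xf + (float)a / xf);

    double x = (double)xf;               /* double-precision polish */
    for (int i = 0; i < 2; i++)
        x = 0.5 * (x + a / x);
    return x;
}
```

Because Newton's method roughly doubles the number of correct digits per step, a float-accurate starting point needs only a couple of double-precision steps to reach full double accuracy.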

grangerfx · June 17, 2008, 9:49pm

I wonder whether it is faster to emulate double precision with single precision or to use native double precision? Take a look at NVidia’s version of my Mandelbrot program in the SDK. They added the option to switch between single precision, double precision and emulated double precision (which messed up the UI a bit but I don’t mind). The program can output performance numbers for each mode. I am hoping someone with a new 280 card and the latest SDK can output the results.

-Mark Granger

seibert · June 18, 2008, 3:21am

There is a 1-to-8 ratio of double-precision to single-precision units on the GTX 200-series cards (one DP unit alongside eight SP units in each SM), which sets the break-even point.

For comparison, Kahan summation requires 4 operations per element. In cases where the need is just to limit round-off error, Kahan is faster than DP.
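For reference, a plain C sketch of Kahan summation in single precision, with the four floating-point operations per element numbered in the comments. The function name is mine, and the code assumes the compiler is not allowed to reassociate floating-point math (for example, no -ffast-math), since that would optimize the compensation away.

```c
#include <stddef.h>

/* Kahan (compensated) summation in single precision.
   Four floating-point additions/subtractions per element. */
static float kahan_sum(const float *x, size_t n)
{
    float sum = 0.0f;
    float c = 0.0f;              /* running compensation for lost low-order bits */
    for (size_t i = 0; i < n; i++) {
        float y = x[i] - c;      /* op 1: apply the pending correction */
        float t = sum + y;       /* op 2: low-order bits of y may be lost here */
        c = (t - sum) - y;       /* ops 3 and 4: recover what was lost */
        sum = t;
    }
    return sum;
}
```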

“Pseudo” double precision with two floats as implemented in dsfun90 only gets you 48 bits of mantissa rather than 53 like full double precision. Addition in that algorithm requires 11 operations because the MAD operation in CUDA has an intermediate truncation. So for that case, native double precision wins. It wins even more if you are doing a double precision multiply-add, instead of just an add.
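To make the operation count concrete, here is a C sketch of two-float ("double-single") addition written in the style of the dsfun90 add routine. The dsfloat type and ds_add name are hypothetical, and this is a CPU illustration rather than the SDK's or dsfun90's actual code. Counting the additions and subtractions below gives 11.

```c
/* A value stored as an unevaluated sum hi + lo of two floats. */
typedef struct { float hi, lo; } dsfloat;

/* Double-single addition: 11 float additions/subtractions in total. */
static dsfloat ds_add(dsfloat a, dsfloat b)
{
    float t1 = a.hi + b.hi;                  /* 1 op: leading sum */
    float e  = t1 - a.hi;                    /* 1 op */
    /* 6 ops: rounding error of t1, plus both low-order parts */
    float t2 = ((b.hi - e) + (a.hi - (t1 - e))) + a.lo + b.lo;

    dsfloat c;
    c.hi = t1 + t2;                          /* 1 op: renormalize */
    c.lo = t2 - (c.hi - t1);                 /* 2 ops */
    return c;
}
```

Adding 1.0e-10 to 1.0 this way leaves hi at 1.0 and keeps the small term in lo, whereas a plain float addition would drop it entirely.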

lasse1 · June 18, 2008, 5:36am

But double precision is much slower… so the GPUs are almost at the same speed as CPUs. With the upcoming Intel Nehalem you get four times that performance with no need to learn a new API.

double precision performance is delivered at a much more modest 100 gigaflops.

NVIDIA Unveils Teraflop GPU Computing
Michael Feldman, HPCwire Editor

NVIDIA has announced two new Tesla-branded GPU computing products at ISC’08, continuing the company’s efforts to move into the HPC market. The new products are based on NVIDIA’s next generation 10-series GPU processor architecture. The T10P processor unveiled today offers double precision floating point support, more local memory, plus much higher overall performance. NVIDIA is touting the new 10-series chip as the second generation processor for CUDA, the company’s GPU computing development platform.

The T10P, which is built on 55nm process technology, doubles the capability of the previous generation Tesla offerings, which were based on the 8-series NVIDIA architecture. The new GPU has twice the FP precision (32-bit to 64-bit) and twice the raw compute performance (500 gigaflops to 1 teraflop). It’s important to note that the teraflop figure is single precision performance; double precision performance is delivered at a much more modest 100 gigaflops.

seibert · June 18, 2008, 11:28am

If I assume that a single Penryn core @ 3 GHz can complete 1 SSE instruction (2 doubles) per clock, that’s 3e9 (clock) * 2 (SSE) * 4 (cores) = 24 GFLOPS of double precision. That’s not bad, but it still is slower than the reports of the GTX 280.
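The back-of-the-envelope estimate above can be written as a one-line C helper. The function name and parameters are mine; the inputs are the post's assumptions (3 GHz, 2 double-precision flops per clock per core, 4 cores), not measured values.

```c
/* Peak rate in GFLOPS = clock (GHz) x flops per clock per core x cores.
   This is a theoretical upper bound, not a measured figure. */
static double peak_gflops(double clock_ghz, double flops_per_clock, int cores)
{
    return clock_ghz * flops_per_clock * cores;
}
```

With the numbers above, peak_gflops(3.0, 2.0, 4) gives 24.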

Moreover, if you were doing these double precision operations on a very large array, the CPU would blow through the L2 cache pretty quickly, and then you would be stuck pulling elements down through the relatively slow system memory bus.

So, there are still trade-offs. For small operations or medium-sized operations with minimal data parallelism, the CPU is easier to program and faster thanks to the fast L2 cache. For big stuff, the GPU wins by pairing floating point units with an enormous memory bus.

e.ping · June 18, 2008, 12:17pm

As an example for what seibert said, consider a simple vector addition, the good old axpy from blas:

for (i = 0; i < N; i++)
    y[i] += alpha * x[i];

N is (in my apps) typically really large, definitely above 1M. This operation is obviously limited in performance by memory, and there is no data reuse. On an early engineering sample of the T10P (the Tesla version of the GTX280), I am seeing 114 GByte/s for this operation, which in single precision boils down to 20 GFLOP/s and in double to 10 GFLOP/s. Note that the early engineering sample might not reflect actual performance of the “real” hardware, but it should be reasonably close. The best I have seen (out of cache) on the CPU is around 1GFLOP/s in single and 500 MFLOP/s in double, i.e. roughly 6 GByte/s.
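The bandwidth-to-FLOP/s conversion used here can be sketched in C as follows. The function name is mine, and the model is an assumption: each element moves three values through memory (read x, read y, write y) and costs two flops, with no cache reuse.

```c
/* Memory-bound axpy model: each element moves three values
   (read x[i], read y[i], write y[i]) and performs two flops
   (one multiply, one add). Returns GFLOP/s for a given
   sustained bandwidth in GByte/s. */
static double axpy_gflops(double bandwidth_gbytes_per_s, double bytes_per_value)
{
    double gelements_per_s = bandwidth_gbytes_per_s / (3.0 * bytes_per_value);
    return 2.0 * gelements_per_s;
}
```

At 114 GByte/s this gives 19 GFLOP/s for 4-byte floats and 9.5 GFLOP/s for 8-byte doubles, in line with the rounded 20 and 10 quoted above. At 6 GByte/s it gives 1 GFLOP/s and 0.5 GFLOP/s, matching the CPU figures.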

Moral: For memory-bound applications, the double precision performance of the GT200 is more than enough. For compute-bound applications, there is still room for improvement, admittedly.

kaoken · June 22, 2008, 6:43pm

I believe they can do 4 DP flops per clock per core, giving a peak of 48 GFLOPS.

It’s difficult to get good efficiency, however, and I believe GT200 will beat current quad-cores for many DP apps.

halyavin · June 22, 2008, 8:11pm

I think the best solution is to use both GPU and CPU simultaneously. This requires very complex programming of course.

nasacort · July 2, 2008, 5:52pm

Can I use the GTX280 or 260 on a motherboard with PCIE 1 as opposed to PCIE 2?

Thanks.

tmurray · July 2, 2008, 5:56pm

Yes, PCIe 2 is backwards compatible with PCIe 1. (I’m doing it right now)

darkstorm · July 17, 2008, 3:32pm

Hi there
I noticed there are no math functions for double precision in the CUDA 2.0 beta. Will they be included in the final 2.0 release? Is there any example of double-precision computing we can follow in the 2.0 beta?
thanks for your work!

seibert · July 17, 2008, 4:21pm

What math functions are missing? I’ve used double-precision exp() and sincos() in my code.
