Can someone please comment on the double precision support in CUDA for these cards? If I bought one of these cards would double precision be available out of the box with CUDA 2.0, or is that something I would have to wait a few months before that support arrived?

Important to understand, indeed. In my view, the DP design was an excellent sweet-spot decision. If all FPUs were double, they’d take more die space, giving you fewer of them and therefore fewer FLOPS. Yet double precision can be useful, in fact sometimes critical. Having BOTH lifts a lot of algorithm restrictions, especially by making certain crucial core computations at least possible on the GPU.

An example: in raytracing, ray directions are totally fine in single precision. WORLD coordinate positions pretty much have to be doubles. But within a small model region (say a model voxel), single precision is fine for relative ray/triangle positions. So you might do a world transformation with a double compute, and then switch to single precision for your voxel traversal and intersections. This technique is used even in commercial CPU tracing algorithms… and now can be used in GPUs as well.
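
To make the world-to-voxel idea concrete, here is a minimal sketch in plain C. The struct and function names are made up for illustration and are not from any real tracer. It only shows the order of operations: subtract the voxel origin in double, then narrow the small result to float.

```c
/* Hypothetical types: world positions in double, voxel-local ones in float. */
typedef struct { double x, y, z; } Vec3d;
typedef struct { float x, y, z; } Vec3f;

/* Subtract in double first, then narrow. The difference is small,
   so converting it to float loses very little. */
Vec3f world_to_local(Vec3d p, Vec3d voxel_origin)
{
    Vec3f r;
    r.x = (float)(p.x - voxel_origin.x);
    r.y = (float)(p.y - voxel_origin.y);
    r.z = (float)(p.z - voxel_origin.z);
    return r;
}

/* For comparison only: narrow first, then subtract. The large world
   coordinates are rounded to float before the subtraction, so the
   rounding error ends up in the local position. */
Vec3f world_to_local_naive(Vec3d p, Vec3d voxel_origin)
{
    Vec3f r;
    r.x = (float)p.x - (float)voxel_origin.x;
    r.y = (float)p.y - (float)voxel_origin.y;
    r.z = (float)p.z - (float)voxel_origin.z;
    return r;
}
```

Around a world coordinate of 1,000,000 the spacing between adjacent floats is 0.0625, so the naive version can misplace a point by a few hundredths of a unit, while the double-then-narrow version keeps the local position accurate to float precision.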

I’m looking forward to the updated 2.0 CUDA guide with all the details of the new architecture. Number of registers seems to have been doubled, which is ALSO awesome… as important as double support I feel.

I wonder if 32-bit integer mults are one-clock now; the previous docs implied that they’d be going from 4 clocks down to 1.

I wonder whether it is faster to emulate double precision with single precision or to use native double precision? Take a look at NVIDIA’s version of my Mandelbrot program in the SDK. They added the option to switch between single precision, double precision and emulated double precision (which messed up the UI a bit but I don’t mind). The program can output performance numbers for each mode. I am hoping someone with a new 280 card and the latest SDK can post the results.

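There is a 1 to 8 ratio of double precision to single precision units on the GTX 200-series cards, which sets the break-even point: emulating a double precision operation only pays off if it takes fewer than about 8 single precision operations.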

For comparison, Kahan summation requires 4 operations per element. In cases where the need is just to limit round-off error, Kahan is faster than DP.
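
For anyone who hasn't seen it, here is a minimal sketch of Kahan (compensated) summation in plain C, with the 4 floating point operations per element marked. The function names are mine. It assumes the compiler is not allowed to reassociate floating point math (no -ffast-math or similar), otherwise the compensation can be optimized away.

```c
#include <stddef.h>

/* Kahan summation in single precision: 4 float ops per element. */
float kahan_sum(const float *x, size_t n)
{
    float sum = 0.0f;
    float c = 0.0f;              /* running compensation for lost low bits */
    for (size_t i = 0; i < n; i++) {
        float y = x[i] - c;      /* op 1: apply the correction */
        float t = sum + y;       /* op 2: low bits of y are lost here */
        c = (t - sum) - y;       /* ops 3 and 4: recover what was lost */
        sum = t;
    }
    return sum;
}

/* Plain single precision sum, for comparison: 1 float op per element. */
float naive_sum(const float *x, size_t n)
{
    float sum = 0.0f;
    for (size_t i = 0; i < n; i++)
        sum += x[i];
    return sum;
}
```

Summing a million copies of 0.1f is a quick way to see the difference: the plain loop drifts visibly, while the compensated one stays within a couple of float ulps of the exact result.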

“Pseudo” double precision with two floats as implemented in dsfun90 only gets you 48 bits of mantissa rather than 53 like full double precision. Addition in that algorithm requires 11 operations because the MAD operation in CUDA has an intermediate truncation. So for that case, native double precision wins. It wins even more if you are doing a double precision multiply-add, instead of just an add.
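
To show where the 11 operations come from, here is a sketch of a two-float addition in plain C. It follows the general scheme used by libraries such as dsfun90, but it is written from scratch for illustration: the type and function names are made up, and it assumes float arithmetic is really evaluated in single precision (no extended intermediate precision, no fast-math reassociation).

```c
/* Hypothetical "double-single" number: value = hi + lo, with |lo| << |hi|. */
typedef struct { float hi, lo; } dsfloat;

/* Two-float addition: 11 single precision adds/subtracts in total. */
dsfloat ds_add(dsfloat a, dsfloat b)
{
    float t1 = a.hi + b.hi;                        /* 1 op: sum of high parts */
    float e  = t1 - a.hi;                          /* 1 op */
    float t2 = ((b.hi - e) + (a.hi - (t1 - e)))    /* 4 ops: rounding error of t1 */
               + a.lo + b.lo;                      /* 2 ops: add the low parts */
    dsfloat c;
    c.hi = t1 + t2;                                /* 1 op: renormalize */
    c.lo = t2 - (c.hi - t1);                       /* 2 ops */
    return c;
}

/* Helpers for moving values in and out of the two-float format. */
dsfloat ds_from_double(double x)
{
    dsfloat r;
    r.hi = (float)x;
    r.lo = (float)(x - (double)r.hi);
    return r;
}

double ds_to_double(dsfloat a)
{
    return (double)a.hi + (double)a.lo;
}
```

With roughly 11 single precision operations per add against a 1 to 8 unit ratio, this is the case where the native double precision unit comes out ahead.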

But double precision is much slower… so the GPUs are almost at the same speed as CPUs. With the upcoming Intel Nehalem you get four times that performance with no need to learn a new API.

double precision performance is delivered at a much more modest 100 gigaflops.

NVIDIA Unveils Teraflop GPU Computing
Michael Feldman, HPCwire Editor

NVIDIA has announced two new Tesla-branded GPU computing products at ISC’08, continuing the company’s efforts to move into the HPC market. The new products are based on NVIDIA’s next generation 10-series GPU processor architecture. The T10P processor unveiled today offers double precision floating point support, more local memory, plus much higher overall performance. NVIDIA is touting the new 10-series chip as the second generation processor for CUDA, the company’s GPU computing development platform.

The T10P, which is built on 55nm process technology, doubles the capability of the previous generation Tesla offerings, which were based on the 8-series NVIDIA architecture. The new GPU has twice the FP precision (32-bit to 64-bit) and twice the raw compute performance (500 gigaflops to 1 teraflop). It’s important to note that the teraflop figure is single precision performance; double precision performance is delivered at a much more modest 100 gigaflops.

If I assume that a single Penryn core @ 3 GHz can complete 1 SSE instruction (2 doubles) per clock, that’s 3e9 (clock) * 2 (SSE) * 4 (cores) = 24 GFLOPS of double precision. That’s not bad, but it is still slower than the figures reported for the GTX 280.
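
The same back-of-the-envelope estimate as a tiny C helper, in case anyone wants to plug in other numbers. The function name is made up, and the one-SSE-instruction-per-clock figure is the assumption from the post above, not a measured value.

```c
/* Peak throughput estimate in GFLOPS:
   clock rate * flops completed per clock per core * number of cores. */
double peak_gflops(double clock_hz, int flops_per_clock, int cores)
{
    return clock_hz * flops_per_clock * cores / 1e9;
}
```

Calling peak_gflops(3e9, 2, 4) reproduces the 24 GFLOPS figure for a quad-core at 3 GHz with 2 doubles per SSE instruction.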

Moreover, if you were doing these double precision operations on a very large array, the CPU would blow through the L2 cache pretty quickly, and then you would be stuck pulling elements through the relatively slow system memory bus.

So, there are still trade-offs. For small operations or medium-sized operations with minimal data parallelism, the CPU is easier to program and faster thanks to the fast L2 cache. For big stuff, the GPU wins by pairing floating point units with an enormous memory bus.

As an example of what seibert said, consider a simple vector addition, the good old axpy from BLAS:

for (i = 0; i < N; i++)
    y[i] += alpha * x[i];

N is (in my apps) typically really large, definitely above 1M. This operation is obviously limited in performance by memory, and there is no data reuse. On an early engineering sample of the T10P (the Tesla version of the GTX 280), I am seeing 114 GByte/s for this operation, which in single precision boils down to roughly 20 GFLOP/s and in double to roughly 10 GFLOP/s. Note that the early engineering sample might not reflect actual performance of the “real” hardware, but it should be reasonably close. The best I have seen (out of cache) on the CPU is around 1 GFLOP/s in single and 500 MFLOP/s in double, i.e. roughly 6 GByte/s.
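
For reference, here is how the bandwidth and FLOP numbers above relate, as a small C sketch. The function names are mine. It assumes each element costs three memory accesses (read x[i], read y[i], write y[i]) and two floating point operations (one multiply, one add), and that memory bandwidth is the only limit.

```c
#include <stddef.h>

/* Double precision axpy: y = y + alpha * x. */
void daxpy(size_t n, double alpha, const double *x, double *y)
{
    for (size_t i = 0; i < n; i++)
        y[i] += alpha * x[i];   /* 2 loads, 1 store, 2 flops per element */
}

/* Bandwidth-bound axpy rate in GFLOP/s.
   bandwidth is in GByte/s, bytes_per_element is 4 (float) or 8 (double). */
double axpy_gflops(double bandwidth_gbytes_per_s, size_t bytes_per_element)
{
    double elements_per_ns = bandwidth_gbytes_per_s / (3.0 * (double)bytes_per_element);
    return 2.0 * elements_per_ns;
}
```

With 114 GByte/s this gives 19 GFLOP/s for 4-byte floats and 9.5 GFLOP/s for 8-byte doubles, in line with the rounded 20 and 10 quoted above, and 6 GByte/s gives 1 GFLOP/s and 0.5 GFLOP/s for the CPU case.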

Moral: For memory-bound applications, the double precision performance of the GT200 is more than enough. For compute-bound applications, there is still room for improvement, admittedly.

Hi there
I noticed there are no math functions for double precision in the CUDA 2.0 beta. Will they be included in the final 2.0 release? Is there any example of double precision computing we can follow in the 2.0 beta?
thanks for your work!