Handling Double Precision Operations

A few questions about double-precision support:

I read here that these devices have one single-precision ALU per core, but only one double-precision ALU per eight cores. Is this accurate for every device of compute capability 1.3 or higher? I’m using a machine with four Tesla C1060s.

If I compile with “-arch=sm_13”, will that use the double-precision ALU?
Also, is it possible to do double-precision operations using only the single-precision ALUs?
Would there be a speed increase/decrease by doing this?
Are there any libraries already created to do this?

For double precision variables, yes.
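
As a quick illustration (the file and kernel names here are mine, not from the thread): a kernel that declares double variables only keeps them as doubles when you build for sm_13 or higher; otherwise nvcc demotes them to float and prints a warning.

```
// minimal sketch -- compile with:  nvcc -arch=sm_13 axpy_double.cu
// (without -arch=sm_13 or higher, nvcc demotes double to float and warns)
__global__ void axpy(int n, double a, const double *x, double *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = a * x[i] + y[i];   // executed on the DP unit when built for sm_13
}
```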

You can use various tricks (e.g. Kahan summation). But it’ll probably just be faster to have the warps shuffle through the DP unit.

There is a form of arithmetic called “double-single”, which emulates higher precision using several single-precision variables. It’s not true double precision: two floats together carry only 48 bits of mantissa, whereas a true double has 53 bits. Addition and multiplication each take somewhere between 11 and 17 single-precision instructions in double-single; more complex operations are dramatically slower. Given that GeForce compute capability 1.3 and 2.0 devices run double precision at 1/8 the single-precision rate, using doubles directly is still faster. (Tesla C2050/70 run double precision natively at 1/2 the single-precision rate.) I’m pretty sure someone has ported parts of the dsfun90 library (an old Fortran library that implements double-single) to CUDA. Search for it with Google and I bet you’ll find it.
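
To make the idea concrete, here is a rough sketch of double-single addition based on Knuth’s two-sum; the ds_add name and the float2 packing are my own for illustration and may not match the dsfun90 port exactly. It relies on single-precision adds rounding to nearest, so aggressive reordering of the arithmetic by the compiler would break the error recovery.

```
// A double-single value is a pair (hi, lo) whose value is hi + lo,
// with |lo| no more than about half an ulp of hi.
__device__ float2 ds_add(float2 a, float2 b)
{
    // Knuth's two-sum: add the high words and recover the rounding error.
    float s = a.x + b.x;
    float v = s - a.x;
    float e = (a.x - (s - v)) + (b.x - v);

    // Fold in the low words, then renormalize the result.
    e += a.y + b.y;
    float hi = s + e;
    float lo = e - (hi - s);
    return make_float2(hi, lo);   // roughly a dozen SP instructions total
}
```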

Kahan summation requires 4 floating-point operations per item added to the total, so where that algorithm is adequate, it is faster than native double precision except on the new Teslas.
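
For reference, a sketch of the per-thread inner loop (the kahan_sum name is illustrative):

```
// Kahan (compensated) summation: 4 flops per element -- one to add the
// term and three to track the low-order bits lost by that add.
__device__ float kahan_sum(const float *x, int n)
{
    float sum = 0.0f;   // running total
    float c   = 0.0f;   // running compensation for lost low-order bits
    for (int i = 0; i < n; ++i) {
        float y = x[i] - c;    // apply the stored correction to the next term
        float t = sum + y;     // low bits of y are lost here...
        c = (t - sum) - y;     // ...and captured back into c
        sum = t;
    }
    return sum;
}
```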