My question is: how is DP achieved on NV hardware that supports it? Are half of the ALUs DP and the rest SP? Or are hidden registers used to perform DP operations on SP units? I have heard that hardware without native DP support can still run DP code, but that it silently uses SP operations instead, so SP acts as a fallback when DP is not available. Is this true?

The implementation varies depending on card generation. The GTX 200 series (compute capability 1.3) was reported to have a single, dedicated DP unit per multiprocessor, separate from the 8 CUDA cores that handled single precision, integers, etc. With one DP unit against eight SP cores, this was the origin of the 8:1 speed difference between single and double precision.

Fermi cards (compute capability 2.0) use a different implementation that joins two CUDA cores to perform a double precision operation. Presumably there is some extra coordinating circuitry that makes this possible. This gives Fermi GPUs a 2:1 speed difference, in principle, between single and double precision. This full-performance DP is only available on the Tesla cards, while the GeForce cards are capped at an 8:1 ratio. Compute capability 2.1 modified the 2.0 design slightly by adding 16 more CUDA cores per multiprocessor (bringing the total up to 48) that cannot perform double precision instructions at all. This skews the ratio a little further, to 12:1 between single and double precision, assuming maximum efficiency for scheduling single precision operations.
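To make the ratios concrete, here is a small C++ sketch of the arithmetic. The per-clock throughput figures are assumptions back-derived from the ratios quoted above (for example, 32 cores capped at 8:1 implies 4 DP results per clock); they are not vendor-verified numbers, and the struct and function names are made up for illustration.

```cpp
#include <cassert>

// Per-multiprocessor, per-clock throughput. The numbers used below are
// assumptions derived from the ratios in the text, not measured values.
struct SmThroughput {
    const char* name;
    int sp_per_clock;  // single precision results per clock
    int dp_per_clock;  // double precision results per clock
};

// Ratio of single to double precision throughput; 8 means "8:1".
inline int sp_to_dp_ratio(const SmThroughput& sm) {
    assert(sm.dp_per_clock > 0);
    return sm.sp_per_clock / sm.dp_per_clock;
}

// Hypothetical helpers, one per configuration discussed above.
inline SmThroughput gtx200()        { return {"CC 1.3 (GTX 200)", 8, 1}; }
inline SmThroughput fermi_tesla()   { return {"CC 2.0 (Tesla)", 32, 16}; }
inline SmThroughput fermi_geforce() { return {"CC 2.0 (GeForce, capped)", 32, 4}; }
inline SmThroughput fermi_cc21()    { return {"CC 2.1", 48, 4}; }
```

Calling sp_to_dp_ratio(fermi_cc21()) gives 12: the 16 extra cores raise the single precision numerator while the double precision rate stays where it was.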

Note that devices without hardware double precision (compute capability 1.0-1.2) don't emulate it either. When you compile for those targets, the CUDA compiler (no idea how this is handled in OpenCL) simply demotes all your double precision variables to single precision and prints a warning letting you know what happened.
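What that demotion costs numerically can be shown on the host. The sketch below is plain C++, not CUDA, and does not reproduce the compiler's behavior; it only checks whether a small increment survives being added to 1 in a given type. The function name is hypothetical.

```cpp
#include <cassert>

// Returns true if adding `eps` to 1 changes the value when the arithmetic
// is carried out in type Real. The volatile qualifiers keep the compiler
// from evaluating the expression in a wider type or folding it away.
template <typename Real>
bool survives_addition(double eps) {
    assert(eps > 0.0);
    volatile Real one = static_cast<Real>(1.0);
    volatile Real sum = one + static_cast<Real>(eps);
    return sum != one;
}
```

With eps = 1e-10, survives_addition<double> returns true and survives_addition<float> returns false, since single precision carries only about 7 decimal digits. Code that relies on increments of that size changes its results when its doubles are demoted.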

You can manually emulate higher precision floating point in software, of course. A common approach, found in the old dsfun90 library, uses two single precision variables whose exponents are offset so that one acts as an extension of the mantissa of the other. True double precision has 53 bits of mantissa, but this double-single trick only gives you 48 bits (two 24-bit single precision mantissas). Addition is the fastest double-single operation, and that still takes about 17 instructions. Multiplication is much slower. As a result, even the relatively slow hardware double precision found in compute capability 2.1 is better than trying to emulate it this way.
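For illustration, here is a minimal host-side C++ sketch of double-single addition in the style of that trick. It is not the dsfun90 source: the struct and function names are made up here, and it assumes strict IEEE single precision evaluation (no fast-math flags, no x87 excess precision), because the error-recovery steps break if the compiler reassociates them.

```cpp
#include <cassert>
#include <cmath>

// A "double-single" value: hi carries the leading bits, lo the remainder.
struct DoubleSingle {
    float hi;
    float lo;
};

// Split a double into a hi/lo float pair (helper for this sketch only).
inline DoubleSingle ds_from_double(double d) {
    assert(std::isfinite(d));
    float hi = static_cast<float>(d);
    float lo = static_cast<float>(d - static_cast<double>(hi));
    return {hi, lo};
}

// Recombine the pair into a double so the result can be inspected.
inline double ds_to_double(DoubleSingle a) {
    return static_cast<double>(a.hi) + static_cast<double>(a.lo);
}

// Double-single addition using only single precision operations.
inline DoubleSingle ds_add(DoubleSingle a, DoubleSingle b) {
    float t1 = a.hi + b.hi;              // rounded sum of the high parts
    float e  = t1 - a.hi;                // portion of b.hi that made it into t1
    // Rounding error of t1, plus the two low parts.
    float t2 = ((b.hi - e) + (a.hi - (t1 - e))) + a.lo + b.lo;
    float hi = t1 + t2;                  // renormalize into a new hi/lo pair
    float lo = t2 - (hi - t1);
    return {hi, lo};
}
```

Adding ds_from_double(1.0) and ds_from_double(1e-9) with ds_add keeps the small term in the lo word, whereas the plain single precision sum 1.0f + 1e-9f rounds back to 1.0f. The cost is visible in the code: one logical addition expands into a chain of dependent single precision operations.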