So far the double precision support has worked as advertised, but the description in the Programming Guide is lacking in some details. Some additions I would like to see:

Quantification of the performance tradeoffs for double vs. single precision arithmetic. The guide states how many clock cycles single precision operations require, but no guidance is given for double precision. Is it twice as slow? More? Is the fused multiply-add for doubles a single hardware instruction, as it is for singles?

Is there intermediate truncation in the fused multiply-add operation for doubles, as there is for singles? This affects implementing extended precision arithmetic by combining doubles together. (Not an immediate need of mine, but I bumped into this same issue when faking double precision using pairs of singles in CUDA.)

Are there no intrinsic double precision functions (equivalents of __sinf(x), etc.)? I assume not, since rummaging through the headers shows implementations of cos(x), sin(x), etc. that use argument reduction plus a polynomial approximation over the reduced interval. But I figured I would check. :)
Also, one more suggestion: I foresee some of our code having both native-double and pseudo-double (pair-of-singles) versions of kernels to support both >= sm13 and < sm13 architectures. In these kinds of kernels, I would rather not have automatic double-to-float demotion for pre-sm13 targets. Instead, it would be better for the compiler to throw an error or warning. Can this be added as a flag to nvcc?
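For context, the pseudo-double path I mean is the usual "double-single" representation: a value stored as an unevaluated sum of two floats. A minimal host-side sketch in plain C (the type and function names are my own; on the device this also requires that the compiler not contract these operations into truncating FMADs, which is exactly the intermediate-truncation issue above):

```c
#include <math.h>

/* A value represented as hi + lo, two singles with |lo| << |hi|.
   Names (dsfloat, ds_*) are illustrative, not from any CUDA header. */
typedef struct { float hi, lo; } dsfloat;

static dsfloat ds_from_double(double x) {
    dsfloat r;
    r.hi = (float)x;
    r.lo = (float)(x - (double)r.hi);  /* residual the float lost */
    return r;
}

/* Knuth's two-sum, extended to the pair representation
   (the classic double-single add used by DSFUN-style libraries). */
static dsfloat ds_add(dsfloat a, dsfloat b) {
    float s = a.hi + b.hi;
    float v = s - a.hi;
    float e = (a.hi - (s - v)) + (b.hi - v);  /* rounding error of s */
    e += a.lo + b.lo;
    dsfloat r;
    r.hi = s + e;
    r.lo = e - (r.hi - s);  /* renormalize the pair */
    return r;
}

static double ds_to_double(dsfloat a) {
    return (double)a.hi + (double)a.lo;
}
```

Silent double-to-float demotion is especially dangerous next to code like this: a demoted kernel compiles and runs, but quietly delivers single precision results where the pseudo-double version was supposed to take over.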
I’m imagining something like -Wno-double-convert, which would report any time a double precision variable, constant, or function was used when the target arch does not support it.