On cards with compute capability below 1.3, the CUDA compiler silently demotes doubles to floats for you. That is why the programming guide (and the NVIDIA folks in technical presentations) keep stressing that you must use single precision literals (0.0f instead of 0.0) if you want portable single precision code: otherwise you implicitly rely on that demotion, and on double-precision hardware the same code suddenly runs much slower, because the arithmetic is actually done in double (and the results are stored back into floats, obviously).
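To make the pitfall concrete, here is a minimal sketch (kernel name and layout are just made up for illustration); the only difference between the two statements is the literal, but on double-capable hardware the first one drags the whole expression into double precision:

```cuda
// Illustration of the literal-promotion pitfall (names are hypothetical).
__global__ void scale(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // 0.5 is a double literal: data[i] gets promoted to double,
        // the multiply runs in double precision on compute 1.3+ cards,
        // and the result is cast back down to float on the store.
        data[i] = data[i] * 0.5;

        // 0.5f keeps the whole expression in single precision on all cards:
        // data[i] = data[i] * 0.5f;
    }
}
```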
Someone recently posted a neat emulated-precision template header; you might want to search the forums for Bailey or dsfun90. It gives you an s46e8 type (roughly 46 significand bits with a float's 8-bit exponent), which is about halfway between a float and a double.
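For the curious, the core trick in those headers is "double-single" arithmetic: each value is carried as an unevaluated sum of two floats, and error-free transformations (Knuth/Dekker style) recover the rounding error at every step. A minimal sketch of the addition, assuming a float2 with .x as the high word and .y as the low word (the function name is just illustrative, not the header's actual API):

```cuda
// Sketch of double-single addition (two floats per value), in the spirit
// of the dsfun90-derived headers. a.x/b.x are the high parts, a.y/b.y the low parts.
__device__ float2 ds_add(float2 a, float2 b)
{
    // Two-sum: t1 is the rounded sum of the high parts,
    // t2 collects the rounding error plus both low parts.
    float t1 = a.x + b.x;
    float e  = t1 - a.x;
    float t2 = ((b.x - e) + (a.x - (t1 - e))) + a.y + b.y;

    // Renormalize so the result is again a (high, low) pair.
    float2 c;
    c.x = t1 + t2;
    c.y = t2 - (c.x - t1);
    return c;
}
```

Note that real implementations typically have to guard these expressions against compiler reassociation and MAD contraction (e.g. via intrinsics like __fadd_rn), otherwise the error terms get optimized away.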
Just so you know, the demotion also means double* does not behave sensibly on pre-1.3 hardware: each load reads four bytes instead of eight and reinterprets them as a float rather than doing a double → float conversion, so reading a buffer of actual 8-byte doubles just doesn't work at all.
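So if your host data lives in doubles, the portable pattern is to narrow on the host and only ever hand a float* to the kernel. A minimal sketch using the plain runtime API (function name and error handling omitted are my own, purely illustrative):

```cuda
#include <cuda_runtime.h>
#include <vector>

// Copy host doubles to the device as floats, so the kernel only ever
// sees a float* regardless of the card's compute capability.
float *upload_as_float(const double *host, size_t n)
{
    std::vector<float> tmp(n);
    for (size_t i = 0; i < n; ++i)
        tmp[i] = static_cast<float>(host[i]);   // explicit narrowing on the host

    float *dev = 0;
    cudaMalloc((void **)&dev, n * sizeof(float));
    cudaMemcpy(dev, tmp.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    return dev;
}
```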