Professor Jack Dongarra talks about BLAS and CUDA:
It looks like the traditional LAPACK/BLAS developers at U. Tenn. have adapted to GPU calculations like a fish to water. I suspect we can eventually look to them for a complete set of optimized LAPACK/BLAS subroutines.
Dongarra talks about speeding up calculations by doing most of the work in single precision and then finishing with a little double precision to reach a final, accurate answer. This seems to be a new strategy. It is consistent with the design of the GTX 260/280: mostly single precision, with some double-precision capability. It seems clear that the best use of the GPU is to go as far as one can in single precision and invoke double precision only as needed.
Given that single precision is somewhere around 10x as fast as double precision on the GT200 cards (or so I’ve read), I think that is a good strategy. From a (mathematical) optimization standpoint, solving the system in single precision gives an approximation to the double-precision solution, and once you are that close, it should take only a few iterations of some refinement algorithm to reach the best double-precision solution.
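To make the idea concrete, here is a minimal CPU-side sketch in plain C of what I understand the scheme to be (usually called mixed-precision iterative refinement): solve in float, compute the residual in double, solve for a correction in float, and repeat. The names `solve_single`, `refine_solve` and `MAX_N` are made up for this illustration and do not come from the talk or from any library. A real implementation would factor the matrix once and reuse the factors for every correction; this sketch simply re-runs the elimination to stay short.

```c
#include <assert.h>
#include <math.h>
#include <stddef.h>

#define MAX_N 16 /* illustrative size limit, chosen only for this sketch */

/* Solve A x = b entirely in single precision using Gaussian elimination
   with partial pivoting. A is n x n, row-major. Returns 0 on success,
   -1 if a zero pivot is found. */
static int solve_single(size_t n, const double *A, const double *b, float *x)
{
    float M[MAX_N][MAX_N + 1]; /* augmented matrix [A | b] in float */
    assert(n > 0 && n <= MAX_N);
    for (size_t i = 0; i < n; ++i) {
        for (size_t j = 0; j < n; ++j)
            M[i][j] = (float)A[i * n + j];
        M[i][n] = (float)b[i];
    }
    for (size_t k = 0; k < n; ++k) {
        size_t p = k; /* row holding the largest pivot candidate */
        for (size_t i = k + 1; i < n; ++i)
            if (fabsf(M[i][k]) > fabsf(M[p][k]))
                p = i;
        if (M[p][k] == 0.0f)
            return -1;
        if (p != k)
            for (size_t j = k; j <= n; ++j) {
                float t = M[k][j];
                M[k][j] = M[p][j];
                M[p][j] = t;
            }
        for (size_t i = k + 1; i < n; ++i) {
            float f = M[i][k] / M[k][k];
            for (size_t j = k; j <= n; ++j)
                M[i][j] -= f * M[k][j];
        }
    }
    for (size_t i = n; i-- > 0;) { /* back substitution */
        float s = M[i][n];
        for (size_t j = i + 1; j < n; ++j)
            s -= M[i][j] * x[j];
        x[i] = s / M[i][i];
    }
    return 0;
}

/* Mixed-precision refinement: the solves run in float, while the residual
   and the accumulated solution are kept in double. Returns the number of
   correction steps taken, or -1 if the matrix is singular or the residual
   does not drop below tol within max_iter corrections. */
int refine_solve(size_t n, const double *A, const double *b, double *x,
                 int max_iter, double tol)
{
    double r[MAX_N];
    float d[MAX_N];
    assert(n > 0 && n <= MAX_N);
    if (solve_single(n, A, b, d) != 0)
        return -1;
    for (size_t i = 0; i < n; ++i)
        x[i] = d[i]; /* single-precision first guess */
    for (int it = 0;; ++it) {
        double rmax = 0.0;
        for (size_t i = 0; i < n; ++i) { /* r = b - A x, in double */
            double s = b[i];
            for (size_t j = 0; j < n; ++j)
                s -= A[i * n + j] * x[j];
            r[i] = s;
            if (fabs(s) > rmax)
                rmax = fabs(s);
        }
        if (rmax <= tol)
            return it;
        if (it == max_iter)
            return -1;
        if (solve_single(n, A, r, d) != 0) /* correction, in float */
            return -1;
        for (size_t i = 0; i < n; ++i)
            x[i] += d[i];
    }
}
```

On a small, well-conditioned system I would expect this to reach double-precision accuracy after only a correction or two, which is the "few iterations" point above. For badly conditioned matrices the refinement may not converge at all, so treat it as a sketch under those assumptions rather than a general solver.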
I think it might even be beneficial for the algorithm being ported. When porting large algorithms to CUDA (from MATLAB), I have encountered overflows that turned out to come from fragile pieces of code that were bound to overflow in double precision sooner or later anyway. In some cases there are alternative algorithms that have no such problems (even in single precision), so your code ends up more stable as a result. It is a good thing to be ‘forced’ to think about single precision (and its troubles) again; it can get you better algorithms.