For well-optimized, single-precision, compute-bound kernels, it sounds like you’ll get roughly a factor of 2 over a GTX 285 or Tesla C1060. For memory-bound kernels that already have good coalescing, expect a little less than a factor of 2. (Probably. Although the architecture has been well documented, the clock rates of the soon-to-be-released hardware are still the subject of rumors.)
One thing to consider is that the huge architectural improvements in Fermi will speed up code that did not run well in CUDA before by much more than a factor of 2. It’s hard to quantify how much improvement that will be without getting your hands on some hardware, though. I have several test kernels I want to write and run on Fermi to see how they perform. Algorithms that make heavy use of atomics, that randomly access data small enough to fit in the L1 or L2 caches, or that use a lot of double-precision operations will see much larger performance increases.
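To give an idea of the kind of test kernel I mean, here is a minimal sketch of an atomics-heavy benchmark: a 256-bin histogram where every input element costs one global `atomicAdd`. This is only an illustration, not a tuned benchmark. The kernel name `histogramAtomic`, the input size, and the launch configuration (120 blocks of 256 threads) are all arbitrary choices of mine, and I have not run it on Fermi, so I make no claim about the numbers it will print.

```cuda
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

#define NUM_BINS 256

// Each thread walks the input with a grid-stride loop and increments one bin
// per element using a global-memory atomicAdd, so run time is dominated by
// atomic throughput rather than arithmetic.
__global__ void histogramAtomic(const unsigned char *data, size_t n,
                                unsigned int *bins)
{
    size_t stride = (size_t)blockDim.x * gridDim.x;
    for (size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
        atomicAdd(&bins[data[i]], 1u);
}

int main()
{
    const size_t n = 1 << 24;  // 16M input bytes (arbitrary test size)
    std::vector<unsigned char> h_data(n);
    std::vector<unsigned int> h_ref(NUM_BINS, 0), h_bins(NUM_BINS, 0);

    // Random input plus a CPU reference histogram for checking the result.
    srand(1234);
    for (size_t i = 0; i < n; ++i) {
        h_data[i] = (unsigned char)(rand() & 0xFF);
        ++h_ref[h_data[i]];
    }

    unsigned char *d_data = NULL;
    unsigned int *d_bins = NULL;
    cudaMalloc(&d_data, n);
    cudaMalloc(&d_bins, NUM_BINS * sizeof(unsigned int));
    cudaMemcpy(d_data, h_data.data(), n, cudaMemcpyHostToDevice);
    cudaMemset(d_bins, 0, NUM_BINS * sizeof(unsigned int));

    // Time only the kernel with CUDA events.
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    histogramAtomic<<<120, 256>>>(d_data, n, d_bins);  // launch shape is a guess
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess) {
        printf("CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaMemcpy(h_bins.data(), d_bins, NUM_BINS * sizeof(unsigned int),
               cudaMemcpyDeviceToHost);

    // Compare the GPU histogram against the CPU reference.
    bool ok = true;
    for (int b = 0; b < NUM_BINS; ++b)
        if (h_bins[b] != h_ref[b]) ok = false;

    printf("%s, %.3f ms, %.1f M atomics/s\n", ok ? "PASS" : "FAIL",
           ms, (double)n / (ms * 1e3));

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_data);
    cudaFree(d_bins);
    return ok ? 0 : 1;
}
```

Global 32-bit `atomicAdd` is available from compute capability 1.1 onward, so the same source should build for both GT200 and Fermi, which makes the atomics-per-second figure directly comparable between the two.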
So in some ways, the people who should look most closely at Fermi are the people with kernels that run poorly on GT200.