I use an 8800 GTX and recently moved from CUDA 1.0 to the CUDA 2.0 beta, because the nvcc in 1.0 does not work well when the code uses many registers.
I recompiled the software I originally wrote for CUDA 1.0 and ran it on the 2.0 beta, and found that it always runs slower on 2.0.
For example, one of my kernels took about 300 ms with 1.0, but takes about 400 to 500 ms with the 2.0 beta.
The code itself compiles without any problems on the CUDA 2.0 beta.
Do I need to pass some options to nvcc to get back the performance I had with 1.0?
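
For anyone who wants to reproduce this kind of comparison, below is a minimal sketch of how a single kernel launch could be timed with CUDA events. It is not the code from the software described above: `dummyKernel` and `timeKernelMs` are hypothetical names made up for illustration, the kernel body is a stand-in, and the sketch assumes the event API (`cudaEventRecord`, `cudaEventElapsedTime`) is available in the toolkit being tested.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in kernel; the real kernel from the post is not shown.
__global__ void dummyKernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        data[i] = data[i] * 2.0f + 1.0f;
    }
}

// Times one launch of dummyKernel over n elements.
// Returns the elapsed GPU time in milliseconds, or -1.0f on error.
float timeKernelMs(int n)
{
    float *d_data = NULL;
    if (cudaMalloc((void **)&d_data, n * sizeof(float)) != cudaSuccess) {
        return -1.0f;
    }
    cudaMemset(d_data, 0, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    int threads = 256;
    int blocks = (n + threads - 1) / threads;

    // Warm-up launch so one-time initialization is not counted.
    // Work in stream 0 runs in order, so the start event below
    // is only timestamped after this launch has finished.
    dummyKernel<<<blocks, threads>>>(d_data, n);

    cudaEventRecord(start, 0);
    dummyKernel<<<blocks, threads>>>(d_data, n);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);  // wait until the timed launch is done

    float ms = 0.0f;
    cudaError_t err = cudaEventElapsedTime(&ms, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_data);

    return (err == cudaSuccess) ? ms : -1.0f;
}

int main()
{
    float ms = timeKernelMs(1 << 20);
    if (ms < 0.0f) {
        printf("timing failed\n");
        return 1;
    }
    printf("kernel time: %.3f ms\n", ms);
    return 0;
}
```

Building the same source with both toolkits and comparing the printed times gives a like-for-like number. Since the move was motivated by register usage, it may also be worth comparing the per-kernel register counts that `nvcc --ptxas-options=-v` reports under each version, and trying `-maxrregcount` to see whether it changes the result; whether either explains the slowdown here is an open question.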