I recently converted my kernels to read their input through the texture cache.
The program runs fine (Debug and Release, x64 and win32) on my older machine with a 540M (CC 2.1, CUDA 4.1).
On the newer machine with a 650 (CC 3.0, CUDA 5), Debug builds run fine, but Release builds (compiled without -G) produce wrong results.
The older version of my program, which does not use Texture, runs fine on both machines; Debug and Release, x64 and win32.
A cudaDeviceSynchronize() follows all my kernel launches and I do check the return value :-)
I also check the return values of cudaBindTexture(), cudaUnbindTexture(), etc.
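
To make the checking described above concrete, here is a minimal sketch of that pattern. It is not the actual program: texRef, copy_via_texture, run_texture_copy and the CUDA_CHECK macro are hypothetical names chosen for illustration, and the kernel only copies data so the result is easy to verify. It uses the legacy texture reference API that matches the CUDA 4.1 / CUDA 5 toolkits mentioned here (that API was removed in CUDA 12).

```cuda
#include <cassert>
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical helper: report the failing call and make the enclosing
// function return false. Cleanup on the error path is omitted for brevity.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "%s failed: %s\n", #call,                 \
                    cudaGetErrorString(err_));                        \
            return false;                                             \
        }                                                             \
    } while (0)

// Legacy texture reference (file scope), bound to linear device memory.
texture<float, 1, cudaReadModeElementType> texRef;

// Each thread reads one element through the texture cache.
__global__ void copy_via_texture(float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = tex1Dfetch(texRef, i);
}

// Returns true only if every CUDA call succeeded and the data round-trips.
bool run_texture_copy(int n)
{
    assert(n > 0);
    std::vector<float> h_in(n), h_out(n, -1.0f);
    for (int i = 0; i < n; ++i)
        h_in[i] = static_cast<float>(i);

    const size_t bytes = n * sizeof(float);
    float *d_in = NULL, *d_out = NULL;
    CUDA_CHECK(cudaMalloc((void **)&d_in, bytes));
    CUDA_CHECK(cudaMalloc((void **)&d_out, bytes));
    CUDA_CHECK(cudaMemcpy(d_in, h_in.data(), bytes, cudaMemcpyHostToDevice));

    // Bind and check both the return value and the reported byte offset;
    // this sketch treats a nonzero offset as a failure.
    size_t offset = 0;
    CUDA_CHECK(cudaBindTexture(&offset, texRef, d_in, bytes));
    if (offset != 0)
        return false;

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    copy_via_texture<<<blocks, threads>>>(d_out, n);
    CUDA_CHECK(cudaGetLastError());        // launch/configuration errors
    CUDA_CHECK(cudaDeviceSynchronize());   // errors raised during execution

    CUDA_CHECK(cudaMemcpy(h_out.data(), d_out, bytes, cudaMemcpyDeviceToHost));
    CUDA_CHECK(cudaUnbindTexture(texRef));
    CUDA_CHECK(cudaFree(d_in));
    CUDA_CHECK(cudaFree(d_out));

    for (int i = 0; i < n; ++i)
        if (h_out[i] != h_in[i])
            return false;
    return true;
}
```

Calling run_texture_copy(1024) from main() and comparing the Debug and Release results on both machines would be one way to check whether a plain texture fetch already differs between the two builds.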