CUDA build rule 3.2 slower than 3.0?

I recently updated to the CUDA Toolkit 3.2 and installed Nsight, and my project now runs more than 10 times slower. Is this due to Nsight or to the build rule?

Also, the compiler does not seem to be smart: I deleted the final result assignments, and there is no change in the runtime.