This defect doesn't occur with cuda-5.0, or on the Kepler and Maxwell cards I tested.
If anyone has a Fermi-architecture card and cuda-6.5 or 7, please run this test. I want to know whether it is really an nvcc error.
Passing -Xptxas -O0 to nvcc turns off optimizations in the compiler backend (ptxas, which compiles PTX to SASS). The backend contains many architecture-specific transformations, which could explain why you observe a difference when building for Fermi rather than for Kepler or Maxwell. If the problem goes away with -Xptxas -O0, that would also seem to exclude a link-time error.
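For anyone who wants to try this, here is a rough sketch of the comparison. The file name test.cu and the output names are placeholders I made up, and -arch=sm_20 assumes a Fermi card and a toolkit that still supports it; adjust both to your setup.

```shell
# Hypothetical file name: test.cu stands in for the reproducer posted in this thread.

# Build 1: default backend optimization, targeting Fermi (sm_20).
nvcc -arch=sm_20 -o test_opt test.cu

# Build 2: same source and target, but with ptxas optimizations disabled.
nvcc -arch=sm_20 -Xptxas -O0 -o test_noopt test.cu

# Run both on the same GPU and compare the printed results.
./test_opt
./test_noopt

# Optional: dump the generated machine code (SASS) of each build for a diff.
cuobjdump -sass test_opt   > test_opt.sass
cuobjdump -sass test_noopt > test_noopt.sass
```

If only the first build gives the wrong result, that points at a backend optimization for that architecture rather than at the source code or the front end.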
For the record, when I compile the code with CUDA 7.5, and then run the resulting executable on an sm_50 class GPU, it returns the correct result, meaning E=1.0.