Shared Memory vs Registers' floating point accuracy

I am running a kernel which has computations involving very small values.

My problem is:
“Each thread runs a loop 720 times.
There are two particular values which do not depend on the loop index.
The GPU output differs if I move this computation inside the loop and recalculate it on every iteration.
Though the difference is on the order of 1e-4, I am not sure why this is happening when the calculation is the same and the hardware is the same.
Has somebody experienced a similar problem?”
All data-types are float
[GT640, CUDA 5.0]
Code1:
temVy = (float)((CudaParamsD.VxNum[1]/2 - j) - 0.5) * CudaParamsD.SizeObj[1];
temVx = (float)((i - CudaParamsD.VxNum[0]/2) + 0.5) * CudaParamsD.SizeObj[0];

for(int IndAng=0; IndAng < CudaParamsD.DetNum[0]; IndAng++)
{
    ---
    ---
}

Code2:

for(int IndAng=0; IndAng < CudaParamsD.DetNum[0]; IndAng++)
{
    temVy = (float)((CudaParamsD.VxNum[1]/2 - j) - 0.5) * CudaParamsD.SizeObj[1];
    temVx = (float)((i - CudaParamsD.VxNum[0]/2) + 0.5) * CudaParamsD.SizeObj[0];
    --------
    --------
}

Here i and j are global thread IDs and are not loop variables.

Just a note: don’t use 0.5 - always use 0.5f. A plain 0.5 is a double literal, so the whole expression gets promoted to double, and you may see your CUDA code taking the slower double-precision code path.