different results on host and device with same code

I am trying to write code to do simulation on GPU with CUDA.

Periodic boundary conditions are imposed on the simulation box, so the following code handles the coordinate wrapping:

x[i] += dx;
x[i] -= L*rintf(x[i]*invL);

where L and invL are the box length and its inverse. With this implementation, x[i] sometimes comes out exactly equal to half the box length, L2 = L/2.

When I copy the coordinate data back to the host and calculate the cell index (the simulation box is divided into cells of length cellLength, with ncell cells in each direction):

ix = (int)( ( x[i] + L2 )/cellLength );

I sometimes get ix = ncell, and printing x[i] shows it is exactly L2.

So I changed ix = (int)( ( x[i] + L2 )/cellLength ); to:

ix = (int)( ( x[i] + L2 )/cellLength );
if ( ix == ncell ) ix = 0;

I use these two lines on both the CPU and the GPU to calculate each particle's cell index. However, with the same coordinate data, I get different results. Is there an error in my periodic boundary condition or cell-index code, or is it a precision problem? Is floating-point precision different on the CPU and GPU?

Precision is different because, by default, the double datatype is always converted to float on the device. I also happened to run into this here: http://forums.nvidia.com/index.php?showtopic=90206