CUDA inaccuracy? CUDA float produces different result from CPU float

Hi everyone. I'm writing CUDA code to calculate the electron density inside a quantum dot, and the part that solves the Poisson equation is written in both CUDA and C++.

My problem is that when I use the CUDA code the results are quite different from the CPU, yet they still converge (which I think they shouldn't, given the error, but I'm not sure). Is this normal for compute capability 1.1 devices (GeForce 9500GT)? The MSE is about 0.000001, but on an 81*81 grid where the largest value is about 0.001 that is really high, and the electron density graphs look different.
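For reference, this is the kind of comparison I mean; a minimal sketch only (cpuOut, gpuOut and CompareResults are placeholder names, not the actual variables and functions in my code):

#include <math.h>
#include <stdio.h>

// Compare the CPU result against the GPU result after it has been copied
// back to the host: report the MSE, the largest absolute difference and
// the largest magnitude in the reference data.
void CompareResults(const float* cpuOut, const float* gpuOut, int numPoints)
{
    double mse = 0.0, maxDiff = 0.0, maxVal = 0.0;
    long n = (long)numPoints * (long)numPoints;
    for (long i = 0; i < n; i++)
    {
        double d = (double)cpuOut[i] - (double)gpuOut[i];
        if (fabs(d) > maxDiff) maxDiff = fabs(d);
        if (fabs(cpuOut[i]) > maxVal) maxVal = fabs(cpuOut[i]);
        mse += d * d;
    }
    mse /= (double)n;
    printf("MSE = %g, max |diff| = %g, max |cpu| = %g\n", mse, maxDiff, maxVal);
}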

Here is the code for both GPU and CPU:

GPU:

__global__ void CudaPoissonSolve(float* rho, float* out, int numPoints, float delta)
{
    int j = threadIdx.x + blockIdx.x * blockDim.x;
    int i = threadIdx.y + blockIdx.y * blockDim.y;

    int idx = j + i * numPoints;
    if( i >= numPoints || j >= numPoints )
        return;
    float tempout = 0.0f;

    for(int k = 0; k < numPoints; k++)
    {
        float x1 = j * delta;
        float y1 = i * delta;
        for(int l = 0; l < numPoints; l++)
        {
            if( i == k && j == l )
                tempout += 2.0f * 1.77245385091f * rho[idx] / delta;
            else
            {
                float x2 = l * delta;
                float y2 = k * delta;
                float dx = x1 - x2, dy = y1 - y2;
                int index = j + i * numPoints;
                tempout += rho[index] / sqrtf( (dx * dx + dy * dy) );
            }
        }
    }
    out[idx] = tempout * delta * delta;
}

CPU:

void QuantumDot::PoissonSolve(QDGrid& rho, QDGrid& out)
{
    for(long i = 0; i < (long)mNumPoints; i++)
        for(long j = 0; j < (long)mNumPoints; j++)
        {
            float x1 = MeshX(i, j), y1 = MeshY(i, j);
            for(long k = 0; k < (long)mNumPoints; k++)
                for(long l = 0; l < (long)mNumPoints; l++)
                {
                    float x2 = MeshX(k, l), y2 = MeshY(k, l);
                    float dx = x1 - x2, dy = y1 - y2;

                    if( i == k && j == l )
                        out.At(i, j) += 2.0f * 1.77245385091f * rho.At(i, j) / mDelta;
                    else
                        out.At(i, j) += rho.At(k, l) / sqrt(dx * dx + dy * dy);
                }

            out.At(i, j) *= mDelta2; // mDelta2 = mDelta * mDelta
        }
}
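
(The kernel launch is not shown above; a minimal sketch of a typical launch for an 81*81 grid on a compute capability 1.1 card would be something like the following, where d_rho and d_out are placeholder device pointers:)

// 16x16 threads per block (256 threads, well under the 512-thread limit of
// compute capability 1.1), and enough blocks to cover the whole grid.
dim3 block(16, 16);
dim3 grid((numPoints + block.x - 1) / block.x,
          (numPoints + block.y - 1) / block.y);
CudaPoissonSolve<<<grid, block>>>(d_rho, d_out, numPoints, delta);
cudaThreadSynchronize(); // wait for the kernel before copying results back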

Both the single-precision division operator “/” and the single-precision square root “sqrtf” are approximate on sm_1x devices. See the appropriate appendix of the CUDA Programming Guide for error bounds.

For sm_1x devices, the only way to get single-precision division and square root with IEEE-754 round-to-nearest-or-even rounding (which is what the CPU provides) is via intrinsics (a.k.a. device functions). Note that their performance is significantly lower than that of the approximate versions. To give this a try, replace

tempout += 2.0f * 1.77245385091f * rho[idx] / delta;

tempout += rho[index] / sqrtf( (dx * dx + dy * dy) );

with

tempout += __fdiv_rn (2.0f * 1.77245385091f * rho[idx], delta);

tempout += __fdiv_rn (rho[index], __fsqrt_rn ( (dx * dx + dy * dy) ));

Depending on the numerical properties of your code, one of those changes by itself may be sufficient. Before you try the sqrt() changes above, you might want to try explicit use of the approximate reciprocal square root rsqrtf() instead, as that could provide both better performance and better accuracy:

tempout += rho[index] * rsqrtf( (dx * dx + dy * dy) );

Thanks, but they didn't work; there is no noticeable change from my own code. I even changed them to

tempout = __fadd_rn(tempout, __fmul_rn(3.54490770182f, rho[idx]));

tempout = __fadd_rn(tempout, __fdiv_rn(rho[index], __fsqrt_rn( (dx * dx + dy * dy) )));

I also removed some of the divisions that would later be cancelled by multiplications anyway (a numerical optimization, you might say), but nothing changed.

Do you have any more suggestions?

I am not sure whether sqrt(dx * dx + dy * dy) on the CPU uses double precision or single precision.
If the CPU does the sqrt in double precision, then that is a problem, because you are computing a singular integral,

integral( G(x,y) * rho(y) dy ), where the Green's function G(x,y) = 1/|x - y| is singular near x,

i.e. if dx ~ delta and dy ~ delta, then sqrt(dx * dx + dy * dy) differs from sqrtf(dx * dx + dy * dy).

I think you should try a double-precision version (replace float by double) and check how large the deviation is.
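
A minimal sketch of such a check for a single near-singular term, for example with dx = dy = delta (illustration only, not your actual code):

#include <math.h>
#include <stdio.h>

int main(void)
{
    // Illustration only: compare the 1/distance term for dx = dy = delta
    // when the square root is done in single vs. double precision.
    float  delta = 0.6f;   // example grid spacing
    float  f = 1.0f / sqrtf(delta * delta + delta * delta);
    double d = 1.0  / sqrt((double)delta * (double)delta + (double)delta * (double)delta);
    printf("float: %.9g  double: %.17g  diff: %g\n", f, d, (double)f - d);
    return 0;
}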

The CPU version using float has no deviation from the CPU version using double in this case. The problem starts when I use CUDA.

Also, in this case delta isn't small enough to cause a singularity; it's about 0.6.

You have a typo in the CUDA code:

int index = j + i * numPoints;

tempout += rho[index] / sqrtf( (dx * dx + dy * dy) );

should be

int index = l + k * numPoints;

tempout += rho[index] / sqrtf( (dx * dx + dy * dy) );

If following up on Lung Sheng’s remark does not resolve the discrepancies, here are a couple of additional potential causes for numerical differences between CPU code and GPU code that you could look into.

Note that CPU code frequently keeps intermediate results in registers, which provide at least double precision, if not extended precision (the latter frequently on Linux). To force the CPU computation strictly to single precision (which matches what happens in the GPU code), you can try changing the CPU code as follows:

#if 0
  out.At(i, j) += 2.0f * 1.77245385091f * rho.At(i, j) / mDelta;
#else // 0
  {
    volatile float temp;
    temp = 2.0f * 1.77245385091f;
    temp = temp * rho.At(i, j);
    temp = temp / mDelta;
    out.At(i, j) = out.At(i, j) + temp;
  }
#endif // 0

#if 0
  out.At(i, j) += rho.At(k, l) / sqrt(dx * dx + dy * dy);
#else // 0
  {
    volatile float temp1, temp2;
    temp1 = dx * dx;
    temp2 = dy * dy;
    temp1 = temp1 + temp2;
    temp1 = sqrtf(temp1);
    temp1 = rho.At(k, l) / temp1;
    out.At(i, j) = out.At(i, j) + temp1;
  }
#endif // 0

Lastly, on the GPU some adds and multiplies might get contracted into multiply-add operations. While on sm_2x GPUs an IEEE-754 (2008) compliant FMA (fused multiply-add) is used for this, on sm_1x devices the fused operation uses a multiply with truncation. To ensure that multiply-add contraction does not take place, you can use intrinsics as follows:

float x2 = __fmul_rn(l, delta);
float y2 = __fmul_rn(k, delta);
float dx = __fadd_rn(x1, -x2), dy = __fadd_rn(y1, -y2);
int index = j + i * numPoints;
tempout = __fadd_rn (tempout, __fdiv_rn (rho[index], __fsqrt_rn (__fadd_rn (__fmul_rn (dx, dx), __fmul_rn (dy, dy)))));

Thanks everyone! It's solved and works correctly now. I'm sorry to trouble you with these questions and (stupid) mistakes. I'm not an amateur programmer and I don't know why I'm making such mistakes, but thanks again.

This is the way I got around this problem:

I couldn't get -prec-sqrt=true to work, so I went for this algorithm: the Babylonian method (Methods of computing square roots - Wikipedia).

It works like this: to calculate the square root of S,

x_0 = initial guess;
x_(n+1) = 0.5 * ( x_n + S / x_n );

As my initial guess I chose CUDA's own sqrtf()!
For example, sqrt(1234567891) is calculated below:


double x_n;
x_n = sqrtf(1234567891);
double s = 1234567891;
for (int tt = 1; tt < 10; ++tt) {
    x_n = 1.0 / 2.0 * (x_n + s / x_n);
}


These are the results:

Matlab on CPU:
3.513641830067487e+004

Babylonian method above with GPU:
3.5136418300674872e+004

Cuda sqrtf:
3.513641796875e+004

As you can see, the method above reaches results that compare very well with the CPU in a very low number of iterations, while CUDA's sqrtf is really far off!
(I haven't tried optimizing that tt = 10 iteration count.)
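
For anyone who wants to reuse this, here is a minimal sketch of the same idea wrapped in a function (refined_sqrt and the fixed iteration count are placeholder choices of mine, not tuned; note also that full double-precision arithmetic on the device requires a compute capability 1.3 or higher GPU):

// Sketch of the Babylonian / Newton refinement described above.
// refined_sqrt is a placeholder name and the iteration count is not tuned.
__host__ __device__ double refined_sqrt(double s)
{
    double x = sqrtf((float)s);   // cheap initial guess from single-precision sqrtf
    if (x == 0.0)                 // avoid dividing by zero for s == 0
        return 0.0;
    for (int i = 0; i < 4; ++i)   // a few iterations (the test above used 10)
        x = 0.5 * (x + s / x);    // Babylonian / Newton step
    return x;
}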