Precision of floats does CUDA use half precision instead of single precision for floats?

I wrote a simple runge-kutta solver using CUDA, and a Java version using floats. In the CUDA version, a number gets rounded up to 0.010000 which resolves to 0.0099975 in my Java code. This doesn’t make sense to me, since float is an IEEE standard. Also, when I change all my floats to doubles and compile using -arch sm_13, the error doesn’t go away. Any thoughts?

there are two reasons of rounding error when compare GPU’s result to CPU’s result

(1) non-commutative property of floating operation

( a + b ) + c is not equal to a + ( b + c ) 

(2) fused muliply and add in GPU

consider

Csub += AS(ty, k) * BS(k, tx);

nvcc use one MAD to compute it.

However if you want to match CPU’s result, you must use

#ifdef DO_DOUBLE

			Csub = Csub + __dmul_rn( AS(ty, k), BS(k, tx));

#else

			Csub = Csub + __fmul_rn( AS(ty, k), BS(k, tx));		

#endif

please see section A.2 in programming guide 2.3

You can test above two reasons in matrix multiplication example.

#if defined(DO_DOUBLE)

	typedef double doublereal;	

#else

	typedef float doublereal;

#endif

#define CHECK_BANK_CONFLICTS 0

#if CHECK_BANK_CONFLICTS

#define AS(i, j) CUT_BANK_CHECKER(((doublereal*)&As[0][0]), (BLOCK_SIZE * i + j))

#define BS(i, j) CUT_BANK_CHECKER(((doublereal*)&Bs[0][0]), (BLOCK_SIZE * i + j))

#else

#define AS(i, j) As[i][j]

#define BS(i, j) Bs[i][j]

#endif

__global__ void matrixMul( doublereal* C, doublereal* A, doublereal* B, int wA, int wB);

void  matrixMul_wrapper( doublereal* d_C, doublereal* d_A, doublereal* d_B, 

	int hA, int wA, int wB) 

{

	int hC = hA;

	int wC = wB;

	

	// setup execution parameters

	dim3 threads(BLOCK_SIZE, BLOCK_SIZE);

	dim3 grid(wC / threads.x, hC / threads.y);

	// execute the kernel

	matrixMul<<< grid, threads >>>(d_C, d_A, d_B, wA, wB);

}	

__global__ void matrixMul( doublereal* C, doublereal* A, doublereal* B, int wA, int wB)

{

	// Block index

	int bx = blockIdx.x;

	int by = blockIdx.y;

	// Thread index

	int tx = threadIdx.x;

	int ty = threadIdx.y;

	// Index of the first sub-matrix of A processed by the block

	int aBegin = wA * BLOCK_SIZE * by;

	// Index of the last sub-matrix of A processed by the block

	int aEnd   = aBegin + wA - 1;

	// Step size used to iterate through the sub-matrices of A

	int aStep  = BLOCK_SIZE;

	// Index of the first sub-matrix of B processed by the block

	int bBegin = BLOCK_SIZE * bx;

	// Step size used to iterate through the sub-matrices of B

	int bStep  = BLOCK_SIZE * wB;

	// Csub is used to store the element of the block sub-matrix

	// that is computed by the thread

	doublereal  Csub = 0.0;

	// Loop over all the sub-matrices of A and B

	// required to compute the block sub-matrix

	for (int a = aBegin, b = bBegin;

			 a <= aEnd;

			 a += aStep, b += bStep) {

		// Declaration of the shared memory array As used to

		// store the sub-matrix of A

		__shared__ doublereal As[BLOCK_SIZE][BLOCK_SIZE];

		// Declaration of the shared memory array Bs used to

		// store the sub-matrix of B

		__shared__ doublereal Bs[BLOCK_SIZE][BLOCK_SIZE];

		// Load the matrices from device memory

		// to shared memory; each thread loads

		// one element of each matrix

		AS(ty, tx) = A[a + wA * ty + tx];

		BS(ty, tx) = B[b + wB * ty + tx];

		// Synchronize to make sure the matrices are loaded

		__syncthreads();

		// Multiply the two matrices together;

		// each thread computes one element

		// of the block sub-matrix

		for (int k = 0; k < BLOCK_SIZE; ++k){

			Csub += AS(ty, k) * BS(k, tx);	

		}

		// Synchronize to make sure that the preceding

		// computation is done before loading two new

		// sub-matrices of A and B in the next iteration

		__syncthreads();

	}

	// Write the block sub-matrix to device memory;

	// each thread writes one element

	int c = wB * BLOCK_SIZE * by + BLOCK_SIZE * bx;

	C[c + wB * ty + tx] = Csub;

}

A couple of other thoughts… if you’re on an x86 processor, and you aren’t forcing usage of SSE, it’s possible that all your FP operations on the CPU are going through the x87 unit. That uses 80-bit precision for everything, and hence will give different answers. Also, IIRC, all negative powers of 10 are infinitely recurring decimals in binary, so getting a result of 0.01000000 means that something got rounded somewhere (probably on output).

Indeed, in modern X86 compilers depending on your compiler flags, many of the internal operations use double precision and only the result is single precision. This situation of difference of internal precision is not only common to X86 vs GPU, but also X86 vs others architectures ( http://www.christian-seiler.de/projekte/fpmath/ ).

Sam

My apologies for nit-picking, but this is actually the “non-associativity” property of floating point operations. I believe floating point is still commutative. That is to say: a + b = b + a.

Try printing the CUDA result with more digits:
[font=“Courier New”] printf("%.10f\n", x);[/font]