# Precision of floats: does CUDA use half precision instead of single precision for floats?

I wrote a simple Runge-Kutta solver using CUDA, and a Java version using floats. In the CUDA version, a number comes out rounded up to 0.010000, whereas my Java code gives 0.0099975. This doesn't make sense to me, since float is an IEEE standard type in both. Also, when I change all my floats to doubles and compile with -arch sm_13, the error doesn't go away. Any thoughts?

There are two sources of rounding error when comparing the GPU's result to the CPU's result:

(1) the non-commutative property of floating-point operations:

```
(a + b) + c is not equal to a + (b + c)
```

(2) fused multiply-add (MAD) on the GPU.

Consider:

```
Csub += AS(ty, k) * BS(k, tx);
```

nvcc uses a single MAD instruction to compute this.

However, if you want to match the CPU's result, you must split the multiply and the add:

```
#ifdef DO_DOUBLE
    Csub = Csub + __dmul_rn( AS(ty, k), BS(k, tx) );
#else
    Csub = Csub + __fmul_rn( AS(ty, k), BS(k, tx) );
#endif
```

Please see section A.2 of the CUDA Programming Guide 2.3.

You can test both of these effects in the matrix multiplication example:

```
#if defined(DO_DOUBLE)
typedef double doublereal;
#else
typedef float doublereal;
#endif

#define BLOCK_SIZE 16   // tile size; must divide the matrix dimensions

#define CHECK_BANK_CONFLICTS 0
#if CHECK_BANK_CONFLICTS
#define AS(i, j) CUT_BANK_CHECKER(((doublereal*)&As), (BLOCK_SIZE * i + j))
#define BS(i, j) CUT_BANK_CHECKER(((doublereal*)&Bs), (BLOCK_SIZE * i + j))
#else
#define AS(i, j) As[i][j]
#define BS(i, j) Bs[i][j]
#endif

__global__ void matrixMul(doublereal* C, doublereal* A, doublereal* B, int wA, int wB);

void matrixMul_wrapper(doublereal* d_C, doublereal* d_A, doublereal* d_B,
                       int hA, int wA, int wB)
{
    int hC = hA;
    int wC = wB;

    // setup execution parameters
    dim3 threads(BLOCK_SIZE, BLOCK_SIZE);
    dim3 grid(wC / BLOCK_SIZE, hC / BLOCK_SIZE);

    // execute the kernel
    matrixMul<<< grid, threads >>>(d_C, d_A, d_B, wA, wB);
}

__global__ void matrixMul(doublereal* C, doublereal* A, doublereal* B, int wA, int wB)
{
    // Block index
    int bx = blockIdx.x;
    int by = blockIdx.y;

    // Thread index
    int tx = threadIdx.x;
    int ty = threadIdx.y;

    // Index of the first sub-matrix of A processed by the block
    int aBegin = wA * BLOCK_SIZE * by;

    // Index of the last sub-matrix of A processed by the block
    int aEnd   = aBegin + wA - 1;

    // Step size used to iterate through the sub-matrices of A
    int aStep  = BLOCK_SIZE;

    // Index of the first sub-matrix of B processed by the block
    int bBegin = BLOCK_SIZE * bx;

    // Step size used to iterate through the sub-matrices of B
    int bStep  = BLOCK_SIZE * wB;

    // Csub is used to store the element of the block sub-matrix
    // that is computed by the thread
    doublereal Csub = 0.0;

    // Loop over all the sub-matrices of A and B
    // required to compute the block sub-matrix
    for (int a = aBegin, b = bBegin; a <= aEnd; a += aStep, b += bStep) {

        // Declaration of the shared memory arrays used to
        // store the sub-matrices of A and B
        __shared__ doublereal As[BLOCK_SIZE][BLOCK_SIZE];
        __shared__ doublereal Bs[BLOCK_SIZE][BLOCK_SIZE];

        // Load the matrices from device memory;
        // each thread loads one element of each matrix
        AS(ty, tx) = A[a + wA * ty + tx];
        BS(ty, tx) = B[b + wB * ty + tx];

        // Synchronize to make sure the matrices are loaded
        __syncthreads();

        // Multiply the two matrices together;
        // each thread computes one element of the block sub-matrix
        for (int k = 0; k < BLOCK_SIZE; ++k) {
            Csub += AS(ty, k) * BS(k, tx);
        }

        // Synchronize to make sure the computation is done before
        // overwriting the sub-matrices of A and B in the next iteration
        __syncthreads();
    }

    // Write the block sub-matrix to device memory;
    // each thread writes one element
    int c = wB * BLOCK_SIZE * by + BLOCK_SIZE * bx;
    C[c + wB * ty + tx] = Csub;
}
```

A couple of other thoughts… if you’re on an x86 processor and you aren’t forcing the use of SSE, it’s possible that all your FP operations on the CPU are going through the x87 unit. That uses 80-bit precision internally, and hence will give different answers. Also, IIRC, all negative powers of 10 are infinitely recurring fractions in binary, so getting a result of exactly 0.010000 means that something got rounded somewhere (probably on output).

Indeed, with modern x86 compilers, depending on your compiler flags, many of the intermediate operations are carried out in double precision and only the final result is rounded to single precision. This difference in internal precision is not unique to x86 vs. GPU; it also shows up between x86 and other architectures ( http://www.christian-seiler.de/projekte/fpmath/ ).

Sam

My apologies for nit-picking, but this is actually the “non-associativity” property of floating-point operations. I believe floating-point addition is still commutative. That is to say: a + b = b + a.

Try printing the CUDA result with more digits:

```
printf("%.10f\n", x);
```