# Precision of floats: does CUDA use half precision instead of single precision for floats?

I wrote a simple Runge-Kutta solver using CUDA, and a Java version using floats. In the CUDA version, a number comes out rounded up to 0.010000, whereas my Java code gives 0.0099975. This doesn't make sense to me, since float is an IEEE standard type in both. Also, when I change all my floats to doubles and compile with -arch sm_13, the error doesn't go away. Any thoughts?

There are two sources of rounding error when comparing the GPU's result to the CPU's result:

(1) the non-commutative property of floating-point operations:

```
(a + b) + c is not equal to a + (b + c)
```

(2) fused multiply-add (MAD) on the GPU.

Consider:

```
Csub += AS(ty, k) * BS(k, tx);
```

nvcc uses a single MAD instruction to compute this.

However, if you want to match the CPU's result, you must split the multiply and the add:

```
#ifdef DO_DOUBLE
    Csub = Csub + __dmul_rn( AS(ty, k), BS(k, tx) );
#else
    Csub = Csub + __fmul_rn( AS(ty, k), BS(k, tx) );
#endif
```

Please see section A.2 of the CUDA Programming Guide 2.3.

You can test both of these effects in the matrix multiplication example:

```
#if defined(DO_DOUBLE)
typedef double doublereal;
#else
typedef float doublereal;
#endif

#define BLOCK_SIZE 16   // tile size; must divide the matrix dimensions

#define CHECK_BANK_CONFLICTS 0
#if CHECK_BANK_CONFLICTS
#define AS(i, j) CUT_BANK_CHECKER(((doublereal*)&As), (BLOCK_SIZE * i + j))
#define BS(i, j) CUT_BANK_CHECKER(((doublereal*)&Bs), (BLOCK_SIZE * i + j))
#else
#define AS(i, j) As[i][j]
#define BS(i, j) Bs[i][j]
#endif

__global__ void matrixMul(doublereal* C, doublereal* A, doublereal* B, int wA, int wB);

void matrixMul_wrapper(doublereal* d_C, doublereal* d_A, doublereal* d_B,
                       int hA, int wA, int wB)
{
    int hC = hA;
    int wC = wB;

    // setup execution parameters
    dim3 threads(BLOCK_SIZE, BLOCK_SIZE);
    dim3 grid(wC / BLOCK_SIZE, hC / BLOCK_SIZE);

    // execute the kernel
    matrixMul<<< grid, threads >>>(d_C, d_A, d_B, wA, wB);
}

__global__ void matrixMul(doublereal* C, doublereal* A, doublereal* B, int wA, int wB)
{
    // Block index
    int bx = blockIdx.x;
    int by = blockIdx.y;

    // Thread index
    int tx = threadIdx.x;
    int ty = threadIdx.y;

    // Index of the first sub-matrix of A processed by the block
    int aBegin = wA * BLOCK_SIZE * by;

    // Index of the last sub-matrix of A processed by the block
    int aEnd   = aBegin + wA - 1;

    // Step size used to iterate through the sub-matrices of A
    int aStep  = BLOCK_SIZE;

    // Index of the first sub-matrix of B processed by the block
    int bBegin = BLOCK_SIZE * bx;

    // Step size used to iterate through the sub-matrices of B
    int bStep  = BLOCK_SIZE * wB;

    // Csub is used to store the element of the block sub-matrix
    // that is computed by the thread
    doublereal Csub = 0.0;

    // Loop over all the sub-matrices of A and B
    // required to compute the block sub-matrix
    for (int a = aBegin, b = bBegin; a <= aEnd; a += aStep, b += bStep) {

        // Declaration of the shared memory arrays used to
        // store the sub-matrices of A and B
        __shared__ doublereal As[BLOCK_SIZE][BLOCK_SIZE];
        __shared__ doublereal Bs[BLOCK_SIZE][BLOCK_SIZE];

        // Load the matrices from device memory;
        // each thread loads one element of each matrix
        AS(ty, tx) = A[a + wA * ty + tx];
        BS(ty, tx) = B[b + wB * ty + tx];

        // Synchronize to make sure the matrices are loaded
        __syncthreads();

        // Multiply the two matrices together;
        // each thread computes one element of the block sub-matrix
        for (int k = 0; k < BLOCK_SIZE; ++k) {
            Csub += AS(ty, k) * BS(k, tx);
        }

        // Synchronize to make sure the computation is done before
        // overwriting the sub-matrices of A and B in the next iteration
        __syncthreads();
    }

    // Write the block sub-matrix to device memory;
    // each thread writes one element
    int c = wB * BLOCK_SIZE * by + BLOCK_SIZE * bx;
    C[c + wB * ty + tx] = Csub;
}
```

A couple of other thoughts… if you’re on an x86 processor and you aren’t forcing the use of SSE, it’s possible that all your FP operations on the CPU are going through the x87 unit. That uses 80-bit precision internally, and hence will give different answers. Also, IIRC, all negative powers of 10 are infinitely recurring fractions in binary, so getting a result of exactly 0.010000 means that something got rounded somewhere (probably on output).

Indeed, with modern x86 compilers, depending on your compiler flags, many of the intermediate operations are carried out in double precision and only the final result is rounded to single precision. This difference in internal precision is not unique to x86 vs. GPU; it also shows up between x86 and other architectures ( http://www.christian-seiler.de/projekte/fpmath/ ).

Sam

My apologies for nit-picking, but this is actually the “non-associativity” property of floating-point operations. I believe floating-point addition is still commutative. That is to say: a + b = b + a.

Try printing the CUDA result with more digits:

```
printf("%.10f\n", x);
```