# Precision of floats: does CUDA use half precision instead of single precision for floats?

I wrote a simple Runge-Kutta solver using CUDA, and a Java version using floats. In the CUDA version, a number comes out rounded up to 0.010000, whereas my Java code gives 0.0099975. This doesn't make sense to me, since float is an IEEE standard type in both. Also, when I change all my floats to doubles and compile with -arch sm_13, the error doesn't go away. Any thoughts?

There are two sources of rounding error when comparing the GPU's result to the CPU's result:

(1) the non-commutative property of floating-point operations:

```
(a + b) + c is not equal to a + (b + c)
```

(2) fused multiply-add (MAD) on the GPU.

Consider:

```
Csub += AS(ty, k) * BS(k, tx);
```

nvcc uses a single MAD instruction to compute this.

However, if you want to match the CPU's result, you must split the multiply and the add:

```
#ifdef DO_DOUBLE
    Csub = Csub + __dmul_rn( AS(ty, k), BS(k, tx) );
#else
    Csub = Csub + __fmul_rn( AS(ty, k), BS(k, tx) );
#endif
```

Please see section A.2 of the CUDA Programming Guide 2.3.

You can test both of these effects in the matrix multiplication example:

```
#if defined(DO_DOUBLE)
typedef double doublereal;
#else
typedef float doublereal;
#endif

#define BLOCK_SIZE 16   // tile size; must divide the matrix dimensions

#define CHECK_BANK_CONFLICTS 0
#if CHECK_BANK_CONFLICTS
#define AS(i, j) CUT_BANK_CHECKER(((doublereal*)&As), (BLOCK_SIZE * i + j))
#define BS(i, j) CUT_BANK_CHECKER(((doublereal*)&Bs), (BLOCK_SIZE * i + j))
#else
#define AS(i, j) As[i][j]
#define BS(i, j) Bs[i][j]
#endif

__global__ void matrixMul(doublereal* C, doublereal* A, doublereal* B, int wA, int wB);

void matrixMul_wrapper(doublereal* d_C, doublereal* d_A, doublereal* d_B,
                       int hA, int wA, int wB)
{
    int hC = hA;
    int wC = wB;

    // setup execution parameters
    dim3 threads(BLOCK_SIZE, BLOCK_SIZE);
    dim3 grid(wC / BLOCK_SIZE, hC / BLOCK_SIZE);

    // execute the kernel
    matrixMul<<< grid, threads >>>(d_C, d_A, d_B, wA, wB);
}

__global__ void matrixMul(doublereal* C, doublereal* A, doublereal* B, int wA, int wB)
{
    // Block index
    int bx = blockIdx.x;
    int by = blockIdx.y;

    // Thread index
    int tx = threadIdx.x;
    int ty = threadIdx.y;

    // Index of the first sub-matrix of A processed by the block
    int aBegin = wA * BLOCK_SIZE * by;

    // Index of the last sub-matrix of A processed by the block
    int aEnd   = aBegin + wA - 1;

    // Step size used to iterate through the sub-matrices of A
    int aStep  = BLOCK_SIZE;

    // Index of the first sub-matrix of B processed by the block
    int bBegin = BLOCK_SIZE * bx;

    // Step size used to iterate through the sub-matrices of B
    int bStep  = BLOCK_SIZE * wB;

    // Csub is used to store the element of the block sub-matrix
    // that is computed by the thread
    doublereal Csub = 0.0;

    // Loop over all the sub-matrices of A and B
    // required to compute the block sub-matrix
    for (int a = aBegin, b = bBegin; a <= aEnd; a += aStep, b += bStep) {

        // Declaration of the shared memory arrays used to
        // store the sub-matrices of A and B
        __shared__ doublereal As[BLOCK_SIZE][BLOCK_SIZE];
        __shared__ doublereal Bs[BLOCK_SIZE][BLOCK_SIZE];

        // Load the matrices from device memory;
        // each thread loads one element of each matrix
        AS(ty, tx) = A[a + wA * ty + tx];
        BS(ty, tx) = B[b + wB * ty + tx];

        // Synchronize to make sure the matrices are loaded
        __syncthreads();

        // Multiply the two matrices together;
        // each thread computes one element of the block sub-matrix
        for (int k = 0; k < BLOCK_SIZE; ++k) {
            Csub += AS(ty, k) * BS(k, tx);
        }

        // Synchronize to make sure the computation is done before
        // overwriting the sub-matrices of A and B in the next iteration
        __syncthreads();
    }

    // Write the block sub-matrix to device memory;
    // each thread writes one element
    int c = wB * BLOCK_SIZE * by + BLOCK_SIZE * bx;
    C[c + wB * ty + tx] = Csub;
}
```

A couple of other thoughts… if you’re on an x86 processor and you aren’t forcing the use of SSE, it’s possible that all your FP operations on the CPU are going through the x87 unit. That uses 80-bit precision internally, and hence will give different answers. Also, IIRC, all negative powers of 10 are infinitely recurring fractions in binary, so getting a result of exactly 0.010000 means that something got rounded somewhere (probably on output).

Indeed, with modern x86 compilers, depending on your compiler flags, many of the intermediate operations are carried out in double precision and only the final result is rounded to single precision. This difference in internal precision is not unique to x86 vs. GPU; it also shows up between x86 and other architectures ( http://www.christian-seiler.de/projekte/fpmath/ ).

Sam

My apologies for nit-picking, but this is actually the “non-associativity” property of floating-point operations. I believe floating-point addition is still commutative. That is to say: a + b = b + a.

Try printing the CUDA result with more digits:

```
printf("%.10f\n", x);
```