What is the floating-point precision on CUDA, and what standard does it follow? I have a bit of code that accumulates values, and I am getting quite different results between the CPU and CUDA-emulation builds; the error then blows up.
Does anyone know of standard practices to mitigate this?
Can you be more specific? GPU arithmetic is supposed to be (largely) IEEE 754 compliant. And if you’re running in emulation, then you’re using the CPU’s floating-point representation anyway, so the GPU’s FP units are irrelevant. Based on your description, I’m guessing that you’re summing all the elements in an array?
To avoid round-off error in long sums, there are two options: use double precision, or use Kahan summation. Double precision requires a compute capability 1.3 device or greater, and you have to pass the -arch sm_13 flag to the nvcc compiler or the doubles will be automatically downgraded to floats.
Kahan summation is a standard trick (see Wikipedia) that uses two floats: one for the sum, and one for the accumulated error on the sum. On most CUDA devices, except the C2050, this should be about twice as fast as accumulating in double, although it is a little harder to read.
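As a minimal sketch (the function name and serial loop are just for illustration; in practice each thread would run this over its own slice of the data before a reduction):

[codebox]
// Kahan (compensated) summation: sum holds the running total,
// c holds the round-off error accumulated so far.
__device__ float kahan_sum(const float *x, int n)
{
    float sum = 0.0f;
    float c   = 0.0f;
    for (int k = 0; k < n; ++k) {
        float y = x[k] - c;    // feed back the previously lost low-order bits
        float t = sum + y;     // the low-order bits of y may be lost here...
        c = (t - sum) - y;     // ...but are recovered algebraically into c
        sum = t;
    }
    return sum;
}
[/codebox]

One caveat: compilers that reassociate floating-point arithmetic (for example, a host compiler with fast-math flags in emulation mode) can optimize the compensation term away.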
Thanks for the tip on Kahan summation. However, wouldn’t the multiplication in my kernel also generate significant errors? I think the error in my case comes from multiplying two floats when one or both of them are quite small.
x and y are the same matrix, and I am multiplying parts of its column vectors together. The IDX2C macro is the standard column-major indexing macro described in the CUBLAS manual.
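For readers following along, that macro is defined in the CUBLAS manual as:

[codebox]
/* Column-major indexing from the CUBLAS manual:
   element (i, j) of a matrix with leading dimension ld. */
#define IDX2C(i,j,ld) (((j)*(ld))+(i))
[/codebox]

The kernel itself: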
[codebox]
__global__ void kernel_mat(int *n, const float *x, int *incx, const float *y, int *incy,
                           int *num_blocks, int *i, int *i1, float *factor, float *gpu_data)
{
    // Flatten the 2D grid into a single thread index.
    const int tid = (blockIdx.x * blockDim.x + threadIdx.x) + (blockIdx.y * gridDim.x);
    if (tid < (*num_blocks)) {
        int k_incx = *incx;
        int k_incy = *incy;
        int k_n    = *n;
        int k_i    = *i;
        int k_i1   = *i1;
        float result = 0.0f;
        // device_acc accumulates the dot product of the two strided
        // column segments into result.
        device_acc(k_n, &x[IDX2C(k_i, k_i1, c_num_equations)], k_incx,
                   &y[IDX2C(k_i1 + tid, k_i1, c_num_equations)], k_incy,
                   result);
        gpu_data[tid] = result / (*factor);
    }
}
[/codebox]
Other options include changing the order of summation (e.g., sorting and summing from smallest to largest) or using a reduction, which also reduces error by tending to keep the summed terms at approximately the same order of magnitude (see the sketch below).
Multiplication in floating point grows relative error at a low and predictable rate (unless you are over- or underflowing). It is the addition of two values of very different magnitude that grows the error quickly, and this is often the case in long sums.
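A minimal sketch of such a tree reduction in shared memory, assuming a single block whose size is a power of two (kernel name and launch parameters are illustrative, not from the code above):

[codebox]
// Pairwise (tree) reduction within one block. At every step each
// addition combines two partial sums built from the same number of
// terms, so the operands tend to stay at comparable magnitudes and
// round-off grows roughly like log2(n) instead of n.
__global__ void reduce_sum(const float *in, float *out, int n)
{
    extern __shared__ float s[];
    int tid = threadIdx.x;
    s[tid] = (tid < n) ? in[tid] : 0.0f;   // pad with zeros past the end
    __syncthreads();
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride)
            s[tid] += s[tid + stride];
        __syncthreads();
    }
    if (tid == 0)
        out[0] = s[0];
}
[/codebox]

Launched as, e.g., reduce_sum<<<1, 256, 256 * sizeof(float)>>>(d_in, d_out, n) for n up to 256; larger arrays need one such pass per block plus a second pass over the per-block partial sums.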
Cool. Good to know. Many thanks for this; I will give it a shot tomorrow. The reason I was wondering is that if you look at the first calculation in the output:
[codebox]
GPU:
x             y            x * y
188.459503:   -0.000000:   -0.000022

CPU:
x             y            x * y
188.459503:   -0.000000:   -0.000007
[/codebox]
I am not sure whether using printf() with higher precision would tell me much, but maybe there is an underflow going on somewhere.
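For instance, printing with %.9g (enough digits to round-trip a 32-bit float, done host-side after copying the values back) would at least show whether that -0.000000 is a true zero or something tiny:

[codebox]
printf("%.9g: %.9g: %.9g\n", x, y, x * y);  /* "-0.000000" may really be, say, -1e-7 */
[/codebox]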
Anyway, I will implement this tomorrow and see what happens.
One last general question:
Does CUBLAS do anything like this to reduce computation error? Say I have a long vector and I use cublasSdot to compute the dot product. Would the terms be summed with such a technique?
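For reference, the call in question, using the legacy (pre-v2) CUBLAS API, where d_x and d_y are device pointers:

[codebox]
/* Single-precision dot product over n elements with unit strides.
   The CUBLAS documentation does not specify the internal summation
   order, so whether any compensation is applied is not stated. */
float d = cublasSdot(n, d_x, 1, d_y, 1);
[/codebox]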