What is the floating-point precision on CUDA, and what standard does it follow? I have a bit of code that accumulates values, and I am getting quite different results between the CPU and CUDA-emulation builds; the error then blows up.
Does anyone know of standard practices to mitigate this?
Can you be more specific? GPU arithmetic is supposed to be (largely) IEEE 754 compliant. And if you’re running in emulation, then you’re using the CPU’s floating-point representation anyway, so the GPU’s FP units are irrelevant. Based on your description, I’m guessing that you’re summing all the elements in an array?
To avoid round-off error in long sums, there are two options: use double precision, or use Kahan summation. Double precision requires a compute capability 1.3 device or greater, and you have to pass the -arch sm_13 flag to the nvcc compiler or the doubles will be automatically downgraded to floats.
Kahan summation is a standard trick (see Wikipedia) that uses two floats: one for the sum, and one for the accumulated error on the sum. On most CUDA devices, except the C2050, this should be about twice as fast as accumulating in double, although it is a little harder to read.
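As a minimal sketch (the function name and serial loop are just for illustration; in practice each thread would run this over its own slice of the data before a reduction):

[codebox]
// Kahan (compensated) summation: sum holds the running total,
// c holds the round-off error accumulated so far.
__device__ float kahan_sum(const float *x, int n)
{
    float sum = 0.0f;
    float c   = 0.0f;
    for (int k = 0; k < n; ++k) {
        float y = x[k] - c;    // feed back the previously lost low-order bits
        float t = sum + y;     // the low-order bits of y may be lost here...
        c = (t - sum) - y;     // ...but are recovered algebraically into c
        sum = t;
    }
    return sum;
}
[/codebox]

One caveat: compilers that reassociate floating-point arithmetic (for example, a host compiler with fast-math flags in emulation mode) can optimize the compensation term away.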
Thanks for the tip on Kahan summation. However, wouldn’t the multiplication in my kernel also generate significant errors? I think the error in my case comes from multiplying two floats when one or both of them are quite small.
x and y are the same matrix, and I am multiplying parts of its column vectors together. The IDX2C macro is the standard column-major indexing macro described in the CUBLAS manual.
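For readers following along, that macro is defined in the CUBLAS manual as:

[codebox]
/* Column-major indexing from the CUBLAS manual:
   element (i, j) of a matrix with leading dimension ld. */
#define IDX2C(i,j,ld) (((j)*(ld))+(i))
[/codebox]

The kernel itself: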
[codebox]
__global__ void kernel_mat(int *n, const float *x, int *incx, const float *y, int *incy,
                           int *num_blocks, int *i, int *i1, float *factor, float *gpu_data)
{
    // Flatten the 2D grid into a single thread index.
    const int tid = (blockIdx.x * blockDim.x + threadIdx.x) + (blockIdx.y * gridDim.x);
    if (tid < (*num_blocks)) {
        int k_incx = *incx;
        int k_incy = *incy;
        int k_n    = *n;
        int k_i    = *i;
        int k_i1   = *i1;
        float result = 0.0f;
        // device_acc accumulates the dot product of the two strided
        // column segments into result.
        device_acc(k_n, &x[IDX2C(k_i, k_i1, c_num_equations)], k_incx,
                   &y[IDX2C(k_i1 + tid, k_i1, c_num_equations)], k_incy,
                   result);
        gpu_data[tid] = result / (*factor);
    }
}
[/codebox]
Other options include changing the order of summation (e.g., sorting and summing from smallest to largest) or using a reduction, which also reduces error by tending to keep the summed terms at approximately the same order of magnitude (see the sketch below).
Multiplication in floating point grows relative error at a low and predictable rate (unless you are over- or underflowing). It is the addition of two values of very different magnitude that grows the error quickly, and this is often the case in long sums.
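A minimal sketch of such a tree reduction in shared memory, assuming a single block whose size is a power of two (kernel name and launch parameters are illustrative, not from the code above):

[codebox]
// Pairwise (tree) reduction within one block. At every step each
// addition combines two partial sums built from the same number of
// terms, so the operands tend to stay at comparable magnitudes and
// round-off grows roughly like log2(n) instead of n.
__global__ void reduce_sum(const float *in, float *out, int n)
{
    extern __shared__ float s[];
    int tid = threadIdx.x;
    s[tid] = (tid < n) ? in[tid] : 0.0f;   // pad with zeros past the end
    __syncthreads();
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride)
            s[tid] += s[tid + stride];
        __syncthreads();
    }
    if (tid == 0)
        out[0] = s[0];
}
[/codebox]

Launched as, e.g., reduce_sum<<<1, 256, 256 * sizeof(float)>>>(d_in, d_out, n) for n up to 256; larger arrays need one such pass per block plus a second pass over the per-block partial sums.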
Cool. Good to know. Many thanks for this; I will give it a shot tomorrow. The reason I was wondering is that if you look at the first calculation in the output:
[codebox]
GPU:
x             y            x * y
188.459503:   -0.000000:   -0.000022

CPU:
x             y            x * y
188.459503:   -0.000000:   -0.000007
[/codebox]
I am not sure whether using printf() with higher precision would tell me much, but maybe there is an underflow going on somewhere.
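For instance, printing with %.9g (enough digits to round-trip a 32-bit float, done host-side after copying the values back) would at least show whether that -0.000000 is a true zero or something tiny:

[codebox]
printf("%.9g: %.9g: %.9g\n", x, y, x * y);  /* "-0.000000" may really be, say, -1e-7 */
[/codebox]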
Anyway, I will implement this tomorrow and see what happens.
One last general question:
Does CUBLAS do anything like this to reduce computation error? Say I have a long vector and I use cublasSdot to compute the dot product. Would the terms be summed with such a technique?
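For reference, the call in question, using the legacy (pre-v2) CUBLAS API, where d_x and d_y are device pointers:

[codebox]
/* Single-precision dot product over n elements with unit strides.
   The CUBLAS documentation does not specify the internal summation
   order, so whether any compensation is applied is not stated. */
float d = cublasSdot(n, d_x, 1, d_y, 1);
[/codebox]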