cuda float point precision

small_potato · November 5, 2010, 6:42am

Can someone comment on this,

I want to do a vector dot product. My float vector are [2080:2131] and [2112:2163], each one of them contains 52 elements.

a[52] = {2080 2081 2082 ... ... 2129 2130 2131};

b[52] = {2112 2113 2114 ... ... 2161 2162 2163};

for (int i = 0; i < 52; i++)

{

    sum += a[i]*b[i];

}

The result sum for whole length (52 element)was 234038032 by my kernel while matlab gave 234038038. For 1 to 9 element sum of product, my kernel result agrees with matlab result. For 10 element sum, it is off by 1 and gradually increases. The results were reproducible. I checked all the elements and found no problem.

small_potato · November 5, 2010, 6:42am

Can someone comment on this,

I want to do a vector dot product. My float vector are [2080:2131] and [2112:2163], each one of them contains 52 elements.

a[52] = {2080 2081 2082 ... ... 2129 2130 2131};

b[52] = {2112 2113 2114 ... ... 2161 2162 2163};

for (int i = 0; i < 52; i++)

{

    sum += a[i]*b[i];

}

The result sum for whole length (52 element)was 234038032 by my kernel while matlab gave 234038038. For 1 to 9 element sum of product, my kernel result agrees with matlab result. For 10 element sum, it is off by 1 and gradually increases. The results were reproducible. I checked all the elements and found no problem.

njuffa · November 5, 2010, 8:34am

Each of the partial products requires about 22 bits to store exactly. A single-precision number has 24 mantissa bits. So one can add four 22-bit numbers before the result can no longer be stored exactly, but needs to be rounded during each subsequent addition. As to why the result of this computation on the GPU differs from MATLAB, there are two possibilities (both of which may occur simultaneously):

(1) MATLAB may perform intermediate computation in double precision (even if the final result is single precision, the accumulation of the dot product may happen in a higher precision)
(2) The CUDA compiler is extremely likely to map each step of dot-product accumulation to an FMAD (sm_1x) or FMA (sm_2x). In either case the result will likely differ from first computing the partial product with an FMUL, then accumulating with an FADD. The FMAD differs from an FMUL/FADD sequence (two roundings) by computing the product, truncating the result to single-precision, then performing an addition with rounding. The FMA (fused multiply-add) differs from an FMUL/FADD sequence by computing the full double-wide product, then adding and rounding (i.e. single rounding). This latter variant is on average the most accurate for a given precision.

To eliminate any effects from item (2) for the sake of experiment, you could recode the dot product accumulation step as

sum = __fadd_rn(__fmul_rn(a[i], b[i]), sum);

njuffa · November 5, 2010, 8:34am

Each of the partial products requires about 22 bits to store exactly. A single-precision number has 24 mantissa bits. So one can add four 22-bit numbers before the result can no longer be stored exactly, but needs to be rounded during each subsequent addition. As to why the result of this computation on the GPU differs from MATLAB, there are two possibilities (both of which may occur simultaneously):

(1) MATLAB may perform intermediate computation in double precision (even if the final result is single precision, the accumulation of the dot product may happen in a higher precision)
(2) The CUDA compiler is extremely likely to map each step of dot-product accumulation to an FMAD (sm_1x) or FMA (sm_2x). In either case the result will likely differ from first computing the partial product with an FMUL, then accumulating with an FADD. The FMAD differs from an FMUL/FADD sequence (two roundings) by computing the product, truncating the result to single-precision, then performing an addition with rounding. The FMA (fused multiply-add) differs from an FMUL/FADD sequence by computing the full double-wide product, then adding and rounding (i.e. single rounding). This latter variant is on average the most accurate for a given precision.

To eliminate any effects from item (2) for the sake of experiment, you could recode the dot product accumulation step as

sum = __fadd_rn(__fmul_rn(a[i], b[i]), sum);

Topic		Replies	Views
precision CUDA Programming and Performance	3	2614	December 16, 2008
CUDA dot product atomics problem CUDA Programming and Performance	4	1848	February 26, 2012
Prevent ncc from applying MADD optimization precision, IEEE 754 CUDA Programming and Performance	11	10994	April 6, 2009
Several questions on cuda (arithmetic, rounding, for loop ad performance) CUDA Programming and Performance	8	3330	April 13, 2023
float/int question CUDA Programming and Performance	3	1509	February 22, 2009
floating point precision on CUDA CUDA Programming and Performance	11	14713	June 8, 2010
solved: cuda single precision code accumulates error while glsl/vulkan works, why? CUDA Programming and Performance	5	937	November 5, 2018
Precision issue! Wrong result for a multiplication CUDA Programming and Performance	7	1355	April 11, 2012
Maximum native precision or mixed-precision to have 21 digit floating point type? CUDA Programming and Performance mixed-precision	1	670	January 14, 2022
Precision Problem CUDA Programming and Performance	2	1725	June 21, 2009

cuda float point precision

Related topics