CUDA floating point precision

Can someone comment on this?

I want to do a vector dot product. My float vectors are [2080:2131] and [2112:2163]; each of them contains 52 elements.

float a[52] = {2080, 2081, 2082, ..., 2129, 2130, 2131};
float b[52] = {2112, 2113, 2114, ..., 2161, 2162, 2163};

float sum = 0.0f;
for (int i = 0; i < 52; i++)
{
    sum += a[i] * b[i];
}

The sum over the whole length (52 elements) was 234038032 from my kernel, while MATLAB gave 234038038. For sums of products over 1 to 9 elements, my kernel result agrees with the MATLAB result. For the 10-element sum it is off by 1, and the discrepancy gradually increases. The results were reproducible. I checked all the elements and found no problem.
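For reference, here is a minimal single-thread sketch that reproduces the arithmetic above on the GPU (the kernel name and launch configuration are just for illustration, not my actual code):

__global__ void dot52(const float *a, const float *b, float *result)
{
    // sequential single-precision accumulation, same arithmetic as the loop above
    float sum = 0.0f;
    for (int i = 0; i < 52; i++)
    {
        sum += a[i] * b[i];
    }
    *result = sum;
}

// launched as: dot52<<<1, 1>>>(d_a, d_b, d_result);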

Each of the partial products requires about 22 bits to store exactly. A single-precision number has 24 mantissa bits, so one can add four such 22-bit numbers before the result can no longer be stored exactly and must be rounded during each subsequent addition. As to why the result of this computation on the GPU differs from MATLAB, there are two possibilities (both of which may occur simultaneously):

(1) MATLAB may perform the intermediate computation in double precision (even if the final result is single precision, the accumulation of the dot product may happen in a higher precision); see the sketch after this list.
(2) The CUDA compiler is extremely likely to map each step of dot-product accumulation to an FMAD (sm_1x) or FMA (sm_2x). In either case the result will likely differ from first computing the partial product with an FMUL, then accumulating with an FADD. The FMAD differs from an FMUL/FADD sequence (two roundings) by computing the product, truncating the result to single-precision, then performing an addition with rounding. The FMA (fused multiply-add) differs from an FMUL/FADD sequence by computing the full double-wide product, then adding and rounding (i.e. single rounding). This latter variant is on average the most accurate for a given precision.
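Regarding item (1), one quick experiment is to accumulate in double precision on the GPU as well. A minimal sketch, assuming a device with double-precision support (sm_13 or later):

double sum = 0.0;
for (int i = 0; i < 52; i++)
{
    // promote to double: each product and each partial sum is then exactly representable
    sum += (double)a[i] * (double)b[i];
}
float result = (float)sum;   // single rounding back to float at the very end

Incidentally, the mathematically exact value of this particular dot product is 234038038, which is what MATLAB reports, so double-precision intermediate computation on the MATLAB side is quite plausible.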

To eliminate any effects from item (2) for the sake of experiment, you could recode the dot product accumulation step as

sum = __fadd_rn(__fmul_rn(a[i], b[i]), sum);
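In context, the whole accumulation loop would then read as follows. This is just a sketch; the point is that the __fmul_rn() and __fadd_rn() intrinsics are never contracted into an FMAD/FMA by the compiler, so every iteration performs a separately rounded multiply and add:

float sum = 0.0f;
for (int i = 0; i < 52; i++)
{
    // explicit round-to-nearest-even multiply, then round-to-nearest-even add
    sum = __fadd_rn(__fmul_rn(a[i], b[i]), sum);
}

If that matches a single-precision CPU computation, the difference you observed comes from the FMAD/FMA contraction rather than from a bug in your kernel. (Newer toolkit versions also have an nvcc switch, --fmad=false if I recall correctly, that disables this contraction for the whole compilation unit.)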
