Each of the partial products requires about 22 bits to store exactly. A single-precision number has 24 mantissa bits, so one can add four such 22-bit numbers and still store the result exactly (the sum needs at most 24 bits); each subsequent addition may have to be rounded. As to why the result of this computation on the GPU differs from MATLAB's, there are two possible causes (both of which may occur simultaneously):

(1) MATLAB may perform the intermediate computation in double precision: even if the final result is stored in single precision, the accumulation of the dot product may happen at higher precision.

(2) The CUDA compiler is extremely likely to map each step of the dot-product accumulation to an FMAD (sm_1x) or an FMA (sm_2x). In either case the result will likely differ from first computing the partial product with an FMUL, then accumulating with an FADD (two roundings). The FMAD computes the product, truncates it to single precision, then performs the addition with rounding. The FMA (fused multiply-add) computes the full double-wide product, then performs the addition with a single rounding. This latter variant is on average the most accurate of the three for a given precision.

To eliminate any effects from item (2) for the sake of experiment, you could recode the dot-product accumulation step as

sum = __fadd_rn(__fmul_rn(a[i], b[i]), sum);