#include <cuda_fp16.h>

__global__ void vector_add(float *out, float *a, float *b, float *c, float *d, int n) {
    // Variant 1: FP16 products stored in temporaries, FP32 subtraction
    __half tmp1 = __hmul(__float2half(a[0]), __float2half(b[0]));
    __half tmp2 = __hmul(__float2half(c[0]), __float2half(d[0]));
    out[0] = __half2float(tmp1) - __half2float(tmp2);
    // Variant 2: everything in one statement, FP16 subtraction
    // out[0] = __half2float(__hsub(__hmul(__float2half(a[0]), __float2half(b[0])),
    //                              __hmul(__float2half(c[0]), __float2half(d[0]))));
}
In the code snippet above, if I use the temporary variables tmp1 and tmp2 to store the intermediate results, the final result differs from the one produced by the single-statement expression that is commented out in the code.
This subtraction,
__half2float(tmp1) - __half2float(tmp2)
is a single-precision (FP32) subtraction, whereas this subtraction,
__hsub ( <fp16_expression_1>, <fp16_expression_2> )
is a half-precision (FP16) subtraction. For many values of <fp16_expression_1> and <fp16_expression_2>, these are not equivalent. I would expect the first method to deliver more accurate results on average, especially when the magnitudes of the two expressions are quite different.
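As a minimal sketch of that difference (not the original poster's code): the kernel name compare_sub and the input values are my own choices for illustration, and I am assuming a GPU with native FP16 arithmetic (compute capability 5.3 or later, e.g. built with nvcc -arch=sm_70). Both products are exactly representable in FP16 here, so any discrepancy comes from the subtraction alone.

```cuda
#include <cstdio>
#include <cuda_fp16.h>

// Hypothetical demo kernel: same FP16 products, two different subtractions.
__global__ void compare_sub(float *out, float a, float b, float c, float d) {
    __half p1 = __hmul(__float2half(a), __float2half(b));
    __half p2 = __hmul(__float2half(c), __float2half(d));
    out[0] = __half2float(p1) - __half2float(p2);  // FP32 subtraction
    out[1] = __half2float(__hsub(p1, p2));         // FP16 subtraction
}

int main() {
    float *d_out = nullptr;
    float h_out[2] = {0.0f, 0.0f};
    cudaMalloc(&d_out, sizeof(h_out));

    // 64 * 32 = 2048 and 0.5 * 1 = 0.5 are both exact in FP16.
    compare_sub<<<1, 1>>>(d_out, 64.0f, 32.0f, 0.5f, 1.0f);
    cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost);

    printf("FP32 subtraction: %f\n", h_out[0]);
    printf("FP16 subtraction: %f\n", h_out[1]);

    cudaFree(d_out);
    return 0;
}
```

With these inputs I would expect the FP32 path to give 2047.5, while the FP16 path cannot represent 2047.5 (the spacing of FP16 values just below 2048 is 1) and should round to 2048.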
Beyond this specific issue, be aware that floating-point arithmetic is not associative, and that therefore mathematically equivalent expressions are generally not equivalent when evaluated with finite-precision floating-point arithmetic.
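To illustrate the non-associativity with FP16 operands, here is a small sketch. The kernel name assoc_demo and the constants are assumptions of mine, chosen so that the effect shows up with exactly representable inputs; it again assumes a GPU with native FP16 arithmetic.

```cuda
#include <cstdio>
#include <cuda_fp16.h>

// Hypothetical demo kernel: the same three addends, summed in two orders.
__global__ void assoc_demo(float *out) {
    const __half big = __float2half(2048.0f);
    const __half one = __float2half(1.0f);
    out[0] = __half2float(__hadd(__hadd(big, one), one));  // (big + 1) + 1
    out[1] = __half2float(__hadd(big, __hadd(one, one)));  // big + (1 + 1)
}

int main() {
    float *d_out = nullptr;
    float h_out[2] = {0.0f, 0.0f};
    cudaMalloc(&d_out, sizeof(h_out));

    assoc_demo<<<1, 1>>>(d_out);
    cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost);

    printf("(2048 + 1) + 1 = %f\n", h_out[0]);
    printf("2048 + (1 + 1) = %f\n", h_out[1]);

    cudaFree(d_out);
    return 0;
}
```

Since FP16 values between 2048 and 4096 are spaced 2 apart, I would expect the first ordering to lose both increments and yield 2048, while the second yields 2050.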
Furthermore, for the second expression the compiler is likely to contract the multiply-subtract sequence into an FP16 fused multiply-add (FMA) operation, leading to further numerical differences. This contraction does not apply to the first example, since the products and the subtraction are computed in different precisions, with a type conversion in between. If the FMA is generated (use cuobjdump --dump-sass to find out), the average accuracy of the two variants is likely very similar (I have not thought through all potential cases). However, the results would still not be a bit-wise match to the first variant, since inside the FMA one of the products is used exactly, without intermediate rounding, while the other product is rounded to FP16.
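The effect of contraction can be made visible by writing both forms explicitly, rather than relying on what the compiler happens to generate. In this sketch the kernel name fma_demo and the inputs are my own assumptions; __hmul_rn is the cuda_fp16.h intrinsic documented as preventing contraction into an FMA, so it requires a toolkit recent enough to provide it. Which of the two products the compiler would actually fuse in the original code is not something this sketch determines.

```cuda
#include <cstdio>
#include <cuda_fp16.h>

// Hypothetical demo kernel: a*b - c*d with and without a fused multiply-add.
__global__ void fma_demo(float *out, float a, float b, float c, float d) {
    const __half ha = __float2half(a), hb = __float2half(b);
    const __half hc = __float2half(c), hd = __float2half(d);

    // Both products rounded to FP16, then an FP16 subtraction (no contraction).
    const __half ab = __hmul_rn(ha, hb);
    const __half cd = __hmul_rn(hc, hd);
    out[0] = __half2float(__hsub(ab, cd));

    // a*b enters the FMA unrounded; only c*d is rounded to FP16 beforehand.
    out[1] = __half2float(__hfma(ha, hb, __hneg(cd)));
}

int main() {
    float *d_out = nullptr;
    float h_out[2] = {0.0f, 0.0f};
    cudaMalloc(&d_out, sizeof(h_out));

    // a = b = 1 + 2^-10, so a*b = 1 + 2^-9 + 2^-20 is not representable in FP16.
    // c = 1 + 2^-9 and d = 1, so c*d is exact.
    fma_demo<<<1, 1>>>(d_out, 1.0009765625f, 1.0009765625f, 1.001953125f, 1.0f);
    cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost);

    printf("separate mul + sub: %g\n", h_out[0]);
    printf("fused multiply-add: %g\n", h_out[1]);

    cudaFree(d_out);
    return 0;
}
```

With these inputs I would expect the unfused variant to return 0, because the rounded product a*b equals c*d, while the fused variant keeps the 2^-20 tail of the exact product and returns a small nonzero value.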
@njuffa Massive thanks for your reply. In fact, I made a mistake when copying the code. In the original version of the first example, the subtraction is computed in FP16 precision; the version in this post is a debugging version in which I tried to reproduce the output of the second example. However, the results of the two examples are not identical regardless of the precision used for the subtraction in the first example, because the difference is mainly caused by the product being computed with a separate multiply in one case and as part of an FMA in the other.
For debugging purposes, you could turn off FMA merging by using the command line switch -fmad=false with nvcc. However, in general FMA generation should remain enabled for best average accuracy and best performance.
@njuffa I really learned a lot. Thank you very much!