How did you compute that? I get 164.5. For half-way cases round
rounds away from zero, so the result is 165.
Now, when I do this on the GPU using code compiled with the CUDA 11.8 toolchain, I get 164.499985, which rounds to 164. Note that the intermediate result differs by 1 ulp from 164.5.
As opposed to math, floating-point arithmetic is not associative, so the order of operations can have an influence on the result. By default, the CUDA compiler turns on FMA contraction (same as the user specifying -fmad=true
). In this case, it turns the computation into:
gray = __fmaf_rn (float(b), 0.114f,
__fmaf_rn (float(r), 0.299f,
__fmul_rn (float(g), 0.587f)));
In general, FMA contraction is good for performance (in the above, we need only three floating-point operations instead of five), and on average it is also good for accuracy, as it reduces the overall rounding error and provides some protection against subtractive cancellation. But it can also cause results to differ from a desired reference result, as is the case here.
Instead of turning off FMA contraction altogether (-fmad=false
), CUDA programmers can suppress it locally by the use of intrinsics, which the compiler will no include in FMA merging. So in this case we could code:
gray = (__fmul_rn (float(r), 0.299f) +
__fmul_rn (float(g), 0.587f) +
__fmul_rn (float(b), 0.114f));
I also get the desired result for the particular case from the original post when I code this FMA-based version, but of course it could result in tiny differences for other input combinations:
gray = __fmaf_rn (float(r), 0.299f,
__fmaf_rn (float(g), 0.587f,
__fmul_rn (float(b), 0.114f)));
In general, it is not a good idea to rely on bit-identical processing in floating-point computation, but I understand it may sometimes be necessary to match a “golden” reference. Not sure whether that is the case here: Does anything bad happen if the grayscale value comes out as 164 instead of 165? A human is unlikely to notice.
I am appending my little test program for reference.
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
__global__ void kernel (unsigned char r, unsigned char g, unsigned char b)
{
float gray = (float(r) * 0.299f + float(g) * 0.587f + float(b) * 0.114f);
printf ("gray=%15.8e\n", gray);
gray = __fmaf_rn (float(b), 0.114f,
__fmaf_rn (float(r), 0.299f,
__fmul_rn (float(g), 0.587f)));
printf ("gray=%15.8e\n", gray);
gray = (__fmul_rn (float(r), 0.299f) +
__fmul_rn (float(g), 0.587f) +
__fmul_rn (float(b), 0.114f));
printf ("gray=%15.8e\n", gray);
gray = __fmaf_rn (float(r), 0.299f,
__fmaf_rn (float(g), 0.587f,
__fmul_rn (float(b), 0.114f)));
printf ("gray=%15.8e\n", gray);
}
int main (void)
{
kernel<<<1,1>>>(172, 160, 168);
cudaDeviceSynchronize();
return EXIT_SUCCESS;
}