On tackling floating-point precision issues in CUDA

I am not aware of an fmad() intrinsic in CUDA. I would suggest using the C++ standard functions fma() and fmaf() as needed, as such code should be portable between host and device. Occasionally, explicit use of the device intrinsics __fma_rn() and __fmaf_rn() may be useful.
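
As a minimal sketch (my own illustration, not part of the original answer), the snippet below shows both approaches side by side: fmaf() called in device code exactly as it would be on the host, and the device-only intrinsic __fmaf_rn(), which requests a single-precision FMA with round-to-nearest-even. The kernel name and the operand values are just placeholders, and error checking is omitted for brevity.

```cpp
#include <cstdio>
#include <cmath>

__global__ void fma_kernel (float a, float b, float c, float *result)
{
    result[0] = fmaf (a, b, c);        // standard math function, also usable in host code
    result[1] = __fmaf_rn (a, b, c);   // device-only intrinsic, round to nearest even
}

int main (void)
{
    float h_result[2], *d_result = 0;
    cudaMalloc (&d_result, sizeof (h_result));
    fma_kernel<<<1,1>>> (3.0f, 5.0f, 7.0f, d_result);
    cudaMemcpy (h_result, d_result, sizeof (h_result), cudaMemcpyDeviceToHost);
    printf ("device fmaf()      = %g\n", h_result[0]);
    printf ("device __fmaf_rn() = %g\n", h_result[1]);
    printf ("host   fmaf()      = %g\n", fmaf (3.0f, 5.0f, 7.0f)); // same standard function on the host
    cudaFree (d_result);
    return 0;
}
```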

I concur that educating oneself about the advantages of the fused multiply-add operation is highly recommended in general, as it can be a powerful tool in numerical codes: it computes a*b+c with a single rounding at the end, so the intermediate product is not rounded. Knowledge about it is not as widespread among programmers as it should be, given that “all” modern processor architectures (both CPUs and GPUs) support the operation in hardware.
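
To make that advantage concrete, here is a small sketch (again my own illustration, not from the original discussion) of the single-rounding property. Because fmaf(a, b, c) rounds only once, fmaf(a, a, -p) recovers the exact rounding error of the product p = a*a, something a separate multiply followed by a subtract cannot do. The particular operand value is chosen only to make the residual easy to predict.

```cpp
#include <cstdio>
#include <cmath>

int main (void)
{
    float a = 1.000244140625f;        // 1 + 2^-12, exactly representable in float
    float p = a * a;                  // rounded product; exact value is 1 + 2^-11 + 2^-24
    float err = fmaf (a, a, -p);      // exact residual a*a - p, thanks to the single rounding
    printf ("p   = %.9g\n", p);
    printf ("err = %.9g\n", err);     // prints 5.96046448e-08, i.e. 2^-24
    return 0;
}
```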