The native instructions are faster than the regular ones, as pointed out in the CUDA C Programming Guide. I'm not sure whether CUDA Fortran uses these functions, or whether there is a way for the user to choose the native instructions rather than the normal ones?
(Sect. 5.4.1)
Single-Precision Floating-Point Addition and Multiplication Intrinsics
__fadd_r[d,u], __fmul_r[d,u], and __fmaf_r[n,z,d,u] (see Section C.2.1) compile to tens of instructions for devices of compute capability 1.x, but map to a single native instruction for devices of compute capability 2.0.
Single-Precision Floating-Point Division
__fdividef(x, y) (see Section C.2.1) provides faster single-precision floating-point division than the division operator.
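For reference, this is roughly how I understand these intrinsics are called from CUDA C. It is only a minimal sketch: the kernel name `compare_intrinsics`, the array names, and the test values are made up for illustration, and it only puts the regular operator and the intrinsic side by side so the results can be compared.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel (name and arguments are illustrative only):
// evaluates the regular division next to the intrinsics quoted above.
__global__ void compare_intrinsics(const float *x, const float *y,
                                   float *quot, float *fast_quot,
                                   float *sum_down, float *prod_up, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        quot[i]      = x[i] / y[i];             // regular division operator
        fast_quot[i] = __fdividef(x[i], y[i]);  // faster, less accurate division
        sum_down[i]  = __fadd_rd(x[i], y[i]);   // add, rounding toward -infinity
        prod_up[i]   = __fmul_ru(x[i], y[i]);   // multiply, rounding toward +infinity
    }
}

int main()
{
    const int n = 4;
    const size_t bytes = n * sizeof(float);
    float hx[n] = {1.0f, 2.0f, 3.0f, 4.0f};     // arbitrary test values
    float hy[n] = {3.0f, 7.0f, 9.0f, 11.0f};
    float hout[4][n];

    // Device buffers: two inputs and four outputs.
    float *dx = NULL, *dy = NULL, *dout[4] = {NULL, NULL, NULL, NULL};
    cudaMalloc((void **)&dx, bytes);
    cudaMalloc((void **)&dy, bytes);
    for (int k = 0; k < 4; ++k)
        cudaMalloc((void **)&dout[k], bytes);

    cudaMemcpy(dx, hx, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dy, hy, bytes, cudaMemcpyHostToDevice);

    compare_intrinsics<<<1, n>>>(dx, dy, dout[0], dout[1], dout[2], dout[3], n);

    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess) {
        fprintf(stderr, "kernel launch failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // cudaMemcpy waits for the kernel to finish before copying.
    for (int k = 0; k < 4; ++k)
        cudaMemcpy(hout[k], dout[k], bytes, cudaMemcpyDeviceToHost);

    for (int i = 0; i < n; ++i)
        printf("x=%g y=%g  x/y=%.9g  __fdividef=%.9g  __fadd_rd=%.9g  __fmul_ru=%.9g\n",
               hx[i], hy[i], hout[0][i], hout[1][i], hout[2][i], hout[3][i]);

    cudaFree(dx);
    cudaFree(dy);
    for (int k = 0; k < 4; ++k)
        cudaFree(dout[k]);
    return 0;
}
```

What I would like to know is whether there is an equivalent way to call these from a CUDA Fortran kernel.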
Thanks
Tuan