Without -use_fast_math, I see that the saturation is successfully merged with the addition for result1. This looks like a case where the merging does not happen with the .ftz variant of fadd. I cannot think of a reason for this. The optimization may erroneously not be applied for variants other than the default variant of fadd. I would file a bug for this.
I don’t think the expectation is realistic because 6.2831854820251464844 * 0.15915493667125701904 != 1.0. As for why the two multiplications are not merged: there may be a phase-ordering issue where __cosf is expanded late (the expansion is GPU architecture specific; on older architectures this expands into RRO, MUFU.COS), after constant propagation. In addition, other than for FMA merging, the CUDA compiler used to be quite conservative regarding the merging of floating-point operations (consider issues of intermediate overflow, for example, which could make the behavior quite different between merged and unmerged versions here). Given that -use_fast_math is specified in the actual use case, one might argue that the merging of the two FMULs is appropriate, because adherence to “as-if” requirements is relaxed with that compilation flag. Consider filing an enhancement request (RFE) for this.
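To make the two arguments above concrete, here is a small host-side sketch in Python (not CUDA, and not a model of what the compiler does internally). It emulates binary32 multiplication by rounding exact binary64 products, assuming round-to-nearest and no flush-to-zero. The helper names `f32`, `fmul32`, `exact_product`, `unmerged`, and `merged` are made up for this illustration.

```python
import math
import struct
from fractions import Fraction


def f32(x):
    """Round a binary64 value to the nearest binary32 value."""
    try:
        return struct.unpack("<f", struct.pack("<f", x))[0]
    except OverflowError:
        # struct rejects finite values that would round to infinity in
        # binary32; mimic IEEE-754 round-to-nearest by returning +/-inf.
        return math.copysign(math.inf, x)


def fmul32(a, b):
    """Emulate one binary32 multiply. The binary64 product of two binary32
    operands is exact, so only the final rounding to binary32 occurs."""
    return f32(a * b)


# The two constants quoted above, as binary32 values.
TWO_PI = f32(6.2831854820251464844)
INV_TWO_PI = f32(0.15915493667125701904)


def exact_product():
    """Exact rational product of the two binary32 constants."""
    return Fraction(TWO_PI) * Fraction(INV_TWO_PI)


def unmerged(x):
    """Two separate multiplies: (x * 2pi) * (1 / 2pi)."""
    return fmul32(fmul32(x, TWO_PI), INV_TWO_PI)


def merged(x):
    """Hypothetical merged form: x * (2pi * (1 / 2pi)), constants folded."""
    return fmul32(x, fmul32(TWO_PI, INV_TWO_PI))


if __name__ == "__main__":
    print("exact product == 1 ?", exact_product() == 1)
    big = f32(1e38)  # finite in binary32, but big * 2pi exceeds FLT_MAX
    print("unmerged(big) =", unmerged(big))
    print("merged(big)   =", merged(big))
```

The first check shows that the product of the two constants is not exactly one, so dropping both multiplications is not an identity transformation. The second shows the intermediate-overflow case: for a large but finite input the first multiply in the unmerged form overflows, while the merged form stays finite.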
As for result3, collapsing this into code equivalent to that for result4 is the kind of reasoning that is trivial for a human to do, but probably poses interesting issues inside a compiler. A specialized peephole optimization seems feasible, but adding to an ever-growing list of peephole optimizations may not be desirable. A more generalized approach based on range tracking of floating-point data is likely challenging and expensive, with not much resulting speed-up on average, so it would be an unfavorable trade-off. You might want to consider filing an enhancement request (RFE), though. The CUDA compiler team may have a different take than the one I provided here.
Thanks for the feedback.
I was wondering what the counterexample for result2 would be, and the possibility of intermediate overflow makes sense.
In my use case I am calling cos(pi*saturate(x)), but the multiplications are still not merged. I’ll look into filing some bug/enhancement requests.