Has the CUDA profiler actually identified this computation as a bottleneck in your code? BTW, the generated code you are showing appears to be machine code (SASS), not intermediate code (PTX).
In floating-point arithmetic, generally a * (1/a) != 1, so the multiplications don’t actually “cancel”. Also, there is the chance of intermediate overflow or underflow, leading to drastically different results. In other words, the transformation envisioned violates the as-if rule of optimizations generally used with C++.
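To make the overflow point concrete, here is a minimal host-side illustration (my own example, not taken from your code) where the two "cancelling" multiplies produce something completely different from the original value:

```
// Minimal host-side illustration: with an intermediate overflow,
// x * c * (1/c) is nowhere near x, so the compiler cannot "cancel" the multiplies.
#include <cstdio>

int main (void)
{
    const float c    = 6.2831853f;   // ~2*pi
    const float invc = 0.15915494f;  // ~1/(2*pi)
    float x = 3.0e38f;               // representable in float
    float y = x * c;                 // overflows to +inf (FLT_MAX ~ 3.4e38)
    float z = y * invc;              // inf * 0.159... = inf, not x
    printf ("x=%15.8e  x*c*(1/c)=%15.8e\n", x, z);
    return 0;
}
```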
One could imagine the desired transformation taking place with a “relaxed” optimization setting, in particular --use_fast_math. Have you tried that? A practical problem I see is how much error tolerance to allow as part of such a transformation, e.g. should x * 0.33 * 3 likewise be simplified to x?
Yes, I am already compiling with the --use_fast_math flag.
The sin and cos are in a critical path that is run 111,000 times over 1024*1024 (~1M) floats. So the program could save approximately 222,000,000,000 multiply operations.
Is there a way to use inline PTX to emit sin.approx.f32 without the compiler generating the multiply by 0.159… (1/(2*pi)) before the sin or cos instruction?
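Something along these lines is what I had in mind; sin_approx is just my own wrapper name:

```
// Sketch of an inline-PTX wrapper (helper name is my own)
__device__ __forceinline__ float sin_approx (float x)
{
    float y;
    asm ("sin.approx.f32 %0, %1;" : "=f"(y) : "f"(x));
    return y;
}
```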
My point was: Saving those FMUL instructions does not necessarily translate into a commensurate reduction in execution time. If the code shown is in fact limiting performance, my guess would be that it’s limited by MUFU throughput.
Based on the article that was published about the design of the MUFU (multi-function unit) it works with fixed-point computation internally, so there needs to be a conversion from single-precision floating-point to the internal format it uses. In earlier architectures, this was done with a RRO.SINCOS (range reduction operation for trig functions), but from your post it seems that sm_75 changed the interface for MUFU.SIN slightly so that a simple FMUL can be used instead (this was, frankly, news to me).
This means you cannot get rid of that FMUL, as PTX’s sin.approx.f32 essentially translates to a canned two-instruction sequence FMUL/MUFU.SIN at SASS level.
If you can formulate a well-reasoned proposal on how consecutive floating-point multiplications by constants could be reduced to a single multiply (or even eliminated in the best case when the resulting multiplier is 1) as part of, or in combination with, --use_fast_math, consider filing an enhancement request with NVIDIA.
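For reference, the kind of pattern such a proposal would cover looks roughly like this (my own minimal example, not your code). Per the discussion above, __sinf() / sin.approx.f32 becomes FMUL (by 1/(2*pi)) plus MUFU.SIN on sm_75, so the source-level multiply by 2*pi and the compiler-generated multiply by 1/(2*pi) end up back to back and could in principle be folded:

```
// My own minimal example of the pattern under discussion. With --use_fast_math,
// the multiply by TWO_PI below is immediately followed by the compiler-generated
// multiply by 1/(2*pi) that feeds MUFU.SIN on sm_75.
__global__ void phase_kernel (float *out, const float *in, int n)
{
    const float TWO_PI = 6.2831853f;
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        out[i] = __sinf (in[i] * TWO_PI);
    }
}
```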
I was considering proposing an extension to the existing API (which already has sinpif() etc.): a __sin2pif() that on new architectures would generate code without the multiply, and on older architectures would multiply by 2*pi (as a float), do the RRO, and then the sin.
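Roughly what I mean, as a sketch (the name __sin2pif does not exist in CUDA; my_sin2pif below just shows what it would compute and the redundant multiplies a native version could drop):

```
// What a hypothetical __sin2pif(x) = sin(2*pi*x) has to be emulated as today:
// multiply by 2*pi, then __sinf() multiplies by 1/(2*pi) again before MUFU.SIN.
// A native intrinsic could drop both multiplies on sm_75 and later.
__device__ __forceinline__ float my_sin2pif (float x)
{
    return __sinf (x * 6.2831853f);   // 2*pi rounded to float
}
```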
But I’ll wait.
Since CUDA already has math functions sinpif() and cospif() that programmers would normally use for use cases like yours, an alternative proposal might be to add corresponding device-function intrinsics __sinpif(), __cospif(), __sincospif().
These could then be mapped to MUFU.SIN, MUFU.COS using a multiplier of 0.5 instead of 1/(2*PI). That would make combination with other multipliers easier when --use_fast_math is in effect, as multiplications by powers of two are exact except in the case of denormals (subnormals), which don’t come into play here because --use_fast_math includes -ftz=true.
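As a quick sanity check of the "powers of two are exact" argument (my own host-side example, not part of any proposal):

```
// Multiplying a normal float by 0.5f only changes the exponent, so it is exact;
// the halve-then-double round trip below therefore always reproduces x exactly.
#include <assert.h>

int main (void)
{
    for (float x = 1.0e-30f; x < 1.0e30f; x *= 1.7f) {
        assert ((x * 0.5f) * 2.0f == x);
    }
    return 0;
}
```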
In any event, nothing is going to happen until an RFE (request for enhancement) is filed with NVIDIA. At that point the compiler and floating-point experts at NVIDIA could start thinking about all possible consequences of such a proposal and either adopt it for future implementation or shoot it down because of an issue we haven’t thought about yet (design by forum posts usually doesn’t work too well :-).