I have code that uses heavily __sinf and __cosf (or __sincosf).
When I compile this
__device__ float sin2pi(float a)
the PTX will be
code for sm_75
Function : _Z6sin2pif
.headerflags @"EF_CUDA_SM75 EF_CUDA_PTX_SM(EF_CUDA_SM75)"
/*0000*/ FMUL.FTZ R4, R4, 6.2831854820251464844 ; /* 0x40c90fdb04047820 */
/* 0x000fc80000410000 */
/*0010*/ FMUL.RZ R4, R4, 0.15915493667125701904 ; /* 0x3e22f98304047820 */
/* 0x000fcc000040c000 */
/*0020*/ MUFU.SIN R4, R4 ; /* 0x0000000400047308 */
/* 0x000e240000000400 */
/*0030*/ RET.ABS.NODEC R20 0x0 ; /* 0x0000000014007950 */
/* 0x001fea0003e00000 */
/*0040*/ BRA 0x40; /* 0xfffffff000007947 */
/* 0x000fc0000383ffff */
/*0050*/ NOP; /* 0x0000000000007918 */
/* 0x000fc00000000000 */
/*0060*/ NOP; /* 0x0000000000007918 */
/* 0x000fc00000000000 */
/*0070*/ NOP; /* 0x0000000000007918 */
/* 0x000fc00000000000 */
The compiler generates “FMUL.RZ R4, R4, 0.15915493667125701904 ;” that is multiply by 1/(2PI).
Would it be possible to have __sin2pif, __cos2pif and __sincos2pif so that the two multiplications that cancel each other could be omitted?
Has the CUDA profiler actually identified this computation as a bottleneck in your code? BTW, the generated code you are showing appears to be machine code (SASS), not intermediate code (PTX).
In floating-point arithmetic, generally a * (1/a) != 1, so the multiplications don’t actually “cancel”. Also, there is the chance of intermediate overflow or underflow, leading to drastically different results. In other words, the transformation envisioned violates the as-if rule of optimizations generally used with C++.
One could imagine the desired transformation taking place with a “relaxed” optimization setting, in particular --use_fast_math. Have you tried that? A practical problem I see is how much error tolerance to allow as part of such a transformation, e.g. should x * 0.33 * 3 likewise be simplified to x?
I have the flag --use_fast_math.
The sin and cos are in critical path that is run 111000 times on 1024*1024 (1M) floats. So the program could save approx 222 000 000 000 multiply operations.
Is there a way to make in line PTX to emit sin.approx.f32 without the compiler generating the mul by 0.159… before the sin or cos instruction.
I tried this
device float inline __sin2pif(float a)
asm(“sin.approx.f32 %0, %1;” : “=f”(v) : “f”(a));
but it generates for sm70 and sm75 the extra mul that divides by 2PIf before sin (and cos).
My point was: Saving those FMUL instructions does not necessarily translate into a commensurate reduction in execution time. If the code shown is in fact limiting performance, my guess would be that it’s limited by MUFU throughput.
Based on the article that was published about the design of the MUFU (multi-function unit) it works with fixed-point computation internally, so there needs to be a conversion from single-precision floating-point to the internal format it uses. In earlier architectures, this was done with a RRO.SINCOS (range reduction operation for trig functions), but from your post it seems that sm_75 changed the interface for MUFU.SIN slightly so that a simple FMUL can be used instead (this was, frankly, news to me).
This means you cannot get rid of that FMUL, as PTX’s sin.approx.f32 essentially translates to a canned two-instruction sequence FMUL/MUFU.SIN at SASS level.
If you can formulate a well-reasoned proposal on how consecutive floating-point multiplications by constants could be reduced to a single multiply (or even eliminated in the best case when the resulting multiplier is 1) as part of, or in combination with, --use_fast_math, consider filing an enhancement request with NVIDIA.
I was considering an extension to existing API that has sinpif etc. so that it could have __sin2pif on new architectures generating code without the mul and on old architectures it would do mul 2PIf and rro and then sin.
But I’ll wait.
I can focus on other parts of the program (seti).
Since CUDA already has math functions sinpif() and cospif() that programmers would normally use for use cases like yours, an alternative proposal might be to add corresponding device-function intrinsics __sinpif(), __cospif(), __sincospif().
These could then be mapped to MUFU.SIN, MUFU.COS using a multiplier of 0.5 instead (1/(2*PI)), which would make combination with other multipliers easier when --use_fast_math is in effect as multiplications by powers of two are exact except in the case of denormals (subnormals), which don’t come into play (because --use_fast_math includes -ftz=true).
In any event nothing is going happen until an RFE (request for enhancement) is filed with NVIDIA, at which point the compiler and floating-point experts at NVIDIA could start thinking about all possible consequences of such a proposal and either adopt it for future implementation or shoot it down because of an issue we haven’t thought about yet (design by forum posts usually doesn’t work too well :-).