Suboptimal SASS from compiler for some simple operations

I have the following code, compiled with -use_fast_math for compute capability 8.6:

__device__ float saturate_float(float value) {
    // https://stackoverflow.com/a/46748056
    return max(0.0f, min(1.0f, value));
}

__global__ void test(float *arbitrary_values) {
    const float TWO_PI = 2.0*M_PI;
    
    float input = arbitrary_values[0];
    
   
    float result1 = saturate_float(input + TWO_PI);
    // Expectation: Should compile to single instruction add.sat/fma.sat
    // Reality: Compiles to two instructions.
    // SASS: FADD.FTZ R0, R0, 6.2831854820251464844
    //       FADD.FTZ.SAT R0, RZ, R0 
    // Using __saturatef(input + PI) does not help.
    
    float result2 = cos(input*TWO_PI);
    // Expectation: Should compile to single instruction.
    // Reality: Compiles to three.
    // SASS: FMUL.FTZ R4, R0, 6.2831854820251464844 
    //       FMUL.RZ R5, R4, 0.15915493667125701904 
    //       MUFU.COS R5, R5
    // Would be nice with a __cospi intrinsic, just to avoid the redundant FMUL.RZ.
    // Using __cosf(__fmaf_ieee_rz(input, PI, 0.0f)) does not help.
    
    bool result3 = -0.5f <= input & input <= 0.5f;
    // Expectation: Should compile to the same as result4.
    // Reality: Compiles to three instructions.
    // SASS:  FSETP.GE.FTZ.AND P0, PT, R0, -0.5, PT 
    //        FSETP.GTU.FTZ.AND P1, PT, R0, 0.5, PT 
    //        PLOP3.LUT P0, PT, P0, P1, PT, 0x20, 0x0 
    // Using && does not help.
    
    bool result4 = abs(input) <= 0.5f;
    // Reality: Compiles to single SASS instruction (yay!)
    //   PTX: abs.f32 followed by setp.le.f32
    //  SASS: FSETP.GTU.FTZ.AND P1, PT, |R0|, 0.5, PT 
    // I've also seen the single instruction FSET.BF.LE.FTZ.AND used instead.
    
    // Arbitrary return value to ensure code is not pruned by optimizer.
    arbitrary_values[0] = result1 + result2 + result3 + result4;
}

For results 1-3, it seems the compiler generates suboptimal code, at least for result1 and result3. Any thoughts on why?

Without -use_fast_math, I see that the saturation is successfully merged with the addition for result1, so this looks like a case where the merging does not happen for the .ftz variant of fadd. I cannot think of a reason for that; the optimization may erroneously not be applied to variants other than the default variant of fadd. I would file a bug for this.

        /*0040*/                   LDG.E R0, [R2.64] ;                       /* 0x0000000402007981 */
        /*0050*/                   FADD.SAT R5, R0, 6.2831854820251464844 ;  /* 0x40c90fdb00057421 */
        /*0060*/                   STG.E [R2.64], R5 ;                       /* 0x0000000502007986 */

I don’t think the expectation is realistic, because 6.2831854820251464844 * 0.15915493667125701904 != 1.0. As for why the two multiplications are not merged: there may be a phase-ordering issue where __cosf is expanded late (the expansion is GPU-architecture specific; on older architectures it expands into RRO followed by MUFU.COS), after constant propagation has already run. In addition, other than for FMA contraction, the CUDA compiler used to be quite conservative about merging floating-point operations (consider issues of intermediate overflow, for example, which could make the behavior quite different between the merged and unmerged versions here). Given that -use_fast_math is specified in the actual use case, one might argue that merging the two FMULs would be appropriate, because adherence to “as-is” requirements is relaxed with that compilation flag. Consider filing an enhancement request (RFE) for this.
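To make that concrete, here is a small host-side sketch (my own illustration, not from the thread; the two constants are copied from the SASS above). It shows that the exact product of the constants is slightly below 1.0, and that folding the two multiplies into one would change results for large inputs:

#include <cstdio>

int main() {
    // Constants as they appear in the SASS above: float(2*pi) and the
    // scale factor applied before MUFU.COS.
    float two_pi     = 6.2831854820251464844f;
    float inv_two_pi = 0.15915493667125701904f;

    // Exact product, computed in double: slightly below 1.0, so under
    // strict "compute as written" semantics the two FMULs cannot simply
    // cancel each other.
    printf("product = %.17g\n", (double)two_pi * (double)inv_two_pi);

    // Intermediate-overflow illustration: with the multiplies kept
    // separate, the first product overflows to infinity, while a single
    // multiply by the folded constant (~1.0f) would stay finite.
    float x = 3.0e38f;                               // close to FLT_MAX
    float unmerged = (x * two_pi) * inv_two_pi;      // +inf
    float merged   = x * (two_pi * inv_two_pi);      // ~3.0e38
    printf("unmerged = %g, merged = %g\n", unmerged, merged);
    return 0;
}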

As for result3, collapsing this into code equivalent to that for result4 is the kind of reasoning that is trivial for a human, but it probably poses interesting issues inside a compiler. A specialized peephole optimization seems feasible, but adding to an ever-growing list of peephole optimizations may not be desirable. A more generalized approach based on range tracking of floating-point data is likely challenging and expensive, with not much resulting speed-up on average, so the trade-off looks unfavorable. You might want to consider filing an enhancement request (RFE), though; the CUDA compiler team may have a different take than the one I have given here. In the meantime the rewrite can be done by hand, as in the sketch below.
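For completeness, a sketch of that manual rewrite (my own addition, not from the thread). It relies on the absolute-value comparison mapping to a single FSETP with the |.| operand modifier, as observed for result4. Note that for a general interval, computing the center and half-width can shift the bounds by an ulp, so this is not a bit-exact replacement for the two-sided test:

// Symmetric case from the question, written so the |.| compare applies directly.
__device__ bool in_unit_half_range(float input) {
    return fabsf(input) <= 0.5f;   // expected to compile like result4
}

// General interval [lo, hi], recentered so a single absolute-value compare remains.
__device__ bool in_range(float x, float lo, float hi) {
    float center     = 0.5f * (lo + hi);
    float half_width = 0.5f * (hi - lo);
    return fabsf(x - center) <= half_width;
}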

Thanks for the feedback.
I was wondering what the counterexample for result2 would be, and the possibility of intermediate overflow makes sense.
In my use case I am calling cos(pi*saturate(x)), but the multiplications are still not merged. I’ll look into filing some bug/enhancement requests.

Thanks again :)