ptxas cannot generate reasonable SASS

My PTX code, compiled with -O0, produces unreasonable SASS in CUDA 11.1: there are more MOV instructions than expected.
The source PTX is below:

.reg .f32 f1,f2,f3;
fma.rn.f32 f1, f1, f1, f1;
fma.rn.f32 f2, f2, f2, f2;
fma.rn.f32 f3, f3, f3, f3;
ret;

ptxas generates SASS like this:

        /*0008*/                   MOV R1, c[0x0][0x20] ;  /* 0x4c98078000870001 */
        /*0010*/                   FFMA R0, R0, R0, R0 ;   /* 0x5980000000070000 */
        /*0018*/                   FFMA R2, R2, R2, R2 ;   /* 0x5980010000270202 */
                                                           /* 0x007fbc03fde01fef */
        /*0028*/                   FFMA R3, R3, R3, R3 ;   /* 0x5980018000370303 */
        /*0030*/                   MOV R0, R0 ;            /* 0x5c98078000070000 */
        /*0038*/                   MOV R2, R2 ;            /* 0x5c98078000270002 */
                                                           /* 0x001ffc03fde01fef */
        /*0048*/                   MOV R3, R3 ;            /* 0x5c98078000370003 */
        /*0050*/                   EXIT ;                  /* 0xe30000000007000f */

I expected SASS like this:

        /*0008*/                   MOV R1, c[0x0][0x20] ;  /* 0x4c98078000870001 */
        /*0010*/                   FFMA R0, R0, R0, R0 ;   /* 0x5980000000070000 */
        /*0018*/                   FFMA R2, R2, R2, R2 ;   /* 0x5980010000270202 */
                                                           /* 0x007fbc03fde01fef */
        /*0028*/                   FFMA R3, R3, R3, R3 ;   /* 0x5980018000370303 */
        /*0030*/                   EXIT ;                  /* 0xe30000000007000f */

PS: -O1 removes all of the code, which is not what I want orz…

If you want the most optimized SASS code, you must not restrict the compiler from optimizing. Make use of the results of your operations, so that the compiler cannot optimize them away.

I want to benchmark latency and throughput using PTX. Without -O0, I have to add some memory read/write code to keep the instructions from being optimized away >_<

You can use a conditional test so that the read/write never actually executes at run time. But I can’t tell you how to generate exactly the instruction sequence you desire with no other instructions. There may be a trick or a way, I just don’t know it.

Good idea, I will try it.

The optimizer is too powerful: the final code either does the read/write or does nothing at all…
I will wait for a more reasonable -O0 mode…

What I had in mind was something like this:

.visible .func _Z1tfffPf(
        .param .b32 _Z1tfffPf_param_0,
        .param .b32 _Z1tfffPf_param_1,
        .param .b32 _Z1tfffPf_param_2,
        .param .b64 _Z1tfffPf_param_3
)
{
        .reg .pred      %p<2>;
        .reg .f32       %f<11>;
        .reg .b64       %rd<2>;


        ld.param.f32    %f2, [_Z1tfffPf_param_0];
        ld.param.f32    %f3, [_Z1tfffPf_param_1];
        ld.param.f32    %f4, [_Z1tfffPf_param_2];
        ld.param.u64    %rd1, [_Z1tfffPf_param_3];
        fma.rn.f32      %f5, %f2, %f2, %f2;
        fma.rn.f32      %f6, %f3, %f3, %f3;
        fma.rn.f32      %f8, %f4, %f4, %f4;
        add.f32         %f7, %f5, %f6;
        add.f32         %f1, %f7, %f8;
        setp.neu.f32    %p1, %f1, 0f00000000;
        @%p1 bra        BB0_2;

        ld.f32  %f9, [%rd1];
        add.f32         %f10, %f1, %f9;
        st.f32  [%rd1], %f10;

BB0_2:
        ret;
}

The compiler will not optimize out the ffma instructions, even at maximum optimization. And the write will not occur as long as the three results do not sum to zero, for example if all of the inputs are positive. However, I cannot get it down to just the instructions you mention. The conditional code adds other instructions.
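For reference, the PTX above is roughly what nvcc might emit for a CUDA function like the one sketched here. The mangled name _Z1tfffPf demangles to t(float, float, float, float*); the parameter names, the wrapper kernel `run`, the input values, and the host code are assumptions added only so the sketch compiles and runs on its own.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Three independent FMAs whose results are all consumed by a sum.
// The load/store is guarded by a data-dependent test, so the compiler
// has to keep the FMAs, but the memory access only happens when the
// sum is exactly zero.
__device__ __noinline__ void t(float a, float b, float c, float *out)
{
    float r0 = fmaf(a, a, a);
    float r1 = fmaf(b, b, b);
    float r2 = fmaf(c, c, c);
    float sum = r0 + r1 + r2;
    if (sum == 0.0f)          // false for the inputs used in main()
        *out += sum;
}

// Hypothetical wrapper kernel; the inputs arrive as kernel arguments so
// the compiler cannot fold the FMAs at compile time.
__global__ void run(float a, float b, float c, float *out)
{
    t(a, b, c, out);
}

int main()
{
    float *d_out = nullptr;
    cudaMalloc(&d_out, sizeof(float));
    cudaMemset(d_out, 0, sizeof(float));

    run<<<1, 1>>>(1.0f, 2.0f, 3.0f, d_out);
    cudaError_t err = cudaDeviceSynchronize();
    printf("kernel status: %s\n", cudaGetErrorString(err));

    cudaFree(d_out);
    return err == cudaSuccess ? 0 : 1;
}
```

Dumping the result with `cuobjdump -sass` on the compiled binary is one way to check whether the three FFMA instructions survived and which extra instructions the conditional added.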

How many years are you willing to wait? :-) Doesn’t do what you want != unreasonable. There are a number of published papers from people who crafted microbenchmarks for successfully extracting various parameters of several GPU architectures, including instruction latency and throughput.

In the absence of an assembler, you will need to get creative. Two classical ways to address the overhead of a measuring framework are to increase the time spent in code under test, so any overhead becomes small relative to the time spent executing code under test, and/or to add calibration to measure the overhead and subtract it out.
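A minimal sketch of both ideas, assuming `clock64()` is an acceptable timer for this purpose: a long chain of dependent FMAs makes the code under test dominate, and a back-to-back pair of clock reads serves as the calibration that is subtracted out. The kernel name, the chain length `N`, and the seed value are hypothetical choices. The compiler is free to schedule arithmetic around the clock reads, so the SASS should be inspected before trusting the numbers.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

constexpr int N = 1024;  // length of the dependent FMA chain (arbitrary choice)

__global__ void time_fma_chain(float seed, float *sink, long long *cycles)
{
    // Calibration: two clock reads with nothing in between.
    long long c0 = clock64();
    long long c1 = clock64();

    // Measurement: N FMAs, each depending on the previous result.
    float x = seed;
    long long t0 = clock64();
#pragma unroll
    for (int i = 0; i < N; ++i)
        x = fmaf(x, x, x);
    long long t1 = clock64();

    cycles[0] = c1 - c0;  // overhead of the timing harness alone
    cycles[1] = t1 - t0;  // harness overhead plus N dependent FMAs
    *sink = x;            // consume the result so the chain is kept
}

int main()
{
    float *d_sink = nullptr;
    long long *d_cycles = nullptr;
    cudaMalloc(&d_sink, sizeof(float));
    cudaMalloc(&d_cycles, 2 * sizeof(long long));

    time_fma_chain<<<1, 1>>>(0.5f, d_sink, d_cycles);

    long long h_cycles[2] = {0, 0};
    cudaError_t err = cudaMemcpy(h_cycles, d_cycles, sizeof(h_cycles),
                                 cudaMemcpyDeviceToHost);
    if (err != cudaSuccess) {
        printf("CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // Subtract the calibrated overhead, then average over the chain.
    double per_fma = double(h_cycles[1] - h_cycles[0]) / N;
    printf("harness overhead: %lld cycles\n", h_cycles[0]);
    printf("estimated cycles per dependent FMA: %.2f\n", per_fma);

    cudaFree(d_sink);
    cudaFree(d_cycles);
    return 0;
}
```

Increasing `N` shrinks the relative weight of whatever overhead the calibration fails to capture, at the cost of a longer fully unrolled kernel.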

I have measured the FMA throughput and latency by adding a local write instruction outside the main loop. But that approach is not feasible for measuring other architecture parameters. If -O0 did not add redundant MOVs, it would be a precise way to run many tests directly.

Good, this has lower overhead than my direct local write, but it needs a reduction sequence so that every FMA result register is referenced by the fake write instruction.

That’s right. You’re producing 3 (independent) results, so it stands to reason that you will subsequently need at least one instruction that depends on each of those 3 results, in order to keep the compiler from optimizing out any of those 3 instructions. There may be a more clever way than 2 adds and a conditional test. I haven’t spent much time thinking about it. For instance, if you’re calling this in a loop, you could simply keep a running sum, and then perform your test at the end. That gives a slight reduction in average overhead.
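The running-sum variant could look something like the sketch below. The kernel name, the iteration count, and the input values are assumptions; feeding each FMA result back into the next iteration is an added detail, used here so that the compiler cannot hoist the FMAs out of the loop as loop-invariant.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void fma_loop(float a, float b, float c, int iters, float *out)
{
    float acc = 0.0f;
    for (int i = 0; i < iters; ++i) {
        // Each result feeds the next iteration, so the FMAs cannot be
        // computed once and reused.
        a = fmaf(a, a, a);
        b = fmaf(b, b, b);
        c = fmaf(c, c, c);
        acc += a + b + c;  // running sum references all three results
    }
    // A single conditional test after the loop, instead of one per iteration.
    if (acc == 0.0f)
        *out += acc;
}

int main()
{
    float *d_out = nullptr;
    cudaMalloc(&d_out, sizeof(float));
    cudaMemset(d_out, 0, sizeof(float));

    fma_loop<<<1, 1>>>(0.25f, 0.5f, 0.75f, 1000, d_out);
    cudaError_t err = cudaDeviceSynchronize();
    printf("kernel status: %s\n", cudaGetErrorString(err));

    cudaFree(d_out);
    return err == cudaSuccess ? 0 : 1;
}
```

The adds are still executed every iteration, so only the compare, branch, and guarded load/store are amortized over the loop.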