Counting FLOPS based on SASS code.

I have the SASS code of my bottleneck kernel:

Function : _Z5DChi2P6float2S0_PfS0_S1_S1_S1_llffffff
	.headerflags    @"EF_CUDA_SM37 EF_CUDA_PTX_SM(EF_CUDA_SM37)"
                                                                                    /* 0x088c8010a0a08c00 */
        /*0008*/                   MOV R1, c[0x0][0x44];                            /* 0x64c03c00089c0006 */
        /*0010*/                   S2R R0, SR_CTAID.Y;                              /* 0x86400000131c0002 */
        /*0018*/                   S2R R3, SR_TID.Y;                                /* 0x86400000111c000e */
        /*0020*/                   IMAD R4, R0, c[0x0][0x2c], R3;                   /* 0x51080c00059c0012 */
        /*0028*/                   IMUL.U32.U32 R6.CC, R4, c[0x0][0x178];           /* 0x61c400002f1c101a */
        /*0030*/                   ISET.LT.AND R2, R4, RZ, PT;                      /* 0xda981c007f9c100a */
        /*0038*/                   S2R R0, SR_CTAID.X;                              /* 0x86400000129c0002 */
                                                                                    /* 0x08809c8010a08c9c */
        /*0048*/                   S2R R3, SR_TID.X;                                /* 0x86400000109c000e */
        /*0050*/                   IMAD.U32.U32.HI.X R5.CC, R4, c[0x0][0x178], RZ;  /* 0x5217fc002f1c1016 */
        /*0058*/                   IMAD R3, R0, c[0x0][0x28], R3;                   /* 0x51080c00051c000e */
        /*0060*/                   IMAD.U32.U32.X R5, R2, c[0x0][0x178], R5;        /* 0x501014002f1c0816 */
        /*0068*/                   ISET.LT.AND R2, R3, RZ, PT;                      /* 0xda981c007f9c0c0a */
        /*0070*/                   IADD R0.CC, R6, R3;                              /* 0xe0840000019c1802 */
        /*0078*/                   IMAD.U32.U32 R5, R4, c[0x0][0x17c], R5;          /* 0x500014002f9c1016 */
                                                                                    /* 0x08b0dca0b010a09c */
        /*0088*/                   SHF.L R13, RZ, 0x3, R0;                          /* 0xb7c00000019ffc35 */
        /*0090*/                   IADD.X R12, R5, R2;                              /* 0xe0804000011c1432 */
        /*0098*/                   IADD R6.CC, R13, c[0x0][0x140];                  /* 0x60840000281c341a */
        /*00a0*/                   SHF.L.U64 R14, R0, 0x3, R12;                     /* 0xb7c03200019c0039 */
        /*00a8*/                   IADD.X R7, R14, c[0x0][0x144];                   /* 0x60804000289c381e */
        /*00b0*/                   LD.E R2, [R6];                                   /* 0xc4800000001c1808 */
        /*00b8*/                   FSETP.GTU.AND P0, PT, R2, c[0x0][0x18c], PT;     /* 0x5de01c00319c081e */
                                                                                    /* 0x08a0109c108010b8 */
        /*00c8*/               @P0 EXIT;                                            /* 0x180000000000003c */
        /*00d0*/                   F2I.S32.F32.TRUNC R2, c[0x0] [0x190];            /* 0x65800c00321c680a */
        /*00d8*/                   MOV R21, RZ;                                     /* 0xe4c03c007f9c0056 */
        /*00e0*/                   MOV R7, c[0x0][0x180];                           /* 0x64c03c00301c001e */
        /*00e8*/                   MOV R6, c[0x0][0x184];                           /* 0x64c03c00309c001a */
        /*00f0*/                   F2I.S32.F32.TRUNC R5, c[0x0] [0x194];            /* 0x65800c00329c6816 */
        /*00f8*/                   ISUB R2, R3, R2;                                 /* 0xe0880000011c0c0a */
                                                                                    /* 0x088ca010a010a010 */
        /*0108*/                   ISUB RZ.CC, R7, 0x1;                             /* 0xc08c0000009c1ffd */
        /*0110*/                   I2F.F32.S32 R2, R2;                              /* 0xe5c00000011ca80a */
        /*0118*/                   ISUB R3, R4, R5;                                 /* 0xe0880000029c100e */
        /*0120*/                   FMUL R2, R2, c[0x0][0x198];                      /* 0x63400000331c080a */
        /*0128*/                   ISETP.LT.X.AND P0, PT, R6, RZ, PT;               /* 0xdb185c007f9c181e */
        /*0130*/                   I2F.F32.S32 R3, R3;                              /* 0xe5c00000019ca80e */
        /*0138*/                   FMUL R6, R3, c[0x0][0x19c];                      /* 0x63400000339c0c1a */
                                                                                    /* 0x0880108810801000 */
        /*0148*/               @P0 BRA 0x3c0;                                       /* 0x120000013800003c */
        /*0150*/                   FMUL32I R15, R2, 0.017453292384743690491;        /* 0x201e477d1a9c083e */
        /*0158*/                   FMUL32I R16, R6, 0.017453292384743690491;        /* 0x201e477d1a9c1842 */
        /*0160*/                   MOV R2, c[0x0][0x158];                           /* 0x64c03c002b1c000a */
        /*0168*/                   MOV R3, c[0x0][0x15c];                           /* 0x64c03c002b9c000e */
        /*0170*/                   MOV R4, c[0x0][0x170];                           /* 0x64c03c002e1c0012 */
        /*0178*/                   MOV R5, c[0x0][0x174];                           /* 0x64c03c002e9c0016 */
                                                                                    /* 0x0884801080108810 */
        /*0188*/                   MOV R6, c[0x0][0x168];                           /* 0x64c03c002d1c001a */
        /*0190*/                   MOV R7, c[0x0][0x16c];                           /* 0x64c03c002d9c001e */
        /*0198*/                   MOV R8, c[0x0][0x160];                           /* 0x64c03c002c1c0022 */
        /*01a0*/                   MOV R9, c[0x0][0x164];                           /* 0x64c03c002c9c0026 */
        /*01a8*/                   MOV R21, RZ;                                     /* 0xe4c03c007f9c0056 */
        /*01b0*/                   MOV R17, RZ;                                     /* 0xe4c03c007f9c0046 */
        /*01b8*/                   MOV R18, RZ;                                     /* 0xe4c03c007f9c004a */
                                                                                    /* 0x08a0a0cc8c100010 */
        /*01c8*/                   MOV32I R20, 0xbab6061a;                          /* 0x745d5b030d1fc052 */
        /*01d0*/                   MOV32I R19, 0x3c08839e;                          /* 0x741e0441cf1fc04e */
        /*01d8*/                   LD.E R11, [R6];                                  /* 0xc4800000001c182c */
        /*01e0*/                   IADD R17.CC, R17, 0x1;                           /* 0xc0840000009c4445 */
        /*01e8*/                   LD.E R10, [R8];                                  /* 0xc4800000001c2028 */
        /*01f0*/                   FMUL R11, R16, R11;                              /* 0xe3400000059c402e */
        /*01f8*/                   FFMA R10, R15, R10, R11;                         /* 0xcc002c00051c3c2a */
                                                                                    /* 0x08a010a010a010a0 */
        /*0208*/                   FADD R22, R10, R10;                              /* 0xe2c00000051c285a */
        /*0210*/                   FADD R10, R22, R22;                              /* 0xe2c000000b1c582a */
        /*0218*/                   F2F.F32.F32.TRUNC R29, R22;                      /* 0xe5402c000b1c2876 */
        /*0220*/                   F2F.F32.F32.ROUND R11, R10;                      /* 0xe5402000051c282e */
        /*0228*/                   FSETP.NEU.AND P1, PT, R22, R29, PT;              /* 0xdde81c000e9c583e */
        /*0230*/                   FFMA R10, -R11, 0.5, R22;                        /* 0x940859f8001c2c29 */
        /*0238*/                   F2I.S32.F32.TRUNC R11, R11;                      /* 0xe5800c00059c682e */
                                                                                    /* 0x089c1014a0a010a0 */
        /*0248*/                   FMUL32I R23, R10, 1.5099580252808664227e-07;     /* 0x201a1110b49c285e */
        /*0250*/                   FFMA R10, R10, c[0x2][0x0], R23;                 /* 0x4c005c40001c282a */
        /*0258*/                   FSETP.LEU.AND P3, PT, |R22|, 16777216, PT;       /* 0xb5d81e5c001c5a7d */
        /*0260*/                   FMUL R24, R10, R10;                              /* 0xe3400000051c2862 */
        /*0268*/                   FFMA R23, R24, c[0x2][0x4], R20;                 /* 0x4c005040009c605e */
        /*0270*/                   FFMA R27, R24, c[0x2][0x14], R19;                /* 0x4c004c40029c606e */
        /*0278*/                   IADD.X R18, R18, RZ;                             /* 0xe08040007f9c484a */
                                                                                    /* 0x089c809c8010a010 */
        /*0288*/                   FFMA R25, R23, R24, c[0x2][0x8];                 /* 0x8c006040011c5c66 */
        /*0290*/                   LOP.AND R23, R11, 0x1;                           /* 0xc2000000009c2c5d */
        /*0298*/                   FFMA R27, R27, R24, c[0x2][0x18];                /* 0x8c006040031c6c6e */
        /*02a0*/                   ISETP.EQ.U32.AND P0, PT, R23, 0x1, PT;           /* 0xb3201c00009c5c1d */
        /*02a8*/                   FFMA R25, R25, R24, c[0x2][0xc];                 /* 0x8c006040019c6466 */
        /*02b0*/                   FFMA R27, R27, R24, RZ;                          /* 0xcc03fc000c1c6c6e */
        /*02b8*/                   FFMA R23, R25, R24, c[0x2][0x10];                /* 0x8c006040021c645e */
                                                                                    /* 0x08801090108c10a0 */
        /*02c8*/                   FFMA R24, R27, R10, R10;                         /* 0xcc002800051c6c62 */
        /*02d0*/                   IADD R10, R11, 0x1;                              /* 0xc0800000009c2c29 */
        /*02d8*/                   SEL R28, R24, R23, !P0;                          /* 0xe50020000b9c6072 */
        /*02e0*/                   ISUB RZ.CC, R17, c[0x0][0x180];                  /* 0x608c0000301c47fe */
        /*02e8*/                   SEL R24, R23, R24, !P0;                          /* 0xe50020000c1c5c62 */
        /*02f0*/                   LOP.AND R25, R11, 0x2;                           /* 0xc2000000011c2c65 */
        /*02f8*/                   LOP.AND R27, R10, 0x2;                           /* 0xc2000000011c286d */
                                                                                    /* 0x08a4808080819810 */
        /*0308*/                   FMUL R26, -R28, 1;                               /* 0xc34801fc001c7069 */
        /*0310*/                   LD.E.64 R10, [R2];                               /* 0xc5800000001c0828 */
        /*0318*/                   ISETP.LT.X.AND P0, PT, R18, c[0x0][0x184], PT;   /* 0x5b185c00309c481e */
        /*0320*/                   IADD R8.CC, R8, 0x4;                             /* 0xc0840000021c2021 */
        /*0328*/                   ISETP.EQ.AND P2, PT, R27, RZ, PT;                /* 0xdb281c007f9c6c5e */
        /*0330*/                   ICMP.EQ R26, R28, R26, R25;                      /* 0xda2864000d1c706a */
        /*0338*/              @!P1 FMUL R26, R22, RZ;                               /* 0xe34000007fa4586a */
                                                                                    /* 0x08b08010b0108010 */
        /*0348*/                   LD.E R22, [R4];                                  /* 0xc4800000001c1058 */
        /*0350*/                   IADD.X R9, R9, RZ;                               /* 0xe08040007f9c2426 */
        /*0358*/                   IADD R6.CC, R6, 0x4;                             /* 0xc0840000021c1819 */
        /*0360*/              @!P2 FFMA R24, R24, -1, RZ;                           /* 0x9c03fdfc00286061 */
        /*0368*/                   IADD.X R7, R7, RZ;                               /* 0xe08040007f9c1c1e */
        /*0370*/              @!P3 FADD R24, R26, 1;                                /* 0xc2c001fc002c6861 */
        /*0378*/                   IADD R4.CC, R4, 0x4;                             /* 0xc0840000021c1011 */
                                                                                    /* 0x08b88010a09c8010 */
        /*0388*/                   FMUL R23, R24, R10;                              /* 0xe3400000051c605e */
        /*0390*/                   IADD.X R5, R5, RZ;                               /* 0xe08040007f9c1416 */
        /*0398*/                   IADD R2.CC, R2, 0x8;                             /* 0xc0840000041c0809 */
        /*03a0*/                   FFMA R10, -R26, R11, R23;                        /* 0xcc085c00059c682a */
        /*03a8*/                   IADD.X R3, R3, RZ;                               /* 0xe08040007f9c0c0e */
        /*03b0*/                   FFMA R21, R22, R10, R21;                         /* 0xcc005400051c5856 */
        /*03b8*/               @P0 BRA 0x1d8;                                       /* 0x12007fff0c00003c */
                                                                                    /* 0x0880b8a010a0b010 */
        /*03c8*/                   IADD R2.CC, R13, c[0x0][0x148];                  /* 0x60840000291c340a */
        /*03d0*/                   SHF.L.U64 R5, R0, 0x2, R12;                      /* 0xb7c03200011c0015 */
        /*03d8*/                   IADD.X R3, R14, c[0x0][0x14c];                   /* 0x60804000299c380e */
        /*03e0*/                   LD.E R2, [R2];                                   /* 0xc4800000001c0808 */
        /*03e8*/                   MOV R4, c[0x0][0x150];                           /* 0x64c03c002a1c0012 */
        /*03f0*/                   IMAD.U32.U32 R4.CC, R0, 0x4, R4;                 /* 0xa0041000021c0011 */
        /*03f8*/                   FMUL R0, R2, c[0x0][0x188];                      /* 0x63400000311c0802 */
                                                                                    /* 0x080000b81000a09c */
        /*0408*/                   IADD.X R5, R5, c[0x0][0x154];                    /* 0x608040002a9c1416 */
        /*0410*/                   FMUL R0, R21, R0;                                /* 0xe3400000001c5402 */
        /*0418*/                   ST.E [R4], R0;                                   /* 0xe4800000001c1000 */
        /*0420*/                   MOV RZ, RZ;                                      /* 0xe4c03c007f9c03fe */
        /*0428*/                   EXIT;                                            /* 0x18000000001c003c */
        /*0430*/                   BRA 0x430;                                       /* 0x12007ffffc1c003c */
        /*0438*/                   NOP;                                             /* 0x85800000001c3c02 */
		..........................................................

I’m trying to count the FLOPS of this kernel so that I can estimate them with a formula. For example, supposing that the kernel is executed by 64*64 predicated threads, and that the loop closed by @P0 BRA 0x1d8 (at address 0x03b8) is executed 107494 times, then my FLOPS would be:

FLOPS = 64*64*(6 + (107494*37)) = 1.6291e+10

The thing is that if I compare this result with the one I obtained with nvprof (I profiled the app with the same number of threads and the same 107494 loop iterations), which is 14957864170, the numbers are close but do not match. Why is this happening?
Or am I perhaps missing something?
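
As a sanity check on the arithmetic, here is a small Python sketch that recomputes the hand estimate and compares it with the nvprof figure. All numbers come from the post above; the variable names are made up for illustration, and the sketch assumes that every one of the 64*64 threads runs all 107494 iterations.

```python
# Figures quoted in the post above.
threads = 64 * 64            # threads assumed to run the loop
iterations = 107494          # trips through the loop closed by "@P0 BRA 0x1d8"
flops_outside_loop = 6       # per-thread FLOPs counted outside the loop
flops_per_iteration = 37     # per-thread FLOPs counted inside the loop
nvprof_flops = 14957864170   # value reported by nvprof

estimate = threads * (flops_outside_loop + iterations * flops_per_iteration)
relative_gap = (estimate - nvprof_flops) / nvprof_flops

# Per-iteration count that would reproduce the profiler figure, keeping the
# thread count, trip count, and the 6 outside-loop FLOPs fixed.
implied_per_iteration = (nvprof_flops / threads - flops_outside_loop) / iterations

print(f"hand estimate : {estimate:.4e}")
print(f"nvprof        : {nvprof_flops:.4e}")
print(f"relative gap  : {relative_gap:.1%}")
print(f"implied FLOPs per iteration: {implied_per_iteration:.2f}")
```

With these inputs the hand estimate is roughly 9% above the nvprof value, and the profiler figure corresponds to about 34 FLOPs per iteration rather than 37.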

Counting FLOPS is usually a frustrating exercise (my personal take: best avoided), because there is no universal agreement on what to count, how to count, or where to count.

Does FFMA count as one or two FLOPs (I think two makes the most sense, but opinions differ)?
Do F2I, F2F, FSETP count as FLOPs?
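
To make the dependence on the counting convention concrete, the Python sketch below applies three hypothetical conventions to the loop body of the kernel above. The opcode counts were tallied by hand from the listing (loop from 0x01d8 to 0x03b8). The convention names and weights are illustrative assumptions, not any tool's official definition.

```python
# Floating-point-related opcodes in the loop body (0x01d8 .. 0x03b8),
# tallied by hand from the listing above; FMUL includes the one FMUL32I.
loop_opcodes = {"FFMA": 14, "FMUL": 6, "FADD": 3, "F2F": 2, "F2I": 1, "FSETP": 2}

# Three hypothetical counting conventions (weights are FLOPs per instruction).
conventions = {
    "FFMA=2, arithmetic only": {"FFMA": 2, "FMUL": 1, "FADD": 1},
    "FFMA=1, arithmetic only": {"FFMA": 1, "FMUL": 1, "FADD": 1},
    "FFMA=2, plus convert/compare": {
        "FFMA": 2, "FMUL": 1, "FADD": 1, "F2F": 1, "F2I": 1, "FSETP": 1,
    },
}


def flops_per_iteration(opcodes, weights):
    """Sum instruction counts weighted by FLOPs; unlisted opcodes count as 0."""
    return sum(n * weights.get(op, 0) for op, n in opcodes.items())


totals = {name: flops_per_iteration(loop_opcodes, w) for name, w in conventions.items()}
for name, total in totals.items():
    print(f"{name:30s} -> {total} FLOPs per iteration")
```

The same loop comes out at 37, 23, or 42 FLOPs per iteration depending on the convention, so two counts can only be compared once the convention is pinned down.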

I don’t spot any MUFU instructions in the code; their presence would make the FLOP counting even more “interesting”.

The other complication is typically where FLOPs get counted in and by the hardware: at decode stage, at issue stage, at retire stage? These counts tend to differ because of branching etc.

According to http://docs.nvidia.com/gameworks/content/developertools/desktop/analysis/report/cudaexperiments/kernellevel/achievedflops.htm

FFMA counts as two FLOPs. On the other hand, FADD, FMUL and other special operations count as one. However, the page doesn’t mention anything about F2I, F2F and FSETP, which makes me think that they are not counted.
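
Under the weights just described (FFMA = 2, FADD and FMUL = 1, conversions and compares = 0), the loop can be tallied mechanically. The Python sketch below does this for the floating-point lines of the loop body, copied from the listing above with the encodings dropped. It is a rough illustration, not a general SASS parser. It treats FMUL32I as an ordinary FMUL and keeps predicated instructions in a separate bucket; both choices are assumptions. The names `WEIGHTS`, `LINE` and `tally` are hypothetical.

```python
import re

# FLOPs per instruction; opcodes not listed here (F2F, FSETP, ...) count as 0.
WEIGHTS = {"FFMA": 2, "FADD": 1, "FMUL": 1, "FMUL32I": 1}

# Floating-point lines of the loop body (0x01d8 .. 0x03b8) from the listing.
LOOP_SASS = """
/*01f0*/      FMUL R11, R16, R11;
/*01f8*/      FFMA R10, R15, R10, R11;
/*0208*/      FADD R22, R10, R10;
/*0210*/      FADD R10, R22, R22;
/*0218*/      F2F.F32.F32.TRUNC R29, R22;
/*0228*/      FSETP.NEU.AND P1, PT, R22, R29, PT;
/*0230*/      FFMA R10, -R11, 0.5, R22;
/*0248*/      FMUL32I R23, R10, 1.5099580252808664227e-07;
/*0250*/      FFMA R10, R10, c[0x2][0x0], R23;
/*0260*/      FMUL R24, R10, R10;
/*0268*/      FFMA R23, R24, c[0x2][0x4], R20;
/*0270*/      FFMA R27, R24, c[0x2][0x14], R19;
/*0288*/      FFMA R25, R23, R24, c[0x2][0x8];
/*0298*/      FFMA R27, R27, R24, c[0x2][0x18];
/*02a8*/      FFMA R25, R25, R24, c[0x2][0xc];
/*02b0*/      FFMA R27, R27, R24, RZ;
/*02b8*/      FFMA R23, R25, R24, c[0x2][0x10];
/*02c8*/      FFMA R24, R27, R10, R10;
/*0308*/      FMUL R26, -R28, 1;
/*0338*/ @!P1 FMUL R26, R22, RZ;
/*0360*/ @!P2 FFMA R24, R24, -1, RZ;
/*0370*/ @!P3 FADD R24, R26, 1;
/*0388*/      FMUL R23, R24, R10;
/*03a0*/      FFMA R10, -R26, R11, R23;
/*03b0*/      FFMA R21, R22, R10, R21;
"""

# Address comment, optional predicate guard (@P0, @!P1, ...), base opcode.
LINE = re.compile(r"/\*[0-9a-f]+\*/\s+(@!?P\d+)?\s*([A-Z][A-Z0-9]*)")


def tally(sass, weights=WEIGHTS):
    """Return (FLOPs from unconditional instructions, FLOPs from predicated ones)."""
    always, guarded = 0, 0
    for line in sass.splitlines():
        m = LINE.search(line)
        if not m:
            continue
        predicate, opcode = m.groups()
        flops = weights.get(opcode, 0)
        if predicate:
            guarded += flops
        else:
            always += flops
    return always, guarded


always, guarded = tally(LOOP_SASS)
print(f"unconditional: {always} FLOPs, predicated: up to {guarded} FLOPs per iteration")
```

On this listing the tally is 33 FLOPs per iteration from unconditional instructions plus up to 4 from the three predicated ones, which is where the 37 in the hand formula comes from. Dividing the nvprof total quoted above by 64*64 threads and 107494 iterations gives about 34 per iteration, which falls between the two bounds. That would be consistent with a profiler that skips instructions whose predicate is false, though this check alone does not prove it.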