Comparing nvbit and nvprof

When I compare the output of nvbit and nvprof, I see a large gap in their instruction statistics.
nvprof

$ nvprof --metrics inst_integer,inst_fp_32,inst_fp_16,inst_compute_ld_st,inst_bit_convert,inst_inter_thread_communication,inst_fp_64,inst_control,inst_misc ./test-apps/vectoradd/vectoradd
==8587== NVPROF is profiling process 8587, command: ./test-apps/vectoradd/vectoradd
Final sum = 100000.000000; sum/n = 1.000000 (should be ~1)
==8587== Profiling application: ./test-apps/vectoradd/vectoradd
==8587== Profiling result:
==8587== Metric result:
Invocations                               Metric Name                        Metric Description         Min         Max         Avg
Device "TITAN V (0)"
    Kernel: vecAdd(double*, double*, double*, int)
          1                              inst_integer                      Integer Instructions      500704      500704      500704
          1                                inst_fp_32                   FP Instructions(Single)           0           0           0
          1                                inst_fp_16                     HP Instructions(Half)           0           0           0
          1                        inst_compute_ld_st                   Load/Store Instructions      300000      300000      300000
          1                          inst_bit_convert                  Bit-Convert Instructions           0           0           0
          1           inst_inter_thread_communication                 Inter-Thread Instructions           0           0           0
          1                                inst_fp_64                   FP Instructions(Double)      100000      100000      100000
          1                              inst_control                 Control-Flow Instructions      100352      100352      100352
          1                                 inst_misc                         Misc Instructions      401056      401056      401056

nvbit

$ LD_PRELOAD=./tools/opcode_hist/opcode_hist.so ./test-apps/vectoradd/vectoradd
------------- NVBit (NVidia Binary Instrumentation Tool v1.1) Loaded --------------
NVBit core environment variables (mostly for nvbit-devs):
           NOINSPECT = 0 - if set, skips function inspection and instrumentation
            NVDISASM = nvdisasm - override default nvdisasm found in PATH
            NOBANNER = 0 - if set, does not print this banner
---------------------------------------------------------------------------------
         INSTR_BEGIN = 0 - Beginning of the instruction interval where to apply instrumentation
           INSTR_END = 4294967295 - End of the instruction interval where to apply instrumentation
        KERNEL_BEGIN = 0 - Beginning of the kernel launch interval where to apply instrumentation
          KERNEL_END = 4294967295 - End of the kernel launch interval where to apply instrumentation
        TOOL_VERBOSE = 0 - Enable verbosity inside the tool
    COUNT_WARP_LEVEL = 1 - Count warp level or thread level instructions
    EXCLUDE_PRED_OFF = 0 - Exclude predicated off instruction from count
----------------------------------------------------------------------------------------------------
kernel 0 - vecAdd(double*, double*, double*, int) - #thread-blocks 98,  kernel instructions 50077, total instructions 50077
  DADD = 3125
  EXIT = 6261
  IMAD = 3136
  IMAD.WIDE = 9375
  ISETP.GE.AND = 3136
  LDG.E.64.SYS = 6250
  MOV = 6261
  S2R = 6272
  SHFL.IDX = 3136
  STG.E.64.SYS = 3125
Final sum = 100000.000000; sum/n = 1.000000 (should be ~1)

The total number of instructions reported by nvbit is far lower than the counts reported by nvprof. Any idea why?

Instructions are issued warp-wide. Think about this in conjunction with what the NVBIT tool does (hook/intercept the instruction issue process).
Multiply the NVBIT numbers by 32.
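
To put numbers on that, here is a quick back-of-the-envelope check in Python (my own sketch, not output of either tool). The banner above shows COUNT_WARP_LEVEL = 1, so opcode_hist is counting one instruction per warp, while nvprof's inst_* metrics count per thread; scaling the nvbit total by the warp size puts the two tools in the same ballpark for vectoradd:

nvprof_per_thread = (
    500_704      # inst_integer
    + 300_000    # inst_compute_ld_st
    + 100_000    # inst_fp_64
    + 100_352    # inst_control
    + 401_056    # inst_misc
)                # = 1,402,112 (the fp32/fp16/bit-convert/inter-thread metrics are all 0)

nvbit_per_warp = 50_077          # "kernel instructions 50077" from opcode_hist
warp_size = 32

print(nvprof_per_thread)             # 1402112
print(nvbit_per_warp * warp_size)    # 1602464 -- same ballpark, though not identical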

For this example, you are right: multiplying the nvbit numbers by 32 brings them close to nvprof. However, it seems that some other factors matter as well.

I ran a graph workload (not the test app), and with nvbit a kernel named RelaxLightEdges runs 1459 times. Here are the stats for one of its invocations:

kernel 10346 - void gunrock::oprtr::LB::RelaxLightEdges....... - #thread-blocks 1,  kernel instructions 2736, total instructions 335140486
  ATOMG.E.MIN.STRONG.GPU = 1
  BAR.SYNC = 96
  BRA = 243
  BREAK = 64
  BSSY = 162
  BSYNC = 133
  CS2R = 32
  EXIT = 64
  FLO.U32 = 1
  IADD3 = 172
  IMAD = 47
  IMAD.IADD = 82
  IMAD.MOV.U32 = 313
  IMAD.SHL.U32 = 33
  IMAD.WIDE.U32 = 72
  IMAD.X = 1
  ISETP.EQ.OR = 1
  ISETP.EQ.U32.AND = 1
  ISETP.GE.U32.AND = 161
  ISETP.GT.U32.AND = 50
  ISETP.LT.U32.AND = 8
  ISETP.NE.AND = 98
  ISETP.NE.AND.EX = 2
  ISETP.NE.OR = 32
  ISETP.NE.U32.AND = 2
  LDC.U8 = 96
  LDG.E.CONSTANT.SYS = 1
  LDG.E.STRONG.GPU = 1
  LDG.E.SYS = 68
  LDS.U = 12
  LEA = 2
  LOP3.LUT = 132
  MOV = 43
  NOP = 96
  POPC = 1
  PRMT = 64
  RED.E.ADD.STRONG.GPU = 1
  S2R = 106
  SEL = 10
  SHFL.IDX = 96
  STG.E.STRONG.GPU = 3
  STS = 36
  VOTE.ALL = 1
  WARPSYNC = 64
  YIELD = 32

I also manually mapped the opcodes to the corresponding instruction classes (on Volta), which are summarized as:

  ld/st         219
  misc          331
  control       762
  int           1211
  mov           213
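
Here is a small Python sketch of one grouping that reproduces those class totals exactly (an illustration only; the class assignment below is keyed on the base opcode and is my reconstruction, not something either tool documents). Note that all IMAD.* variants, including IMAD.MOV.U32, land in the int class here, which is exactly the kind of classification question raised further down.

hist = {
    "ATOMG.E.MIN.STRONG.GPU": 1, "BAR.SYNC": 96, "BRA": 243, "BREAK": 64,
    "BSSY": 162, "BSYNC": 133, "CS2R": 32, "EXIT": 64, "FLO.U32": 1,
    "IADD3": 172, "IMAD": 47, "IMAD.IADD": 82, "IMAD.MOV.U32": 313,
    "IMAD.SHL.U32": 33, "IMAD.WIDE.U32": 72, "IMAD.X": 1,
    "ISETP.EQ.OR": 1, "ISETP.EQ.U32.AND": 1, "ISETP.GE.U32.AND": 161,
    "ISETP.GT.U32.AND": 50, "ISETP.LT.U32.AND": 8, "ISETP.NE.AND": 98,
    "ISETP.NE.AND.EX": 2, "ISETP.NE.OR": 32, "ISETP.NE.U32.AND": 2,
    "LDC.U8": 96, "LDG.E.CONSTANT.SYS": 1, "LDG.E.STRONG.GPU": 1,
    "LDG.E.SYS": 68, "LDS.U": 12, "LEA": 2, "LOP3.LUT": 132, "MOV": 43,
    "NOP": 96, "POPC": 1, "PRMT": 64, "RED.E.ADD.STRONG.GPU": 1,
    "S2R": 106, "SEL": 10, "SHFL.IDX": 96, "STG.E.STRONG.GPU": 3,
    "STS": 36, "VOTE.ALL": 1, "WARPSYNC": 64, "YIELD": 32,
}

# Assumed class assignment, keyed on the base opcode (everything before the first '.').
classes = {
    "ld/st":   {"ATOMG", "LDC", "LDG", "LDS", "RED", "STG", "STS"},
    "control": {"BRA", "BREAK", "BSSY", "BSYNC", "EXIT", "WARPSYNC", "YIELD"},
    "misc":    {"BAR", "CS2R", "NOP", "S2R", "VOTE"},
    "mov":     {"MOV", "PRMT", "SEL", "SHFL"},
    "int":     {"FLO", "IADD3", "IMAD", "ISETP", "LEA", "LOP3", "POPC"},
}

def classify(opcode):
    base = opcode.split(".")[0]
    return next((c for c, bases in classes.items() if base in bases), "other")

totals = {}
for op, n in hist.items():
    totals[classify(op)] = totals.get(classify(op), 0) + n

for cls, n in totals.items():
    print(f"{cls:8s} {n:5d}")      # ld/st 219, misc 331, control 762, int 1211, mov 213
print(f"{'total':8s} {sum(hist.values()):5d}")   # 2736, matches "kernel instructions 2736"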

Now looking at the output of nvprof for this kernel, we see

$ nvprof --kernels  RelaxLightEdges --metrics inst_integer,inst_control,inst_misc ../../../build/bin/bfs market belgium_osm.mtx
.
.
.
==11430== Metric result:
Invocations                               Metric Name                        Metric Description         Min         Max         Avg
Device "TITAN V (0)"
    Kernel: _ZN7gunrock5oprtr2LB15RelaxLightEdgesILj1ENS_5graph3CsrIjjjLj256ELj0ELb1EEEjjZNS_3app3bfs16BFSIterationLoopINS7_7EnactorINS7_7ProblemINS6_9TestGraphIjjjLj768ELj0EEEjjLj0EEELj0ELj0EEEE4CoreEiEUlRKjRjSH_SH_SH_SI_E_EEvbNT0_7VertexTESK_PKT1_NSK_5SizeTEPKSP_PT2_PNSK_6ValueTEPSP_NS_4util15CtaWorkProgressISP_EET3_
       1459                              inst_integer                      Integer Instructions       32372      697444      261807
       1459                              inst_control                 Control-Flow Instructions       17417      216045       85712
       1459                                 inst_misc                         Misc Instructions       13849      225733       86814

For example, the max inst_integer is 697,444 for 1459 invocations, so for one invocation it would be about 478.
Looking at nvbit, the int class is 1211; multiplying that by 32 gives 38,752, which is far beyond 478.

I think these two tools assume different things, but it is not clear what.
Maybe IMAD.IADD is counted as int while IMAD.MOV is not? I don’t know, though.
I didn’t find this in the documentation.

I’m not sure you understand how to use nvprof. That’s not how I interpret this output from nvprof:

==11430== Metric result:
Invocations                               Metric Name                        Metric Description         Min         Max         Avg
Device "TITAN V (0)"
    Kernel: _ZN7gunrock5oprtr2LB15RelaxLightEdgesILj1ENS_5graph3CsrIjjjLj256ELj0ELb1EEEjjZNS_3app3bfs16BFSIterationLoopINS7_7EnactorINS7_7ProblemINS6_9TestGraphIjjjLj768ELj0EEEjjLj0EEELj0ELj0EEEE4CoreEiEUlRKjRjSH_SH_SH_SI_E_EEvbNT0_7VertexTESK_PKT1_NSK_5SizeTEPKSP_PT2_PNSK_6ValueTEPSP_NS_4util15CtaWorkProgressISP_EET3_
       1459                              inst_integer                      Integer Instructions       32372      697444      261807

That says: “across all invocations of this kernel, the maximum count for that metric (for a particular invocation) was 697,444; the minimum count (for a particular invocation) was 32,372; and the average (which may not have been achieved by any particular invocation) was 261,807.”

If the nvbit measurement was for a single invocation, and the measurement number multiplied by 32 was 38,752, that is entirely plausible and within the range for that metric reported by nvprof.
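
Putting numbers on that interpretation (just arithmetic on the figures already quoted in this thread, sketched in Python):

invocations      = 1459
inst_integer_min = 32_372     # smallest single invocation
inst_integer_max = 697_444    # largest single invocation

# Dividing the per-invocation maximum by the number of invocations mixes the
# two views; the ~478 it produces is not a per-invocation count of anything.
print(inst_integer_max / invocations)          # ~478.0

# nvbit's warp-level "int" class for one launch, scaled to thread level,
# falls squarely inside nvprof's per-invocation [Min, Max] range:
nvbit_int_per_thread = 1_211 * 32              # 38,752
print(inst_integer_min <= nvbit_int_per_thread <= inst_integer_max)   # True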

According to

--print-gpu-trace
                        Print individual kernel invocations (including CUDA memcpy's/memset's)
                        and sort them in chronological order. In event/metric profiling
                        mode, show events/metrics for each kernel invocation.

and since I didn’t use that option, I guessed that I was seeing averages over all invocations.
Thanks for clarifying that.

Avg IS the average over all invocations.

Max is the max over all invocations; it is not the sum of all the invocations added together.