When I compare the output of NVBit and nvprof for the same application, I see a large gap between the instruction counts they report.
nvprof
$ nvprof --metrics inst_integer,inst_fp_32,inst_fp_16,inst_compute_ld_st,inst_bit_convert,inst_inter_thread_communication,inst_fp_64,inst_control,inst_misc ./test-apps/vectoradd/vectoradd
==8587== NVPROF is profiling process 8587, command: ./test-apps/vectoradd/vectoradd
Final sum = 100000.000000; sum/n = 1.000000 (should be ~1)
==8587== Profiling application: ./test-apps/vectoradd/vectoradd
==8587== Profiling result:
==8587== Metric result:
Invocations  Metric Name                      Metric Description            Min     Max     Avg
Device "TITAN V (0)"
    Kernel: vecAdd(double*, double*, double*, int)
          1  inst_integer                     Integer Instructions       500704  500704  500704
          1  inst_fp_32                       FP Instructions(Single)         0       0       0
          1  inst_fp_16                       HP Instructions(Half)           0       0       0
          1  inst_compute_ld_st               Load/Store Instructions    300000  300000  300000
          1  inst_bit_convert                 Bit-Convert Instructions        0       0       0
          1  inst_inter_thread_communication  Inter-Thread Instructions       0       0       0
          1  inst_fp_64                       FP Instructions(Double)    100000  100000  100000
          1  inst_control                     Control-Flow Instructions  100352  100352  100352
          1  inst_misc                        Misc Instructions          401056  401056  401056
nvbit
$ LD_PRELOAD=./tools/opcode_hist/opcode_hist.so ./test-apps/vectoradd/vectoradd
------------- NVBit (NVidia Binary Instrumentation Tool v1.1) Loaded --------------
NVBit core environment variables (mostly for nvbit-devs):
NOINSPECT = 0 - if set, skips function inspection and instrumentation
NVDISASM = nvdisasm - override default nvdisasm found in PATH
NOBANNER = 0 - if set, does not print this banner
---------------------------------------------------------------------------------
INSTR_BEGIN = 0 - Beginning of the instruction interval where to apply instrumentation
INSTR_END = 4294967295 - End of the instruction interval where to apply instrumentation
KERNEL_BEGIN = 0 - Beginning of the kernel launch interval where to apply instrumentation
KERNEL_END = 4294967295 - End of the kernel launch interval where to apply instrumentation
TOOL_VERBOSE = 0 - Enable verbosity inside the tool
COUNT_WARP_LEVEL = 1 - Count warp level or thread level instructions
EXCLUDE_PRED_OFF = 0 - Exclude predicated off instruction from count
----------------------------------------------------------------------------------------------------
kernel 0 - vecAdd(double*, double*, double*, int) - #thread-blocks 98, kernel instructions 50077, total instructions 50077
DADD = 3125
EXIT = 6261
IMAD = 3136
IMAD.WIDE = 9375
ISETP.GE.AND = 3136
LDG.E.64.SYS = 6250
MOV = 6261
S2R = 6272
SHFL.IDX = 3136
STG.E.64.SYS = 3125
Final sum = 100000.000000; sum/n = 1.000000 (should be ~1)
NVBit reports 50077 instructions in total, which is far fewer than the roughly 1.4 million obtained by summing the nvprof metrics above. Any idea why the two differ so much?
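
To put numbers on the gap, here is a rough sketch that sums both outputs. The counts are copied from the listings above. The warp size of 32, and the idea of scaling the NVBit counts by it, are assumptions of mine, prompted only by the `COUNT_WARP_LEVEL = 1` line in the NVBit banner. I have not confirmed how either tool actually counts.

```python
# Counts copied from the nvprof metric table above.
nvprof_metrics = {
    "inst_integer": 500704,
    "inst_fp_32": 0,
    "inst_fp_16": 0,
    "inst_compute_ld_st": 300000,
    "inst_bit_convert": 0,
    "inst_inter_thread_communication": 0,
    "inst_fp_64": 100000,
    "inst_control": 100352,
    "inst_misc": 401056,
}

# Counts copied from the NVBit opcode_hist output above.
nvbit_hist = {
    "DADD": 3125,
    "EXIT": 6261,
    "IMAD": 3136,
    "IMAD.WIDE": 9375,
    "ISETP.GE.AND": 3136,
    "LDG.E.64.SYS": 6250,
    "MOV": 6261,
    "S2R": 6272,
    "SHFL.IDX": 3136,
    "STG.E.64.SYS": 3125,
}

# ASSUMPTION: 32 threads per warp, all active. Not taken from either tool's output.
WARP_SIZE = 32


def summarize(nvprof, nvbit, warp_size=WARP_SIZE):
    """Return the totals, their ratio, and the NVBit total scaled by warp size."""
    nvprof_total = sum(nvprof.values())
    nvbit_total = sum(nvbit.values())
    return {
        "nvprof_total": nvprof_total,
        "nvbit_total": nvbit_total,
        "ratio": nvprof_total / nvbit_total,
        "nvbit_scaled": nvbit_total * warp_size,
    }


summary = summarize(nvprof_metrics, nvbit_hist)
print(summary)

# Per-category comparison under the same assumption.
dadd_scaled = nvbit_hist["DADD"] * WARP_SIZE
ldst_scaled = (nvbit_hist["LDG.E.64.SYS"] + nvbit_hist["STG.E.64.SYS"]) * WARP_SIZE
print("DADD x32:", dadd_scaled, "vs inst_fp_64:", nvprof_metrics["inst_fp_64"])
print("LDG+STG x32:", ldst_scaled, "vs inst_compute_ld_st:", nvprof_metrics["inst_compute_ld_st"])
```

Under that assumption, the scaled DADD and load/store counts line up with `inst_fp_64` and `inst_compute_ld_st`, but the scaled grand total does not equal the nvprof sum. So the warp-versus-thread counting mode may explain part of the gap, though not necessarily all of it.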