nvprof numbers are at thread level while nvbit by default reports at warp level.
For nvprof I see
Invocations   Metric Name    Metric Description          Min      Max      Avg
Device "TITAN V (0)"
  Kernel: void MatrixMulCUDA<int=32>(float*, float*, float*, int, int)
        301   inst_control   Control-Flow Instructions   2048000  2048000  2048000
To force nvbit to count at thread level as well, one has to export COUNT_WARP_LEVEL=0. So the command
COUNT_WARP_LEVEL=0 LD_PRELOAD=~/nvbit/nvbit_release/tools/opcode_hist/opcode_hist.so ./matrixMul
shows the following instruction counts for each invocation:
BAR.SYNC = 4096000
BRA = 2252800
EXIT = 204800
FFMA = 65536000
IADD3 = 8806400
IMAD = 1024000
IMAD.WIDE = 4300800
ISETP.GT.AND = 2252800
LDG.E.SYS = 4096000
LDS.U = 65536000
LDS.U.128 = 16384000
LEA = 204800
MOV = 3072000
NOP = 4096000
S2R = 1433600
SHF.L.U32 = 1228800
SHFL.IDX = 204800
STG.E.SYS = 204800
STS = 4096000
For Volta, according to [1], only EXIT and BRA are considered control-flow instructions. So nvprof reports 2,048,000 while nvbit's BRA + EXIT sum to 2,457,600. The difference basically comes down to how the nvbit and nvprof developers classify instructions, IMO.
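As a quick sanity check, you can redo the arithmetic from the counts above (a minimal sketch; the numbers are just the thread-level figures already listed):

```python
# Thread-level counts for Volta's control-flow opcodes, taken from
# the nvbit opcode histogram above.
nvbit_control_flow = {
    "BRA": 2_252_800,
    "EXIT": 204_800,
}

nvbit_total = sum(nvbit_control_flow.values())
nvprof_inst_control = 2_048_000  # nvprof's inst_control metric

print(nvbit_total)                        # 2457600
print(nvbit_total - nvprof_inst_control)  # 409600 more in nvbit's count
```

The gap (409,600) is what I attribute to differing instruction classification between the two tools.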
To reassure yourself, you can check other instruction types. For example:
nvprof: inst_fp_32 is 65,536,000
nvbit: FFMA (the only FP32 instruction here) is also 65,536,000
nvprof: inst_integer is 17,817,600
nvbit
IADD3 = 8806400
IMAD = 1024000
IMAD.WIDE = 4300800
ISETP.GT.AND = 2252800
LEA = 204800
SHF.L.U32 = 1228800
SHFL.IDX = 204800
=> 18,022,400
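The integer total can be reproduced the same way (again just summing the thread-level counts listed above):

```python
# Opcodes I classified as integer instructions, with their
# thread-level counts from the nvbit histogram.
nvbit_integer = {
    "IADD3": 8_806_400,
    "IMAD": 1_024_000,
    "IMAD.WIDE": 4_300_800,
    "ISETP.GT.AND": 2_252_800,
    "LEA": 204_800,
    "SHF.L.U32": 1_228_800,
    "SHFL.IDX": 204_800,
}

total = sum(nvbit_integer.values())
print(total)  # 18022400

nvprof_inst_integer = 17_817_600
# The difference equals the SHFL.IDX count exactly, which suggests
# nvprof simply does not classify SHFL as an integer instruction.
print(total - nvprof_inst_integer)  # 204800
```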
You may also want to ask the nvbit developers for a more precise definition of each category.
[1] CUDA Binary Utilities :: CUDA Toolkit Documentation