To find out exactly which nvprof instruction metrics are counted at warp level and which at thread level, I ran some tests on a simple kernel with a single invocation, so min/max/avg are identical. The results are shown below:

1. inst_executed = 1,464,512
2. inst_bit_convert = 2,097,152
3. inst_compute_ld_st = 1,086,464
4. inst_control = 4,228,096
5. inst_fp_32 = 1,048,576
6. inst_integer = 31,998,944
7. inst_inter_thread_communication = 0
8. inst_misc = 2,109,440
9. inst_per_warp = 4.5766e+04 (45,766)

As you can see, **inst_executed** is much less than sum(2…8), i.e. the sum of the second through eighth metrics in the list (inst_bit_convert through inst_misc). If we look at the nvbit output (warp level), we see:

- BAR.SYNC = 64
- BRA = 68640
- BSSY = 32800
- BSYNC = 32800
- CS2R = 65760
- EXIT = 32
- F2I.FTZ.U32.TRUNC.NTZ = 32768
- I2F.U32.RP = 32768
- IADD3 = 298144
- IADD3.X = 34976
- IMAD = 65536
- IMAD.MOV = 65536
- IMAD.MOV.U32 = 67680
- IMAD.SHL.U32 = 32
- IMAD.WIDE.U32 = 65536
- IMAD.X = 164864
- ISETP.GE.U32.AND = 100352
- ISETP.GE.U32.AND.EX = 34816
- ISETP.GT.U32.AND = 32
- ISETP.NE.AND.EX = 1024
- ISETP.NE.U32.AND = 33792
- LDG.E.64.STRONG.CTA = 33792
- LEA = 33792
- LEA.HI.X = 33792
- LOP3.LUT = 98304
- MUFU.RCP = 32768
- NOP = 64
- S2R = 32
- SHF.R.U32.HI = 32
- SHFL.IDX = 33824
- STG.E.64.SYS = 160

Summing those numbers yields exactly **inst_executed** (1,464,512), so this metric is counted at warp level, which sounds reasonable.
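The sum can be checked with a few lines of Python. This is only a sketch of the arithmetic: the counts are copied by hand from the nvbit output above, and the names `nvbit_counts` and `INST_EXECUTED` are mine, not part of nvbit or nvprof.

```python
# Warp-level opcode counts, copied by hand from the nvbit output above.
nvbit_counts = {
    "BAR.SYNC": 64,
    "BRA": 68640,
    "BSSY": 32800,
    "BSYNC": 32800,
    "CS2R": 65760,
    "EXIT": 32,
    "F2I.FTZ.U32.TRUNC.NTZ": 32768,
    "I2F.U32.RP": 32768,
    "IADD3": 298144,
    "IADD3.X": 34976,
    "IMAD": 65536,
    "IMAD.MOV": 65536,
    "IMAD.MOV.U32": 67680,
    "IMAD.SHL.U32": 32,
    "IMAD.WIDE.U32": 65536,
    "IMAD.X": 164864,
    "ISETP.GE.U32.AND": 100352,
    "ISETP.GE.U32.AND.EX": 34816,
    "ISETP.GT.U32.AND": 32,
    "ISETP.NE.AND.EX": 1024,
    "ISETP.NE.U32.AND": 33792,
    "LDG.E.64.STRONG.CTA": 33792,
    "LEA": 33792,
    "LEA.HI.X": 33792,
    "LOP3.LUT": 98304,
    "MUFU.RCP": 32768,
    "NOP": 64,
    "S2R": 32,
    "SHF.R.U32.HI": 32,
    "SHFL.IDX": 33824,
    "STG.E.64.SYS": 160,
}

# Value reported by nvprof for the inst_executed metric.
INST_EXECUTED = 1_464_512

# Total number of warp-level instructions seen by nvbit.
nvbit_total = sum(nvbit_counts.values())

print(f"nvbit total   = {nvbit_total:,}")
print(f"inst_executed = {INST_EXECUTED:,}")
print("match:", nvbit_total == INST_EXECUTED)
```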

If we multiply 1,464,512 by 32, we get 46,864,384, which roughly corresponds to sum(2…8). That suggests metrics 2…8 are counted at thread level. In fact, sum(2…8) is 42,568,672. This difference is not my main question, but I would appreciate it if someone could explain it.
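To make the size of that gap concrete, here is a minimal sketch of the comparison. The values are the nvprof numbers listed at the top; `WARP_SIZE`, `thread_level`, and the other names are just illustrative. The ratio printed at the end could be read as the average fraction of active threads per warp instruction, but only under the assumption that the thread-level counters skip inactive or predicated-off threads, which nothing in the profiler output shown here confirms.

```python
WARP_SIZE = 32              # threads per warp
INST_EXECUTED = 1_464_512   # nvprof inst_executed (warp level)

# The second through eighth metrics from the nvprof list.
thread_level = {
    "inst_bit_convert": 2_097_152,
    "inst_compute_ld_st": 1_086_464,
    "inst_control": 4_228_096,
    "inst_fp_32": 1_048_576,
    "inst_integer": 31_998_944,
    "inst_inter_thread_communication": 0,
    "inst_misc": 2_109_440,
}

thread_total = sum(thread_level.values())   # sum(2...8)
upper_bound = INST_EXECUTED * WARP_SIZE     # every warp instruction run by all 32 threads
gap = upper_bound - thread_total            # what is "missing" at thread level
ratio = thread_total / upper_bound          # share of the upper bound actually counted

print(f"sum(2..8)          = {thread_total:,}")
print(f"inst_executed * 32 = {upper_bound:,}")
print(f"gap                = {gap:,}")
print(f"ratio              = {ratio:.4f}")
```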

My question is: what exactly is **inst_per_warp**? The reported value is 45,766, which is exactly **inst_executed**/32, since 1,464,512/32 = 45,766.
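The division can be checked as below. One thing worth noting, hedged: the divisor 32 matches the warp size, but it also matches the warp-level EXIT count in the nvbit output. If each warp executes EXIT exactly once (an assumption on my part, not something I have verified for this kernel), that count would be the number of warps launched, so these numbers alone cannot tell "divided by warp size" apart from "divided by number of warps".

```python
INST_EXECUTED = 1_464_512   # nvprof inst_executed
INST_PER_WARP = 45_766      # nvprof inst_per_warp (reported as 4.5766e+04)
WARP_SIZE = 32              # threads per warp
EXIT_COUNT = 32             # warp-level EXIT count from the nvbit output

# The divisor that turns inst_executed into inst_per_warp.
divisor = INST_EXECUTED / INST_PER_WARP

print("divisor:", divisor)
print("equals warp size: ", divisor == WARP_SIZE)
print("equals EXIT count:", divisor == EXIT_COUNT)
```

Running the same kernel with a launch configuration whose warp count is not 32 would presumably separate the two readings.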