About the instructions-per-warp metric

To find out exactly which nvprof instruction metrics are counted at the warp level and which at the thread level, I ran some tests on a simple program with a single kernel and one invocation. The min/max/avg values are therefore identical, and the results are shown below.

  1. inst_executed = 1,464,512
  2. inst_bit_convert = 2,097,152
  3. inst_compute_ld_st = 1,086,464
  4. inst_control = 4,228,096
  5. inst_fp_32 = 1,048,576
  6. inst_integer = 31,998,944
  7. inst_inter_thread_communication = 0
  8. inst_misc = 2,109,440
  9. inst_per_warp = 4.5766e+04 (45,766)

As you can see, inst_executed is much smaller than sum(2…8). The warp-level output of nvbit shows:

BAR.SYNC = 64
BRA = 68640
BSSY = 32800
BSYNC = 32800
CS2R = 65760
EXIT = 32
F2I.FTZ.U32.TRUNC.NTZ = 32768
I2F.U32.RP = 32768
IADD3 = 298144
IADD3.X = 34976
IMAD = 65536
IMAD.MOV = 65536
IMAD.MOV.U32 = 67680
IMAD.SHL.U32 = 32
IMAD.WIDE.U32 = 65536
IMAD.X = 164864
ISETP.GE.U32.AND = 100352
ISETP.GE.U32.AND.EX = 34816
ISETP.GT.U32.AND = 32
ISETP.NE.AND.EX = 1024
ISETP.NE.U32.AND = 33792
LDG.E.64.STRONG.CTA = 33792
LEA = 33792
LEA.HI.X = 33792
LOP3.LUT = 98304
MUFU.RCP = 32768
NOP = 64
S2R = 32
SHF.R.U32.HI = 32
SHFL.IDX = 33824
STG.E.64.SYS = 160

Summing those numbers yields exactly inst_executed, so this metric is at the warp level. That sounds reasonable.
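The sum can be checked with a few lines of Python. This is only a sanity-check sketch: the dictionary is copied by hand from the nvbit output above, and the variable names are my own.

```python
# Warp-level opcode counts copied from the nvbit output above.
nvbit_counts = {
    "BAR.SYNC": 64,
    "BRA": 68640,
    "BSSY": 32800,
    "BSYNC": 32800,
    "CS2R": 65760,
    "EXIT": 32,
    "F2I.FTZ.U32.TRUNC.NTZ": 32768,
    "I2F.U32.RP": 32768,
    "IADD3": 298144,
    "IADD3.X": 34976,
    "IMAD": 65536,
    "IMAD.MOV": 65536,
    "IMAD.MOV.U32": 67680,
    "IMAD.SHL.U32": 32,
    "IMAD.WIDE.U32": 65536,
    "IMAD.X": 164864,
    "ISETP.GE.U32.AND": 100352,
    "ISETP.GE.U32.AND.EX": 34816,
    "ISETP.GT.U32.AND": 32,
    "ISETP.NE.AND.EX": 1024,
    "ISETP.NE.U32.AND": 33792,
    "LDG.E.64.STRONG.CTA": 33792,
    "LEA": 33792,
    "LEA.HI.X": 33792,
    "LOP3.LUT": 98304,
    "MUFU.RCP": 32768,
    "NOP": 64,
    "S2R": 32,
    "SHF.R.U32.HI": 32,
    "SHFL.IDX": 33824,
    "STG.E.64.SYS": 160,
}

inst_executed = 1_464_512  # value reported by nvprof

nvbit_total = sum(nvbit_counts.values())
print(nvbit_total, nvbit_total == inst_executed)
```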

If we multiply 1,464,512 by 32 we get 46,864,384, which roughly corresponds to sum(2…8). That suggests metrics 2…8 are at the thread level. The actual sum(2…8) is 42,568,672. This difference is not my main question, but I would appreciate it if someone could explain it.
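The arithmetic can be reproduced as below. One possible reading of the gap, which is not confirmed anywhere in this thread, is that some warp instructions are issued with fewer than 32 active (predicated-true) threads, so the thread-level total falls short of inst_executed × 32. Under that assumption, and assuming metrics 2…8 together cover every executed instruction, the ratio gives the average number of active threads per warp instruction.

```python
# Thread-level metric values copied from the nvprof results above.
thread_level = {
    "inst_bit_convert": 2_097_152,
    "inst_compute_ld_st": 1_086_464,
    "inst_control": 4_228_096,
    "inst_fp_32": 1_048_576,
    "inst_integer": 31_998_944,
    "inst_inter_thread_communication": 0,
    "inst_misc": 2_109_440,
}
inst_executed = 1_464_512  # warp-level count
WARP_SIZE = 32

thread_total = sum(thread_level.values())
upper_bound = inst_executed * WARP_SIZE  # if all 32 threads were always active

# Assumption: the shortfall is due to inactive or predicated-off threads.
avg_active_threads = thread_total / inst_executed

print(thread_total, upper_bound, round(avg_active_threads, 2))
```

With these numbers the ratio is about 29 threads per warp instruction rather than 32.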

My question is: what exactly is inst_per_warp? The value is 45,766, which is inst_executed/32, i.e. 1,464,512/32 = 45,766.

METRIC                              UNITS                                           DESCRIPTION
inst_executed                       warp instructions executed                      The number of instructions executed
inst_bit_convert                    predicated true thread instructions executed    Number of bit-conversion instructions executed by non-predicated threads
inst_compute_ld_st                  predicated true thread instructions executed    Number of compute load/store instructions executed by non-predicated threads
inst_control                        predicated true thread instructions executed    Number of control-flow instructions executed by non-predicated threads (jump, branch, etc.)
inst_fp_32                          predicated true thread instructions executed    Number of single-precision floating-point instructions executed by non-predicated threads (arithmetic, compare, etc.)
inst_integer                        predicated true thread instructions executed    Number of integer instructions executed by non-predicated threads
inst_inter_thread_communication     predicated true thread instructions executed    Number of inter-thread communication instructions executed by non-predicated threads
inst_misc                           predicated true thread instructions executed    Number of miscellaneous instructions executed by non-predicated threads
inst_per_warp                       inst_executed / warps_launched                  Average number of instructions executed by each warp

Descriptions are from nvprof --query-metrics on a GV100.
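As a quick illustration of the inst_per_warp formula in the table, here is a small sketch. The value of warps_launched is an assumption: it was not reported directly, but 32 is consistent with the EXIT count of 32 in the nvbit output and with the ratio observed in the question.

```python
inst_executed = 1_464_512  # warp instructions executed (from nvprof)
warps_launched = 32        # assumption: inferred from EXIT = 32, not reported

# inst_per_warp = inst_executed / warps_launched
inst_per_warp = inst_executed / warps_launched
print(inst_per_warp)  # 45766.0
```

If that assumption holds, the divisor of 32 is the number of warps launched, and it only happens to equal the warp size for this kernel.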

inst_executed and inst_per_warp are collected using the HWPM performance monitor. The inst_{type} metrics are collected by patching the source code.

Newer versions of CUPTI use the Perfworks library to collect metrics. Perfworks metrics have a consistent naming scheme, and each metric has a unit/dimension attribute to help avoid this kind of confusion.


Excuse me, my help output does not have three columns. Is that from nvprof 10.2, or do I have to do something else?

$ nvprof --query-metrics | grep inst_per_warp
                   inst_per_warp:  Average number of instructions executed by each warp
$ nvprof --version
nvprof: NVIDIA (R) Cuda command line profiler
Copyright (c) 2012 - 2019 NVIDIA Corporation
Release version 10.1.168 (21)

One more question. I see inst_executed in both the metrics and the events. However, they yield different results.

$ nvprof --kernels sgemm --events inst_executed --metrics inst_executed  ./mmul_1 2048
==24964== NVPROF is profiling process 24964, command: ./mmul_1 2048
A =
B =
C =
==24964== Profiling application: ./mmul_1 2048
==24964== Profiling result:
==24964== Event result:
Invocations                                Event Name         Min         Max         Avg       Total
Device "TITAN V (0)"
    Kernel: volta_sgemm_128x32_nn
         20                             inst_executed   373071872   373071872   373071872  7461437440

==24964== Metric result:
Invocations                               Metric Name                        Metric Description         Min         Max         Avg
Device "TITAN V (0)"
    Kernel: volta_sgemm_128x32_nn
         20                             inst_executed                     Instructions Executed   373071872   373071872   373071872

Which one is reliable, then? I didn't find a clear explanation of the difference between a metric and an event in the nvprof manual. Maybe it is somewhere else. If you know where it is explained, please let me know.

Unit Column

I added the units column myself. The CUPTI API does not provide this information per event/metric, so you have to determine the units/dimensions from the brief description, a past GTC presentation, or the forums.

Events vs. Metrics

Events are raw hardware counters.
Metrics are equations based upon 1 or more events.

I apologize, but I do not see a difference in the values reported above. I expect the values to be the same, since the metric should be calculated from the event. If you ran the application at two different times, the results might differ if the kernel is not deterministic (e.g. a polling loop in the code). inst_executed can be collected from all SM sub-partitions in a single pass, so the value should be consistent.


I am sorry. Because the text wrapped in the terminal output, I missed that the event statistics have a fourth column, Total. The difference between that column and the Avg column of the metric misled me.

Thank you very much.
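For anyone reading later, the relationship between the columns in the nvprof output above can be checked as follows. This is just a sketch using the figures from that output; the variable names are my own.

```python
invocations = 20                   # "Invocations" column
avg_per_invocation = 373_071_872   # Min = Max = Avg for event and metric

# The event table's Total column is the sum over all invocations.
event_total = invocations * avg_per_invocation
print(event_total)  # 7461437440
```

So the event and the metric report the same per-invocation value, and only the event table adds the sum over the 20 invocations.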