About instruction per warp metric

mahmood.nt · March 23, 2020, 9:38am

To find out exactly which nvprof instruction metrics are at warp level and which are at thread level, I ran some tests on a simple program with a single kernel and one invocation. Min/max/avg are therefore the same, and the results are shown below.

inst_executed = 1,464,512
inst_bit_convert = 2,097,152
inst_compute_ld_st = 1,086,464
inst_control = 4,228,096
inst_fp_32 = 1,048,576
inst_integer = 31,998,944
inst_inter_thread_communication = 0
inst_misc = 2,109,440
inst_per_warp = 4.5766e+04 (45,766)

As you can see, inst_executed is much less than sum(2…8), i.e. the sum of the second through eighth metrics in the list above. If we look at the output of nvbit (warp level), we see

BAR.SYNC = 64
BRA = 68640
BSSY = 32800
BSYNC = 32800
CS2R = 65760
EXIT = 32
F2I.FTZ.U32.TRUNC.NTZ = 32768
I2F.U32.RP = 32768
IADD3 = 298144
IADD3.X = 34976
IMAD = 65536
IMAD.MOV = 65536
IMAD.MOV.U32 = 67680
IMAD.SHL.U32 = 32
IMAD.WIDE.U32 = 65536
IMAD.X = 164864
ISETP.GE.U32.AND = 100352
ISETP.GE.U32.AND.EX = 34816
ISETP.GT.U32.AND = 32
ISETP.NE.AND.EX = 1024
ISETP.NE.U32.AND = 33792
LDG.E.64.STRONG.CTA = 33792
LEA = 33792
LEA.HI.X = 33792
LOP3.LUT = 98304
MUFU.RCP = 32768
NOP = 64
S2R = 32
SHF.R.U32.HI = 32
SHFL.IDX = 33824
STG.E.64.SYS = 160

Summing up those numbers yields inst_executed. So, this metric is at warp level. Sounds reasonable.
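
The sum can be reproduced with a few lines of Python. This is only a sketch: the dictionary copies the nvbit counts listed in this post, and the variable names are made up for illustration.

```python
# Warp-level opcode counts copied from the nvbit output in this post.
nvbit_counts = {
    "BAR.SYNC": 64, "BRA": 68640, "BSSY": 32800, "BSYNC": 32800,
    "CS2R": 65760, "EXIT": 32, "F2I.FTZ.U32.TRUNC.NTZ": 32768,
    "I2F.U32.RP": 32768, "IADD3": 298144, "IADD3.X": 34976,
    "IMAD": 65536, "IMAD.MOV": 65536, "IMAD.MOV.U32": 67680,
    "IMAD.SHL.U32": 32, "IMAD.WIDE.U32": 65536, "IMAD.X": 164864,
    "ISETP.GE.U32.AND": 100352, "ISETP.GE.U32.AND.EX": 34816,
    "ISETP.GT.U32.AND": 32, "ISETP.NE.AND.EX": 1024,
    "ISETP.NE.U32.AND": 33792, "LDG.E.64.STRONG.CTA": 33792,
    "LEA": 33792, "LEA.HI.X": 33792, "LOP3.LUT": 98304,
    "MUFU.RCP": 32768, "NOP": 64, "S2R": 32, "SHF.R.U32.HI": 32,
    "SHFL.IDX": 33824, "STG.E.64.SYS": 160,
}

# Value nvprof reported for the inst_executed metric.
inst_executed = 1_464_512

# Add up every per-opcode warp-level count and compare with the metric.
nvbit_total = sum(nvbit_counts.values())
print(nvbit_total, nvbit_total == inst_executed)
```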

If we multiply 1,464,512 by 32 we get 46,864,384, which roughly corresponds to sum(2…8). That means 2…8 are at thread level. In fact, sum(2…8) is 42,568,672. While this difference is not the main question, I would appreciate it if someone could explain it.
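
The arithmetic in this paragraph, as a small Python sketch. The numbers are the ones posted above and the names are illustrative. Reading the final ratio as an average number of active threads per warp instruction is an interpretation, not something nvprof reports.

```python
WARP_SIZE = 32  # threads per warp

inst_executed = 1_464_512  # warp-level count from nvprof

# The per-class metrics (items 2..8 in the list above).
thread_level = {
    "inst_bit_convert": 2_097_152,
    "inst_compute_ld_st": 1_086_464,
    "inst_control": 4_228_096,
    "inst_fp_32": 1_048_576,
    "inst_integer": 31_998_944,
    "inst_inter_thread_communication": 0,
    "inst_misc": 2_109_440,
}

thread_total = sum(thread_level.values())

# Upper bound if all 32 threads executed every warp instruction.
upper_bound = inst_executed * WARP_SIZE

# Gap between that bound and what the per-class metrics add up to.
shortfall = upper_bound - thread_total

# Assumed interpretation: average threads counted per warp instruction.
avg_threads_per_inst = thread_total / inst_executed

print(thread_total, upper_bound, shortfall)
print(round(avg_threads_per_inst, 2))
```

A ratio below 32 would be consistent with some threads being inactive or predicated off for some instructions, since a thread-level count would not include them. That is a guess from the numbers alone, not a confirmed explanation.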

My question is: what exactly is inst_per_warp? The reported number is 45,766. This is actually inst_executed/32, since 1,464,512/32 = 45,766.

Greg · March 24, 2020, 12:33am

METRIC                              UNITS                                           DESCRIPTION
inst_executed                       warp instructions executed                      The number of instructions executed
inst_bit_convert                    predicated true thread instructions executed    Number of bit-conversion instructions executed by non-predicated threads
inst_compute_ld_st                  predicated true thread instructions executed    Number of compute load/store instructions executed by non-predicated threads
inst_control                        predicated true thread instructions executed    Number of control-flow instructions executed by non-predicated threads (jump, branch, etc.)
inst_fp_32                          predicated true thread instructions executed    Number of single-precision floating-point instructions executed by non-predicated threads (arithmetic, compare, etc.)
inst_integer                        predicated true thread instructions executed    Number of integer instructions executed by non-predicated threads
inst_inter_thread_communication     predicated true thread instructions executed    Number of inter-thread communication instructions executed by non-predicated threads
inst_misc                           predicated true thread instructions executed    Number of miscellaneous instructions executed by non-predicated threads
inst_per_warp                       inst_executed / warps_launched                  Average number of instructions executed by each warp

Descriptions are from nvprof --query-metrics on a GV100.

inst_executed and inst_per_warp are collected using HWPM performance monitor. inst_{type} are collected by patching the source code.
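
Plugging the numbers from the first post into the inst_per_warp definition gives a quick consistency check. This Python sketch assumes the posted values are exact. warps_launched was not posted in the thread, so it is inferred here rather than measured, and the idea that each warp executes EXIT exactly once is also an assumption.

```python
inst_executed = 1_464_512  # warp instructions executed (nvprof)
inst_per_warp = 45_766     # reported by nvprof as 4.5766e+04

# inst_per_warp = inst_executed / warps_launched, so solve for warps_launched.
warps_launched = inst_executed / inst_per_warp
print(warps_launched)

# Assumption: every warp executes EXIT exactly once, so the nvbit EXIT
# count (32 in the first post) should match the number of warps launched.
exit_count = 32
print(warps_launched == exit_count)
```

If that holds, the division by 32 seen in the question is a division by the number of warps launched, which in this run happens to equal the warp size.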

The newer version of CUPTI uses the Perfworks library to collect metrics. The Perfworks metrics have a consistent naming scheme, and each metric has a unit/dimension attribute to help avoid this type of confusion.

mahmood.nt · March 24, 2020, 8:43am

Excuse me, my help output does not have three columns. Is that from nvprof 10.2, or do I have to do something else?

$ nvprof --query-metrics | grep inst_per_warp
                   inst_per_warp:  Average number of instructions executed by each warp
$ nvprof --version
nvprof: NVIDIA (R) Cuda command line profiler
Copyright (c) 2012 - 2019 NVIDIA Corporation
Release version 10.1.168 (21)

mahmood.nt · March 24, 2020, 6:58pm

One more question. I see inst_executed in both metrics and events. However, they yield different results.

$ nvprof --kernels sgemm --events inst_executed --metrics inst_executed  ./mmul_1 2048
==24964== NVPROF is profiling process 24964, command: ./mmul_1 2048
A =
B =
C =
==24964== Profiling application: ./mmul_1 2048
==24964== Profiling result:
==24964== Event result:
Invocations                                Event Name         Min         Max         Avg       Total
Device "TITAN V (0)"
    Kernel: volta_sgemm_128x32_nn
         20                             inst_executed   373071872   373071872   373071872  7461437440

==24964== Metric result:
Invocations                               Metric Name                        Metric Description         Min         Max         Avg
Device "TITAN V (0)"
    Kernel: volta_sgemm_128x32_nn
         20                             inst_executed                     Instructions Executed   373071872   373071872   373071872

Which one is reliable then? I didn’t find a clear explanation of the difference between metrics and events in the nvprof manual. Maybe it is somewhere else. If you know where it is explained, please let me know.

Greg · March 24, 2020, 8:14pm

Unit Column

I added the units column myself. The CUPTI API does not provide this information per event/metric, so you have to determine the units/dimensions from the brief description, a past GTC presentation, or the forums.

Events vs. Metrics

Events are raw hardware counters.
Metrics are equations based upon 1 or more events.

I apologize, but I do not see a difference in the values reported above. I expect the values to be the same, as the metric should be calculated from the event. If you were to run the application two different times, the results may differ if the kernel is not deterministic (e.g. a polling loop in the code). inst_executed can be collected from all SM sub-partitions in a single pass, so the value should be consistent.

mahmood.nt · March 24, 2020, 8:24pm

I am sorry. As the text wrapped in the terminal output, I missed that the event stats have a fourth column, Total. The difference between that and the Avg column in the metric output misled me.
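
For anyone who runs into the same thing, the relationship between the two tables can be checked in a few lines of Python. This is a sketch using the numbers from the nvprof output above; the variable names are illustrative.

```python
invocations = 20
avg_per_invocation = 373_071_872  # Min = Max = Avg in both tables
event_total = 7_461_437_440       # Total column of the event table

# Total is the sum over all invocations, so with identical invocations
# it is simply Avg * Invocations.
print(avg_per_invocation * invocations == event_total)
```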

Thank you very much.
