To find out exactly which nvprof instruction metrics are counted at warp level and which at thread level, I ran some tests on a simple program with a single kernel that is invoked once. The min/max/avg values are therefore identical, and the results are shown below.
1. inst_executed = 1,464,512
2. inst_bit_convert = 2,097,152
3. inst_compute_ld_st = 1,086,464
4. inst_control = 4,228,096
5. inst_fp_32 = 1,048,576
6. inst_integer = 31,998,944
7. inst_inter_thread_communication = 0
8. inst_misc = 2,109,440
9. inst_per_warp = 4.5766e+04 (45,766)
As you can see, inst_executed is much less than sum(2…8). If we look at the output of nvbit (warp level), we see
Summing up those numbers yields inst_executed. So this metric is at warp level, which sounds reasonable.
If we multiply 1,464,512 by 32, we get 46,864,384, which roughly corresponds to sum(2…8). That means metrics 2…8 are at thread level. In fact, sum(2…8) is 42,568,672. This difference is not my main question, but I would appreciate it if someone could explain it.
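The arithmetic above can be re-checked with a few lines of Python. This is only a sketch using the numbers reported in this post. The last value rests on an assumption that the thread does not confirm: that the seven inst_{type} classes together cover every executed instruction. Under that assumption, the ratio would be the average number of active, predicated-true threads per warp instruction, which is one possible reading of why the thread-level sum falls short of 32 × inst_executed.

```python
# Thread-level metric values copied from the nvprof output above.
thread_level = {
    "inst_bit_convert": 2_097_152,
    "inst_compute_ld_st": 1_086_464,
    "inst_control": 4_228_096,
    "inst_fp_32": 1_048_576,
    "inst_integer": 31_998_944,
    "inst_inter_thread_communication": 0,
    "inst_misc": 2_109_440,
}
inst_executed = 1_464_512  # warp-level count
WARP_SIZE = 32

# Sum of the thread-level metrics (items 2 to 8).
thread_total = sum(thread_level.values())

# Upper bound if every warp instruction ran with all 32 threads active.
upper_bound = inst_executed * WARP_SIZE

# Assumption: the seven classes cover all executed instructions, so this
# ratio approximates the average active threads per warp instruction.
avg_active_threads = thread_total / inst_executed

print(thread_total)                  # 42568672
print(upper_bound)                   # 46864384
print(round(avg_active_threads, 2))  # 29.07
```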
My question is: what exactly is inst_per_warp? The number is 45,766. This is actually inst_executed/32, since 1,464,512/32 = 45,766.
| METRIC | UNITS | DESCRIPTION |
|---|---|---|
| inst_executed | warp instructions executed | The number of instructions executed |
| inst_bit_convert | predicated true thread instructions executed | Number of bit-conversion instructions executed by non-predicated threads |
| inst_compute_ld_st | predicated true thread instructions executed | Number of compute load/store instructions executed by non-predicated threads |
| inst_control | predicated true thread instructions executed | Number of control-flow instructions executed by non-predicated threads (jump, branch, etc.) |
| inst_fp_32 | predicated true thread instructions executed | Number of single-precision floating-point instructions executed by non-predicated threads (arithmetic, compare, etc.) |
| inst_integer | predicated true thread instructions executed | Number of integer instructions executed by non-predicated threads |
| inst_inter_thread_communication | predicated true thread instructions executed | Number of inter-thread communication instructions executed by non-predicated threads |
| inst_misc | predicated true thread instructions executed | Number of miscellaneous instructions executed by non-predicated threads |
| inst_per_warp | inst_executed / warps_launched | Average number of instructions executed by each warp |
Descriptions are from nvprof --query-metrics on a GV100.
inst_executed and inst_per_warp are collected using the HWPM performance monitor. The inst_{type} metrics are collected by patching the source code.
Newer versions of CUPTI use the Perfworks library to collect metrics. Perfworks metrics have a consistent naming scheme, and each metric has a unit/dimension attribute to help avoid this type of confusion.
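Given the definition inst_per_warp = inst_executed / warps_launched, the number of warps launched in the test above can be inferred from the two reported values. The sketch below does that. Note that warps_launched itself is not reported anywhere in this thread, so the result is an inference, and the fact that it equals the warp size (32) appears to be a coincidence of this particular launch configuration rather than part of the metric's definition.

```python
# Values reported by nvprof in the original post.
inst_executed = 1_464_512   # warp-level instruction count
inst_per_warp = 45_766      # printed by nvprof as 4.5766e+04

# inst_per_warp = inst_executed / warps_launched, rearranged:
warps_launched = inst_executed / inst_per_warp

print(warps_launched)  # 32.0
```

If the kernel had launched a different number of warps, dividing inst_executed by 32 would no longer reproduce inst_per_warp.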
Excuse me, but my help output does not have three columns. Is yours from nvprof 10.2, or do I have to do something else?
$ nvprof --query-metrics | grep inst_per_warp
inst_per_warp: Average number of instructions executed by each warp
$ nvprof --version
nvprof: NVIDIA (R) Cuda command line profiler
Copyright (c) 2012 - 2019 NVIDIA Corporation
Release version 10.1.168 (21)
One more question: I see inst_executed in both metrics and events. However, they yield different results.
$ nvprof --kernels sgemm --events inst_executed --metrics inst_executed ./mmul_1 2048
==24964== NVPROF is profiling process 24964, command: ./mmul_1 2048
A =
B =
C =
==24964== Profiling application: ./mmul_1 2048
==24964== Profiling result:
==24964== Event result:
Invocations Event Name Min Max Avg Total
Device "TITAN V (0)"
Kernel: volta_sgemm_128x32_nn
20 inst_executed 373071872 373071872 373071872 7461437440
==24964== Metric result:
Invocations Metric Name Metric Description Min Max Avg
Device "TITAN V (0)"
Kernel: volta_sgemm_128x32_nn
20 inst_executed Instructions Executed 373071872 373071872 373071872
Which one is reliable, then? I didn't find a clear explanation of the difference between metrics and events in the nvprof manual. Maybe it is documented somewhere else. If you know where it is explained, please let me know.
I added the units column myself. The CUPTI API does not provide this information per event/metric, so you have to determine the units/dimensions from the brief description, a past GTC presentation, or the forums.
Events vs. Metrics
Events are raw hardware counters.
Metrics are equations based upon 1 or more events.
I apologize, but I do not see a difference in the values reported above. I expect the values to be the same, as the metric should be calculated from the event. If you were to run the application two separate times, the results may differ if the kernel is not deterministic (e.g. a polling loop in the code). inst_executed can be collected from all SM sub-partitions in a single pass, so the value should be consistent.
I am sorry. Because the text wrapped in the terminal output, I missed that the event statistics have a fourth column, Total. The difference between that column and the Avg column of the metric misled me.
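For anyone who runs into the same confusion, the relationship between the columns can be checked with the numbers from the nvprof output above. This is a small sketch, assuming only that the event's Total column is the sum over all invocations of the kernel.

```python
# Values copied from the nvprof output for volta_sgemm_128x32_nn.
invocations = 20
event_avg = 373_071_872      # Event result: Min = Max = Avg
event_total = 7_461_437_440  # Event result: Total
metric_avg = 373_071_872     # Metric result: Min = Max = Avg

# Total is the per-invocation value summed over all 20 invocations.
total_matches = event_avg * invocations == event_total

# Per invocation, the event and the metric report the same value.
per_invocation_matches = event_avg == metric_avg

print(total_matches)           # True
print(per_invocation_matches)  # True
```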