Inst_executed and thread_inst_executed

I would like to know the difference between the sm__inst_executed.sum and smsp__thread_inst_executed.sum metrics. Maybe the first one is at the warp level. Am I right? Please let me know if there is more information.

Yes, that’s correct: the difference is counting per warp versus per thread. You can find the description for each metric using the ncu --query-metrics flag on the command line. The UI also shows the descriptions as a tooltip when hovering over a metric.

smsp__inst_executed # of warp instructions executed
smsp__thread_inst_executed # of thread instructions executed
smsp__thread_inst_executed_pred_on # of thread instructions executed where guard predicate was true
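
To make the warp-vs-thread distinction concrete, here is a minimal Python sketch (an illustrative model, not Nsight Compute internals): for each warp-level instruction issued, inst_executed increments by one, while thread_inst_executed increments once per thread that is active for that instruction.

```python
# Illustrative model of the two counters for a single warp instruction.
def count_warp_instruction(active_mask):
    """active_mask: 32-bit mask of threads active for this instruction."""
    inst_executed = 1                                   # one warp-level instruction
    thread_inst_executed = bin(active_mask).count("1")  # one per active thread
    return inst_executed, thread_inst_executed

# A fully converged warp: 32 active threads.
assert count_warp_instruction(0xFFFFFFFF) == (1, 32)
# A divergent warp where only the low 8 threads take this path.
assert count_warp_instruction(0x000000FF) == (1, 8)
```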

The difference between sm__ and smsp__ metrics is whether they are collected at the SM level or at the SMSP level.



In a sample run I see

smsp__inst_executed.sum                             inst        1107736.000000  1107736.000000  1107736.000000
smsp__thread_inst_executed.sum                      inst        31322624.000000 31322624.000000 31322624.000000
smsp__thread_inst_executed_pred_on.sum              inst        30651904.000000 30651904.000000 30651904.000000

It seems that 31,322,624 - 1,107,736 = 30,214,888, which is close to the pred_on stat. So I guess thread_inst is the sum of pred_on and inst_executed.

Is that right in Nsight Compute terminology?

If we divide thread_inst by 32, I guess it should be the same as inst_executed, shouldn’t it? But I see

31,322,624/32 = 978,832

So there is a difference between that and 1,107,736, and the difference is not small.

Any idea about that?

An SM has 4 sub-partitions (SMSPs). Each sub-partition has a warp scheduler, a register file, warp slots, a dispatch unit, and multiple execution units (ALU, FMA, FP16x2, Tensor Cores, etc.). Metrics starting with smsp__ refer to counters collected at the SM sub-partition level.

The value of sm__inst_executed.sum == smsp__inst_executed.sum
The value of sm__inst_executed.avg == smsp__inst_executed.avg x 4
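
These rollup relationships can be sketched in Python (the unit counts and per-SMSP values below are made-up illustrations; only the 4-SMSPs-per-SM structure comes from the discussion above): .sum adds the same events regardless of grouping, while .avg divides by the number of instances at that level.

```python
# Sketch of .sum vs .avg rollups, assuming num_sm SMs with 4 SMSPs each.
num_sm = 2
per_smsp = [100, 120, 90, 110,   # SM 0, sub-partitions 0-3 (illustrative values)
            80, 105, 95, 100]    # SM 1, sub-partitions 0-3

smsp_sum = sum(per_smsp)
smsp_avg = smsp_sum / (num_sm * 4)   # averaged over SMSP instances
sm_sum = smsp_sum                    # same events, coarser grouping
sm_avg = sm_sum / num_sm             # averaged over SM instances

assert sm_sum == smsp_sum
assert sm_avg == smsp_avg * 4
```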

felix_dt has provided the descriptions of the counters.

A warp may have 1 to MAX_THREADS_PER_WARP active threads per cycle. For compute capabilities from Tesla through Ampere, MAX_THREADS_PER_WARP == 32. The number of active threads can be less than MAX_THREADS_PER_WARP if:

  1. The kernel was launched such that the number of threads per block is not a multiple of 32. (e.g. kernel<<<1, 1>>>() launches 1 block containing 1 thread; the warp for block 0 will have only 1 active thread.)
  2. A thread in the warp exits.
  3. There is thread divergence (e.g. a subset of threads take a branch).
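
Case 1 can be sketched with a small helper (hypothetical, for illustration only) that computes how many threads are active in each warp of a block:

```python
# Sketch: active threads per warp for a given block size, which need not
# be a multiple of 32 (e.g. kernel<<<1, 1>>>() gives a 1-thread warp).
WARP_SIZE = 32

def active_threads_per_warp(threads_per_block):
    warps = (threads_per_block + WARP_SIZE - 1) // WARP_SIZE
    return [min(WARP_SIZE, threads_per_block - w * WARP_SIZE)
            for w in range(warps)]

assert active_threads_per_warp(1) == [1]          # kernel<<<1, 1>>>()
assert active_threads_per_warp(64) == [32, 32]    # fully populated warps
assert active_threads_per_warp(50) == [32, 18]    # last warp partially active
```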

smsp__thread_inst_executed.sum <= smsp__inst_executed.sum x MAX_THREADS_PER_WARP.

A value less than this means one of the aforementioned cases occurred.
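
Applying this to the sample run above shows why dividing thread_inst by 32 does not recover inst_executed: the average number of active threads per warp instruction in that run was about 28.3, not 32 (a quick arithmetic check):

```python
# Sample-run numbers from the output posted earlier in the thread.
inst_executed = 1_107_736
thread_inst = 31_322_624

avg_active = thread_inst / inst_executed   # average active threads per warp instruction

assert thread_inst <= inst_executed * 32   # the relationship above holds
assert round(avg_active, 1) == 28.3        # so thread_inst / 32 < inst_executed
```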

A warp may have 1 to ACTIVE_THREADS predicated-true threads per cycle. The NVIDIA GPU ISA supports per-instruction guard predicates that allow individual threads to not participate in an instruction without the overhead of a divergent branch. The compiler often uses thread predication for short sequences of divergent code.

In Nsight Compute the Source View can show these counter values per SASS instruction, or rolled up to the higher-level language. At the SASS level each instruction can have a guard predicate of the form @[!]P#, where # is a predicate register number (PT is the predicate that is always true).

@P0 IADD R0, R1, R2 // if P0 is true then the thread will write-back the result; otherwise, the result is not written back

The @P0 is the guard predicate. Each thread can have a separate true/false value for it. For more information on guard predicates see the PTX ISA section Instruction Statements.
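
Extending the per-warp counting model with a guard predicate (again an illustrative sketch, not Nsight Compute internals): pred_on counts only those threads that were both active and had a true predicate.

```python
# Illustrative model: one warp instruction with an active mask and a
# per-thread guard-predicate mask (as with @P0 in the SASS above).
def count_with_predicate(active_mask, pred_mask):
    inst_executed = 1                                    # one warp instruction
    thread_inst = bin(active_mask).count("1")            # active threads
    pred_on = bin(active_mask & pred_mask).count("1")    # active AND predicated true
    return inst_executed, thread_inst, pred_on

# Full warp, but the guard predicate is false for the upper 16 threads:
# all 32 threads count toward thread_inst, only 16 toward pred_on.
assert count_with_predicate(0xFFFFFFFF, 0x0000FFFF) == (1, 32, 16)
```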

The following metric relationships exist:

smsp__thread_inst_executed_pred_on.sum <= smsp__thread_inst_executed.sum <= smsp__inst_executed.sum x MAX_THREADS_PER_WARP
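
Checking the sample run posted earlier against these relationships (a quick arithmetic verification):

```python
# Sample-run values from the output posted earlier in the thread.
pred_on = 30_651_904
thread_inst = 31_322_624
inst_executed = 1_107_736
MAX_THREADS_PER_WARP = 32

# The chained relationship holds for the sample run.
assert pred_on <= thread_inst <= inst_executed * MAX_THREADS_PER_WARP

# pred_on is a subset of thread_inst; their difference is the number of
# thread instructions whose guard predicate was false.
pred_off = thread_inst - pred_on
assert pred_off == 670_720
```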