Inst_executed and thread_inst_executed

I would like to know the difference between the sm__inst_executed.sum and smsp__thread_inst_executed.sum metrics. Maybe the first one is at the warp level. Am I right? Please let me know if there is more information.

Yes, that’s correct: the difference is counting per warp versus per thread. You can find the description for each metric using the ncu --query-metrics flag on the command line. The UI also shows the descriptions as a tooltip when hovering over a metric.

smsp__inst_executed # of warp instructions executed
smsp__thread_inst_executed # of thread instructions executed
smsp__thread_inst_executed_pred_on # of thread instructions executed where guard predicate was true
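
To make the warp-vs-thread distinction concrete, here is a minimal Python sketch (an illustrative model, not Nsight Compute internals): for each warp-level instruction issued, inst_executed increments by one, while thread_inst_executed increments once per thread that is active for that instruction.

```python
# Illustrative model of the two counters for a single warp instruction.
def count_warp_instruction(active_mask):
    """active_mask: 32-bit mask of threads active for this instruction."""
    inst_executed = 1                                   # one warp-level instruction
    thread_inst_executed = bin(active_mask).count("1")  # one per active thread
    return inst_executed, thread_inst_executed

# A fully converged warp: 32 active threads.
assert count_warp_instruction(0xFFFFFFFF) == (1, 32)
# A divergent warp where only the low 8 threads take this path.
assert count_warp_instruction(0x000000FF) == (1, 8)
```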

The difference between sm__ and smsp__ metrics is whether they are collected at the SM level or at the SMSP level.



In a sample run I see

smsp__inst_executed.sum                             inst        1107736.000000  1107736.000000  1107736.000000
smsp__thread_inst_executed.sum                      inst        31322624.000000 31322624.000000 31322624.000000
smsp__thread_inst_executed_pred_on.sum              inst        30651904.000000 30651904.000000 30651904.000000

It seems that 31,322,624 - 1,107,736 = 30,214,888, which is close to the pred_on stat. So I guess thread_inst is the sum of pred_on and inst_executed.

Is that right in Nsight Compute terminology?

If we divide thread_inst by 32, I guess it should be the same as inst_executed, shouldn’t it? But I see

31,322,624/32 = 978,832

So there is a difference between that and 1,107,736, and the difference is not small.

Any idea about that?

An SM has 4 sub-partitions (SMSPs). Each sub-partition has a warp scheduler, a register file, warp slots, a dispatch unit, and multiple execution units (ALU, FMA, FP16x2, Tensor Cores, etc.). Metrics starting with smsp__ refer to counters collected at the SM sub-partition level.

The value of sm__inst_executed.sum == smsp__inst_executed.sum
The value of sm__inst_executed.avg == smsp__inst_executed.avg x 4
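
These rollup relationships can be sketched in Python (the unit counts and per-SMSP values below are made-up illustrations; only the 4-SMSPs-per-SM structure comes from the discussion above): .sum adds the same events regardless of grouping, while .avg divides by the number of instances at that level.

```python
# Sketch of .sum vs .avg rollups, assuming num_sm SMs with 4 SMSPs each.
num_sm = 2
per_smsp = [100, 120, 90, 110,   # SM 0, sub-partitions 0-3 (illustrative values)
            80, 105, 95, 100]    # SM 1, sub-partitions 0-3

smsp_sum = sum(per_smsp)
smsp_avg = smsp_sum / (num_sm * 4)   # averaged over SMSP instances
sm_sum = smsp_sum                    # same events, coarser grouping
sm_avg = sm_sum / num_sm             # averaged over SM instances

assert sm_sum == smsp_sum
assert sm_avg == smsp_avg * 4
```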

felix_dt has provided the descriptions of the counters.

A warp may have 1 to MAX_THREADS_PER_WARP active threads per cycle. For compute capabilities from Tesla through Ampere, MAX_THREADS_PER_WARP == 32. The number of active threads can be less than MAX_THREADS_PER_WARP if:

  1. The kernel was launched such that the number of threads per block is not a multiple of 32. (e.g. kernel<<<1, 1>>>() launches 1 block containing 1 thread; the warp for block 0 will have only 1 active thread.)
  2. A thread in the warp exits.
  3. There is thread divergence (e.g. a subset of threads take a branch).
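
Case 1 can be sketched with a small helper (hypothetical, for illustration only) that computes how many threads are active in each warp of a block:

```python
# Sketch: active threads per warp for a given block size, which need not
# be a multiple of 32 (e.g. kernel<<<1, 1>>>() gives a 1-thread warp).
WARP_SIZE = 32

def active_threads_per_warp(threads_per_block):
    warps = (threads_per_block + WARP_SIZE - 1) // WARP_SIZE
    return [min(WARP_SIZE, threads_per_block - w * WARP_SIZE)
            for w in range(warps)]

assert active_threads_per_warp(1) == [1]          # kernel<<<1, 1>>>()
assert active_threads_per_warp(64) == [32, 32]    # fully populated warps
assert active_threads_per_warp(50) == [32, 18]    # last warp partially active
```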

smsp__thread_inst_executed.sum <= smsp__inst_executed.sum x MAX_THREADS_PER_WARP.

A value less than this means one of the aforementioned cases occurred.
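
Applying this to the sample run above shows why dividing thread_inst by 32 does not recover inst_executed: the average number of active threads per warp instruction in that run was about 28.3, not 32 (a quick arithmetic check):

```python
# Sample-run numbers from the output posted earlier in the thread.
inst_executed = 1_107_736
thread_inst = 31_322_624

avg_active = thread_inst / inst_executed   # average active threads per warp instruction

assert thread_inst <= inst_executed * 32   # the relationship above holds
assert round(avg_active, 1) == 28.3        # so thread_inst / 32 < inst_executed
```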

A warp may have 1 to ACTIVE_THREADS predicated-true threads per cycle. The NVIDIA GPU ISA supports per-instruction guard predicates that allow individual threads to not participate in an instruction without the overhead of a divergent branch. The compiler often uses thread predication for short sequences of divergent code.

In Nsight Compute the Source View can show these counter values per SASS instruction, or rolled up to the higher-level language. At the SASS level each instruction can have a guard predicate of the form @[!]P#, where # is a predicate register number (PT is the predicate that is always true).

@P0 IADD R0, R1, R2 // if P0 is true then the thread will write-back the result; otherwise, the result is not written back

The @P0 is the guard predicate. Each thread can have a separate true/false value for it. For more information on guard predicates see the PTX ISA section Instruction Statements.
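
Extending the per-warp counting model with a guard predicate (again an illustrative sketch, not Nsight Compute internals): pred_on counts only those threads that were both active and had a true predicate.

```python
# Illustrative model: one warp instruction with an active mask and a
# per-thread guard-predicate mask (as with @P0 in the SASS above).
def count_with_predicate(active_mask, pred_mask):
    inst_executed = 1                                    # one warp instruction
    thread_inst = bin(active_mask).count("1")            # active threads
    pred_on = bin(active_mask & pred_mask).count("1")    # active AND predicated true
    return inst_executed, thread_inst, pred_on

# Full warp, but the guard predicate is false for the upper 16 threads:
# all 32 threads count toward thread_inst, only 16 toward pred_on.
assert count_with_predicate(0xFFFFFFFF, 0x0000FFFF) == (1, 32, 16)
```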

The following metric relationships exist:

smsp__thread_inst_executed_pred_on.sum <= smsp__thread_inst_executed.sum <= smsp__inst_executed.sum x MAX_THREADS_PER_WARP
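
Checking the sample run posted earlier against these relationships (a quick arithmetic verification):

```python
# Sample-run values from the output posted earlier in the thread.
pred_on = 30_651_904
thread_inst = 31_322_624
inst_executed = 1_107_736
MAX_THREADS_PER_WARP = 32

# The chained relationship holds for the sample run.
assert pred_on <= thread_inst <= inst_executed * MAX_THREADS_PER_WARP

# pred_on is a subset of thread_inst; their difference is the number of
# thread instructions whose guard predicate was false.
pred_off = thread_inst - pred_on
assert pred_off == 670_720
```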