Inst_executed and thread_inst_executed

A SM has 4 sub-partitions (SMSP). Each sub-partition has a warp scheduler, register files, warp slots, dispatch unit, and multiple execution units (ALU, FMA, FP16x2, Tensor Cores, etc.). Metrics starting with smsp refers to counters collected at the SM sub-partition level.

The value of sm__inst_executed.sum == smsp__inst_executed.sum
The value of sm__inst_executed.avg == smsp__inst_executed.avg x 4

felix_dt as provided the description of the counters.

A warp may have 1-MAX_THREADS_PER_WARP active threads per cycle. For CC Tesla - Ampere architecture MAX_THREADS_PER_WARP == 32. The number of threads can be less than MAX_THREADS_PER_WARP if

  1. The kernel was launched such that the number of threads per block is not a multiple of 32. (e.g. kernel<<<1, 1>>>() launches 1 block containing 1 thread. The warp for block 0 will have only 1 active thread.
  2. If a thread in the warp exits.
  3. If there is thread divergence (e.g. a subset of threads take a branch).

smsp__thread_inst_executed.sum <= smsp__inst_executed x MAX_THREADS_PER_WARP.

A value less than this means one of the aforementioned cases occurred.

A warp may have 1 - ACTIVE_THREADS predicated true threads per cycle. The NVIDIA GPU ISA supports per instruction guard predicates that allows individual threads to not participate in the instruction without the overhead of a divergent branch. The compiler will often use thread predication for short sequences of divergent code.

In Nsight Compute the Source View can show these counters values per SASS instruction or rolled-up to higher level language. At the SASS level each instruction can have a guard predicate @[!]P{#.T=true}.

@P0 IADD R0, R1, R2 // if P0 is true then the thread will write-back the result; otherwise, the result is not written back

The @P0 is the guard predicate. Each thread can have a separate true/false value. For more information on guard predicates see the PTX ISA section Instruction Statements.

The following metric relationships exist:
smsp__thread_inst_executed_pred_on.sum
<= smsp__thread_inst_executed.sum
<= smsp__inst_executed x MAX_THREADS_PER_WARP.

1 Like