Meaning of graphics perf counters

I need to analyze my game's frames to locate the bottleneck draw calls and the compute-intensive shaders in those draw calls. Here is my plan.

  1. Locate the bottleneck draw calls by looking at the SM throughput counter:
    sm__throughput.avg.pct_of_peak_sustained_elapsed (%)
  2. Locate the compute-intensive shader stages by looking at the throughput of each shader stage:
    [Counter group 1]
    sm__cycles_active_shader_cs.avg.pct_of_peak_sustained_elapsed (%)
    sm__cycles_active_shader_gs.avg.pct_of_peak_sustained_elapsed (%)
    sm__cycles_active_shader_ps.avg.pct_of_peak_sustained_elapsed (%)
    sm__cycles_active_shader_tcs.avg.pct_of_peak_sustained_elapsed (%)
    sm__cycles_active_shader_tes.avg.pct_of_peak_sustained_elapsed (%)
    sm__cycles_active_shader_vs.avg.pct_of_peak_sustained_elapsed (%)
    [Counter group 2]
    sm__warps_active.sum
    sm__warps_active_shader_vtg.sum
    sm__warps_active_shader_ps.sum
    sm__warps_active_shader_cs.sum
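The two steps of the plan above can be sketched as a filter followed by a ranking over per-draw counter values. This is a minimal sketch under stated assumptions: the DrawCounters container, the function names, and the 80% threshold are all hypothetical, and the counter values would have to be exported from whatever profiler you use. Step 2 ranks only by the Counter group 1 percentages; the warp counts in group 2 are left out for brevity.

```python
from dataclasses import dataclass, field

# Shader stages covered by "Counter group 1" in the plan.
STAGES = ("cs", "gs", "ps", "tcs", "tes", "vs")


@dataclass
class DrawCounters:
    """Counter values for one draw call or dispatch (hypothetical container)."""
    event_id: int
    # sm__throughput.avg.pct_of_peak_sustained_elapsed
    sm_throughput_pct: float
    # stage name -> sm__cycles_active_shader_<stage>.avg.pct_of_peak_sustained_elapsed
    stage_active_pct: dict = field(default_factory=dict)


def find_hot_draws(draws, threshold_pct=80.0):
    """Step 1: keep draws whose SM throughput is at or above threshold_pct,
    highest first. The 80% default is an arbitrary assumption."""
    hot = [d for d in draws if d.sm_throughput_pct >= threshold_pct]
    return sorted(hot, key=lambda d: d.sm_throughput_pct, reverse=True)


def rank_stages(draw):
    """Step 2: order the shader stages of one draw by active-cycle percentage,
    skipping stages that were never resident (0%)."""
    pairs = [(s, draw.stage_active_pct.get(s, 0.0)) for s in STAGES]
    active = [p for p in pairs if p[1] > 0.0]
    return sorted(active, key=lambda p: p[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical values for illustration only, not measurements.
    draws = [
        DrawCounters(1, 92.0, {"ps": 70.0, "vs": 30.0}),
        DrawCounters(2, 40.0, {"vs": 35.0}),
        DrawCounters(3, 85.0, {"cs": 84.0}),
    ]
    for d in find_hot_draws(draws):
        print(d.event_id, rank_stages(d))
```

A fixed threshold is the simplest choice. Taking the top N draws by SM throughput would work just as well.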

Does this plan sound reasonable?

After looking at my profiling results, I have the following two questions.

  1. What is the meaning of the following counters?
    sm__cycles_active_shader_cs.avg.pct_of_peak_sustained_elapsed (%)
    sm__cycles_active_shader_gs.avg.pct_of_peak_sustained_elapsed (%)
    sm__cycles_active_shader_ps.avg.pct_of_peak_sustained_elapsed (%)
    sm__cycles_active_shader_tcs.avg.pct_of_peak_sustained_elapsed (%)
    sm__cycles_active_shader_tes.avg.pct_of_peak_sustained_elapsed (%)
    sm__cycles_active_shader_vs.avg.pct_of_peak_sustained_elapsed (%)

I assume that
sm__cycles_active_shader_xx.avg.pct_of_peak_sustained_elapsed (%) =
sm__cycles_active_shader_xx.avg / sm__cycles_elapsed_shader_xx.avg × 100, so the value should never exceed 100%. However, in one of the profiling results I am investigating, sm__cycles_active_shader_ps.avg.pct_of_peak_sustained_elapsed (%) is larger than 100.

All values are .avg.pct_of_peak_sustained_elapsed (%). "active" = sm__cycles_active, "elapsed" = sm__cycles_elapsed, and cs, gs, ps, tcs, tes, vs = sm__cycles_active_shader_{cs, gs, ps, tcs, tes, vs}.

active     cs         gs   ps          tcs   tes   vs         elapsed
99.12263   99.18941   0    0           0     0     0          100
96.12748   96.24648   0    0           0     0     0          100
96.62175   96.69736   0    0           0     0     0          100
95.00545   95.65731   0    0           0     0     0          100
94.49121   94.46158   0    0           0     0     0          100
95.29343   0          0    103.65118   0     0     0.01094    100
95.29874   95.22861   0    0           0     0     0          100
97.74141   97.69843   0    0           0     0     0          100
79.56786   0          0    173.34756   0     0     0.53782    100
75.76738   0          0    162.66343   0     0     0.72478    100
55.19795   0          0    0           0     0     54.24617   100
53.87546   0          0    0           0     0     54.24021   100
78.82896   0          0    54.93192    0     0     54.98424   100
83.45561   81.61367   0    0           0     0     0          100
58.72605   0          0    21.3328     0     0     45.36058   100
56.48805   0          0    46.95401    0     0     40.3861    100
96.22414   0          0    98.76607    0     0     0.00389    100
  2. What is the relationship among the following counters?
    sm__warps_active.sum
    sm__warps_active_shader_vtg.sum
    sm__warps_active_shader_ps.sum
    sm__warps_active_shader_cs.sum
    I assume sm__warps_active.sum >= sm__warps_active_shader_vtg.sum + sm__warps_active_shader_ps.sum + sm__warps_active_shader_cs.sum. However, my data does not seem to satisfy this. Also, why are there no sm__warps_active_shader_vs.sum, sm__warps_active_shader_tcs.sum, sm__warps_active_shader_tes.sum, or sm__warps_active_shader_gs.sum counters?
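The observation in the first question can be checked row by row. Below is a minimal sketch that assumes the asker's formula (active cycles / elapsed cycles × 100); the function name check_row is hypothetical. It flags any single stage above 100%, which that formula would not allow. It also reports the sum over stages separately, because that sum answers a different question. It runs on three rows copied from the per-stage table above, with their zero columns omitted.

```python
# Stage columns of sm__cycles_active_shader_<stage>.avg.pct_of_peak_sustained_elapsed
STAGES = ("cs", "gs", "ps", "tcs", "tes", "vs")


def check_row(stage_pcts):
    """Return (stages above 100%, sum over all stages) for one table row.

    stage_pcts maps stage name -> percentage; missing stages count as 0.
    Under the assumed formula, no single stage should exceed 100%.
    """
    over = [s for s in STAGES if stage_pcts.get(s, 0.0) > 100.0]
    total = sum(stage_pcts.get(s, 0.0) for s in STAGES)
    return over, total


if __name__ == "__main__":
    # Rows copied from the table above (zero-valued columns omitted).
    rows = [
        {"cs": 99.18941},
        {"ps": 103.65118, "vs": 0.01094},
        {"ps": 54.93192, "vs": 54.98424},
    ]
    for row in rows:
        over, total = check_row(row)
        print(over, round(total, 5))
```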

Your current plan will only help identify compute-intensive shaders, but low-throughput shaders are also worth optimizing. The latest Nsight Graphics GPU Trace tool has many useful features for understanding unit throughputs and mapping them back to shaders via the Shader Profiler. I will defer to the graphics profiler team on the best way to use the tool.

The formula you specified is correct.
sm__cycles_active_shader_{shader_type} increments by 1 in each cycle in which the SM has one or more {shader_type} warps resident. It does not tell you how many warps are resident; sm__warps_active_shader_{shader_type} is required for that.

The SM can run more than one shader type at a time, so SUM_TYPES(sm__cycles_active_shader_{shader_type}.avg.pct_of_peak_sustained_elapsed) can exceed 100%.
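Here is a toy model of the counting rule described above. It assumes a hypothetical per-cycle trace for a single SM, where each cycle maps a shader type to its number of resident warps. Real counters are collected by hardware, and the .avg rollup also averages over SM instances. The model shows why two stage percentages can add up to more than 100%. It also shows why the cycle counter alone says nothing about how many warps were resident.

```python
# Hypothetical per-cycle trace for ONE SM: shader type -> resident warps.
TRACE = [
    {"vs": 2, "ps": 0},
    {"vs": 1, "ps": 3},
    {"vs": 1, "ps": 4},
    {"vs": 0, "ps": 2},
]


def cycles_active_pct(trace, shader_type):
    """Model of sm__cycles_active_shader_<type>: +1 for every cycle with at
    least one resident warp of that type, as a percentage of elapsed cycles."""
    active = sum(1 for cycle in trace if cycle.get(shader_type, 0) >= 1)
    return 100.0 * active / len(trace)


def sm_cycles_active_pct(trace):
    """Model of sm__cycles_active: cycles with any resident warp at all."""
    active = sum(1 for cycle in trace if sum(cycle.values()) >= 1)
    return 100.0 * active / len(trace)


def warps_active_sum(trace, shader_type):
    """Model of sm__warps_active_shader_<type>.sum: resident warps summed
    over cycles, i.e. how many warps, not just whether any."""
    return sum(cycle.get(shader_type, 0) for cycle in trace)


if __name__ == "__main__":
    vs = cycles_active_pct(TRACE, "vs")
    ps = cycles_active_pct(TRACE, "ps")
    print(vs, ps, vs + ps, sm_cycles_active_pct(TRACE))  # 75.0 75.0 150.0 100.0
    print(warps_active_sum(TRACE, "ps"))  # 9
```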

The relationship you specified is correct, whether written with .avg or .sum. However, the formula

sm__warps_active.avg.pct_of_peak_sustained_elapsed = SUM(sm__warps_active_shader_{vtg, ps, cs}.avg.pct_of_peak_sustained_elapsed) is not valid, because some shader types have a lower warp limit. For example, on many chips VTG is limited to 32 warps, whereas the PS, CS, and overall SM maximum is 48 (or 64). Adding the .pct_of_peak_sustained_elapsed values can therefore exceed 100%.
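Here is a worked example of the point above. It uses the example limits from the reply (VTG 32 warps, PS/CS/SM 48), which vary by chip. It also assumes that each per-type percentage is normalized by that type's own limit. Warp counts add up, so the .sum relationship holds. Percentages computed against different peaks do not add up.

```python
# Example per-SM warp limits from the reply above; actual limits vary by chip.
SM_MAX_WARPS = 48
PEAK_WARPS = {"vtg": 32, "ps": 48, "cs": 48}


def pct_of_peak(warps, peak):
    """Average resident warps as a percentage of a warp limit."""
    return 100.0 * warps / peak


def compare(resident):
    """resident maps shader type -> average resident warps per SM.

    Returns (total as % of the SM limit,
             sum of each type's % of its own limit).
    """
    total_pct = pct_of_peak(sum(resident.values()), SM_MAX_WARPS)
    per_type_sum = sum(pct_of_peak(w, PEAK_WARPS[t]) for t, w in resident.items())
    return total_pct, per_type_sum


if __name__ == "__main__":
    # Hypothetical load: 24 VTG warps + 24 PS warps fill a 48-warp SM exactly.
    print(compare({"vtg": 24, "ps": 24}))  # (100.0, 125.0)
```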

Hardware PM signals for the vs, tcs, tes, and gs warp counters you asked about do not exist. VTG covers vertex, tessellation (TCS/TES, i.e. HS/DS), and mesh (amplification, mesh) shaders.

Hi Greg,

Thanks for the reply. @Greg

Currently I am focusing on compute-intensive shaders and will look at memory-bound shaders later on.

According to my profiling results, some draw calls have
sm__warps_active.sum < sm__warps_active_shader_vtg.sum + sm__warps_active_shader_ps.sum + sm__warps_active_shader_cs.sum. Could you explain why?

When the counters are captured in the same pass, I do not expect more than a ±2% error in your formula. If you see a larger difference, I recommend filing a bug that includes your GPU, tool/SDK versions, etc.

Hi Greg,

Here is one of my profiling results:

All columns are .sum values. "total" = sm__warps_active, and cs, ps, vtg = sm__warps_active_shader_{cs, ps, vtg}.

EID  total      cs  ps         vtg
52   1.55E+07   0   1.15E+07   3588239
56   6370862    0   2842330    3540487
60   640909     0   301632     322197
64   2695681    0   2151912    336417
68   3253125    0   939804     2387980
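The ±2% tolerance from the reply above can be applied to this table mechanically. In this sketch, that tolerance is read as the largest allowed excess of cs + ps + vtg over sm__warps_active.sum, relative to sm__warps_active.sum. That reading is an interpretation. The function name is hypothetical, and the EID 52 values are rounded to three significant figures in the source. The reply's tolerance assumes all counters were captured in the same pass, and whether that holds for this capture is not stated.

```python
# Rows from the table above: EID -> (total, cs, ps, vtg), all .sum values.
# EID 52 is given to 3 significant figures only.
ROWS = {
    52: (1.55e7, 0, 1.15e7, 3588239),
    56: (6370862, 0, 2842330, 3540487),
    60: (640909, 0, 301632, 322197),
    64: (2695681, 0, 2151912, 336417),
    68: (3253125, 0, 939804, 2387980),
}

TOLERANCE = 0.02  # the +/-2% same-pass error quoted in the reply


def excess_fraction(total, cs, ps, vtg):
    """How far cs + ps + vtg exceeds sm__warps_active.sum, relative to it.
    Zero or negative means total >= sum holds."""
    return (cs + ps + vtg - total) / total


if __name__ == "__main__":
    for eid, row in ROWS.items():
        excess = excess_fraction(*row)
        status = "FLAG" if excess > TOLERANCE else "ok"
        print(f"EID {eid}: excess {excess:+.2%} {status}")
```

Under this reading, EID 56 and EID 68 both violate total >= sum. Only EID 68 (about +2.3%) falls outside the quoted tolerance.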

I profiled with RenderDoc v1.33 on an NVIDIA GeForce RTX 3070 with driver version 556.12. @Greg

Thanks,
Wallace