Profiling overhead

Hi
I would like to know some information about the overhead of the profiler. I didn’t find a clear answer to my question in the manual; maybe I missed it.

Assume a kernel without any specific metric takes X cycles. If I apply one or ten metrics, does this X change? I know the cycle count varies on real hardware when the program is executed multiple times, but I want to know about the profiler’s own overhead.

My second question is about the percentage of overhead when the number of cycles is small. For example, if the reported number of cycles is 1000, how accurate is the profiler at that level? How many cycles out of the 1000 are devoted to the profiler collecting the stats?

Thanks in advance…

Collecting performance counters and sampling shader program counters does not have a direct impact on the duration of the shader, as neither data collection system directly inserts any code or stalls into the kernel.

The profiler may perform other operations that impact the duration of the kernel:

  • Collection of SASS metrics requires binary patching the kernel assembly code. This can increase the kernel duration by as much as 100x. These metrics are collected in separate passes from performance counters.
  • Fixing the clock rate is done to ensure consistent clock rates between replays. The tool fixes the clocks to the base clocks. The developer can disable this feature and use nvidia-smi or another tool to lock the clocks to a specific rate (see the example commands after this list).
  • Flushing the caches between replays is done to ensure more accurate multi-pass metrics.
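
For reference, here is a minimal sketch of how these behaviors can be controlled from the command line. It assumes a recent Nsight Compute (ncu) version, a hypothetical application ./my_app, and an example clock value; none of these names come from the discussion above:

  # Default behavior: ncu locks the clocks to base clocks and flushes the caches between replays
  ncu --metrics gpu__time_duration.sum,gpc__cycles_elapsed.avg ./my_app

  # Disable ncu's clock and cache control and lock the clocks manually instead,
  # e.g. with nvidia-smi (requires a GPU/driver that supports clock locking and admin rights)
  nvidia-smi --lock-gpu-clocks=1410,1410   # 1410 MHz is only an example value
  ncu --clock-control none --cache-control none \
      --metrics gpu__time_duration.sum,gpc__cycles_elapsed.avg ./my_app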

The profiler can have an impact on the performance of the application (see the example after this list for one way to limit that impact). The profiler

  • intercepts the CUDA API to control profiling, which may add a small overhead to CUDA API calls
  • serializes kernel launches by idling the GPU between launches
  • stalls the GPU in order to configure the performance monitors or patch kernels
  • saves and restores all mutable state between replays
  • at the highest sampling rate for shader program counters, the additional traffic on the PCIe bus could have a small impact on a kernel that is heavily writing to pinned system memory.
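
One way to keep this application-level impact contained is to restrict which kernel launches are profiled. A minimal sketch, assuming recent ncu option names, the hypothetical ./my_app from above, and a hypothetical kernel named myKernel:

  # Profile only one launch of myKernel, skipping its first two launches
  ncu --kernel-name myKernel --launch-skip 2 --launch-count 1 \
      --metrics gpu__time_duration.sum ./my_app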

If I understand correctly, you are saying that a metric like gpu__time_duration.sum contains the kernel time plus the profiler overhead, while gpc__cycles_elapsed.avg only records kernel execution cycles. Is that right? Then I would like to know whether there is any metric that reports the kernel time (not cycles) without profiler overhead. Although that can be calculated from the cycles metric, I just want to know whether another metric exists so I can do the calculations and see if they match.

gpu__time_duration.sum does not contain any profiler overhead, as there is no profiler overhead introduced during the collection of this metric (nor during the collection of the elapsed cycles). The only metrics introducing runtime overhead are SASS metrics, which are collected in separate passes for this reason.

If you are using Nsight Compute, then you are serializing kernels, which changes the execution on the GPU. Nsight Compute targets single-kernel profiling, so its job is to make the information for the kernel execution as accurate as possible by isolating the kernel. gpu__time_duration.sum introduces no measurable overhead to the kernel execution. However, gpu__time_duration is end timestamp minus start timestamp. If the kernel is long enough, it is likely to be context-switched out, and the duration will include time the GPU spent executing another context.

Nsight Systems/CUPTI support the most accurate start and end timestamps that we can collect on the GPU. It is recommended that you use Nsight Systems prior to using Nsight Compute to optimize individual kernels. It is also recommended that you iterate back and forth between the two tools, as you can naively optimize a kernel at the loss of concurrency between kernels, resulting in a performance regression.
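
A minimal sketch of that workflow, reusing the hypothetical ./my_app and myKernel names from above (the option names assume recent nsys/ncu versions):

  # 1) Application-level timeline first: kernel overlap, memcpys, CPU/GPU gaps
  nsys profile -o my_app_timeline ./my_app

  # 2) Then drill into an individual kernel with Nsight Compute
  ncu --kernel-name myKernel --launch-count 1 -o my_app_kernel ./my_app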

Thanks for the replies. Some questions have been answered, yet some new questions have arisen.

By SASS metrics, I assume you are talking about the *_sass_* metrics that are mostly related to instructions. When I checked the available metrics, I saw smsp__sass_inst_executed and smsp__inst_executed. Both are warp-based metrics. So, are they fundamentally different?
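
For reference, one way to list the available metrics and check for these two counters, assuming a recent ncu version:

  # List the available base metric names and filter for the two instruction counters in question
  ncu --query-metrics | grep -E "smsp__(sass_)?inst_executed"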

What I take from your last reply is that if I use SASS and time metrics together (scenario 1), the reported time may be larger than in scenario 2, where I use only the time metric. Is that right? If so, that means I would have to profile the application in a separate pass just to record more realistic times. Is that correct?

OK. I did a test with a large matrix multiplication, run in two scenarios:

  1. cycles,time,sass
  void MatrixMulCUDA<32>(float*, float*, float*, int, int), 2022-Jan-26 15:36:41, Context 1, Stream 13
    Section: Command line profiler metrics
    ---------------------------------------------------------------------- --------------- ------------------------------
    gpc__cycles_elapsed.avg                                                          cycle                   3,459,795.33
    gpu__time_duration.sum                                                         msecond                           2.41
    smsp__sass_inst_executed.sum                                                      inst                    183,468,032
    ---------------------------------------------------------------------- --------------- ------------------------------

  void MatrixMulCUDA<32>(float*, float*, float*, int, int), 2022-Jan-26 15:36:41, Context 1, Stream 13
    Section: Command line profiler metrics
    ---------------------------------------------------------------------- --------------- ------------------------------
    gpc__cycles_elapsed.avg                                                          cycle                   3,458,849.17
    gpu__time_duration.sum                                                         msecond                           2.40
    smsp__sass_inst_executed.sum                                                      inst                    183,468,032
    ---------------------------------------------------------------------- --------------- ------------------------------

  2. cycles,time

  void MatrixMulCUDA<32>(float*, float*, float*, int, int), 2022-Jan-26 15:36:52, Context 1, Stream 13
    Section: Command line profiler metrics
    ---------------------------------------------------------------------- --------------- ------------------------------
    gpc__cycles_elapsed.avg                                                          cycle                      3,450,932
    gpu__time_duration.sum                                                         msecond                           2.40
    ---------------------------------------------------------------------- --------------- ------------------------------

  void MatrixMulCUDA<32>(float*, float*, float*, int, int), 2022-Jan-26 15:36:52, Context 1, Stream 13
    Section: Command line profiler metrics
    ---------------------------------------------------------------------- --------------- ------------------------------
    gpc__cycles_elapsed.avg                                                          cycle                   3,450,957.83
    gpu__time_duration.sum                                                         msecond                           2.40
    ---------------------------------------------------------------------- --------------- ------------------------------

So, it seems that gpu__time_duration.sum is robust in this example. I don’t know whether that will also change if I increase the number of SASS metrics or the problem size.
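
As a rough sanity check on deriving time from cycles (my earlier question): 3,459,795 cycles / 2.41 ms ≈ 1.44 GHz in scenario 1 and 3,450,932 cycles / 2.40 ms ≈ 1.44 GHz in scenario 2, so the cycle and time metrics are consistent with the same fixed clock rate in both runs.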

P.S.: I also ran with smsp__sass_inst_executed and smsp__inst_executed, and they were equal.

As I mentioned before, SASS metrics and HW metrics are measured in separate (different) passes so that they don’t influence each other. These passes are scheduled and replayed transparently by the tool. Collection of the HW metric(s) will not change the values collected from SASS, and vice versa, the SASS collection overhead will not impact the precision of the HW metrics.

(For some HW metrics, there can be a non-deterministic impact due to cache effects when Nsight Compute’s cache-control feature is manually disabled.)
