Profiling overhead

Hi
I would like to know some information about the overhead of the profiler. I didn’t find a clear answer to my question in the manual; maybe I missed it.

Assume a kernel without any specific metric takes X cycles. If I apply one or ten metrics, does this X change? I know the cycle count varies on real hardware when the program is executed multiple times, but I want to know about the profiler’s own overhead.

My second question is about the percentage of overhead when the number of cycles is small. For example, if the reported number of cycles is 1000, how accurate is the profiler at that level? How many cycles out of the 1000 are devoted to the profiler collecting the stats?

Thanks in advance…

Collecting performance counters and sampling shader program counters does not have a direct impact on the duration of the shader, as neither data collection system directly inserts any code or stalls into the kernel.

The profiler may perform other operations that impact the duration of the kernel:

  • Collection of SASS metrics requires binary patching the kernel assembly code. This can increase the kernel duration by as much as 100x. These metrics are collected in separate passes from performance counters.
  • Fixing the clock rate is done to ensure consistent clock rates between replays. The tool fixes the clocks to the base clocks. The developer can disable this feature and use nvidia-smi or another tool to lock the clocks to a specific rate (see the example commands after this list).
  • Flushing the caches between replays is done to ensure more accurate multi-pass metrics.
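
For reference, here is a minimal sketch of how these behaviors can be controlled from the command line. It assumes a recent Nsight Compute (ncu) version, a hypothetical application ./my_app, and an example clock value; none of these names come from the discussion above:

  # Default behavior: ncu locks the clocks to base clocks and flushes the caches between replays
  ncu --metrics gpu__time_duration.sum,gpc__cycles_elapsed.avg ./my_app

  # Disable ncu's clock and cache control and lock the clocks manually instead,
  # e.g. with nvidia-smi (requires a GPU/driver that supports clock locking and admin rights)
  nvidia-smi --lock-gpu-clocks=1410,1410   # 1410 MHz is only an example value
  ncu --clock-control none --cache-control none \
      --metrics gpu__time_duration.sum,gpc__cycles_elapsed.avg ./my_app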

The profiler can have an impact on the performance of the application (see the example after this list for one way to limit that impact). The profiler

  • intercepts the CUDA API to control profiling, which may add a small overhead to CUDA API calls
  • serializes kernel launches by idling the GPU between launches
  • stalls the GPU in order to configure the performance monitors or patch kernels
  • saves and restores all mutable state between replays
  • at the highest sampling rate for shader program counters, the additional traffic on the PCIe bus could have a small impact on a kernel that is heavily writing to pinned system memory.
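
One way to keep this application-level impact contained is to restrict which kernel launches are profiled. A minimal sketch, assuming recent ncu option names, the hypothetical ./my_app from above, and a hypothetical kernel named myKernel:

  # Profile only one launch of myKernel, skipping its first two launches
  ncu --kernel-name myKernel --launch-skip 2 --launch-count 1 \
      --metrics gpu__time_duration.sum ./my_app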

If I understand correctly, you are saying that a metric like gpu__time_duration.sum contains the kernel time plus the profiler overhead, while gpc__cycles_elapsed.avg only records kernel execution cycles. Is that right? Then I would like to know whether there is any metric that reports the kernel time (not cycles) without profiler overhead. Although that can be calculated from the cycles metric, I just want to know whether another metric exists so I can do the calculations and see if they match.

gpu__time_duration.sum does not contain any profiler overhead, as there is no profiler overhead introduced during the collection of this metric (nor during the collection of the elapsed cycles). The only metrics introducing runtime overhead are SASS metrics, which are collected in separate passes for this reason.

If you are using Nsight Compute, then you are serializing kernels, which changes the execution on the GPU. Nsight Compute targets single-kernel profiling, so its job is to make the information for the kernel execution as accurate as possible by isolating the kernel. gpu__time_duration.sum introduces no measurable overhead to the kernel execution. However, gpu__time_duration is end timestamp minus start timestamp. If the kernel is long enough, it is likely to be context-switched out, and the duration will include time the GPU spent executing another context.

Nsight Systems/CUPTI support the most accurate start and end timestamps that we can collect on the GPU. It is recommended that you use Nsight Systems prior to using Nsight Compute to optimize individual kernels. It is also recommended that you iterate back and forth between the two tools, as you can naively optimize a kernel at the loss of concurrency between kernels, resulting in a performance regression.
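
A minimal sketch of that workflow, reusing the hypothetical ./my_app and myKernel names from above (the option names assume recent nsys/ncu versions):

  # 1) Application-level timeline first: kernel overlap, memcpys, CPU/GPU gaps
  nsys profile -o my_app_timeline ./my_app

  # 2) Then drill into an individual kernel with Nsight Compute
  ncu --kernel-name myKernel --launch-count 1 -o my_app_kernel ./my_app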

Thanks for the replies. Some questions have been answered, yet some new questions have arisen.

By SASS metrics, I assume you are talking about the *_sass_* metrics that are mostly related to instructions. When I checked the available metrics, I saw smsp__sass_inst_executed and smsp__inst_executed. Both are warp-based metrics. So, are they fundamentally different?
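
For reference, one way to list the available metrics and check for these two counters, assuming a recent ncu version:

  # List the available base metric names and filter for the two instruction counters in question
  ncu --query-metrics | grep -E "smsp__(sass_)?inst_executed"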

What I take from your last reply is that if I use SASS and time metrics together (scenario 1), the reported time may be larger than in scenario 2, where I use only the time metric. Is that right? If so, that means I would have to profile the application in a separate pass just to record more realistic times. Is that correct?

OK. I did a test with a large matrix multiplication, run in two scenarios:

  1. cycles,time,sass
  void MatrixMulCUDA<32>(float*, float*, float*, int, int), 2022-Jan-26 15:36:41, Context 1, Stream 13
    Section: Command line profiler metrics
    ---------------------------------------------------------------------- --------------- ------------------------------
    gpc__cycles_elapsed.avg                                                          cycle                   3,459,795.33
    gpu__time_duration.sum                                                         msecond                           2.41
    smsp__sass_inst_executed.sum                                                      inst                    183,468,032
    ---------------------------------------------------------------------- --------------- ------------------------------

  void MatrixMulCUDA<32>(float*, float*, float*, int, int), 2022-Jan-26 15:36:41, Context 1, Stream 13
    Section: Command line profiler metrics
    ---------------------------------------------------------------------- --------------- ------------------------------
    gpc__cycles_elapsed.avg                                                          cycle                   3,458,849.17
    gpu__time_duration.sum                                                         msecond                           2.40
    smsp__sass_inst_executed.sum                                                      inst                    183,468,032
    ---------------------------------------------------------------------- --------------- ------------------------------

  2. cycles,time

  void MatrixMulCUDA<32>(float*, float*, float*, int, int), 2022-Jan-26 15:36:52, Context 1, Stream 13
    Section: Command line profiler metrics
    ---------------------------------------------------------------------- --------------- ------------------------------
    gpc__cycles_elapsed.avg                                                          cycle                      3,450,932
    gpu__time_duration.sum                                                         msecond                           2.40
    ---------------------------------------------------------------------- --------------- ------------------------------

  void MatrixMulCUDA<32>(float*, float*, float*, int, int), 2022-Jan-26 15:36:52, Context 1, Stream 13
    Section: Command line profiler metrics
    ---------------------------------------------------------------------- --------------- ------------------------------
    gpc__cycles_elapsed.avg                                                          cycle                   3,450,957.83
    gpu__time_duration.sum                                                         msecond                           2.40
    ---------------------------------------------------------------------- --------------- ------------------------------

So, it seems that gpu__time_duration.sum is robust in this example. I don’t know whether that will also change if I increase the number of SASS metrics or the problem size.
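
As a rough sanity check on deriving time from cycles (my earlier question): 3,459,795 cycles / 2.41 ms ≈ 1.44 GHz in scenario 1 and 3,450,932 cycles / 2.40 ms ≈ 1.44 GHz in scenario 2, so the cycle and time metrics are consistent with the same fixed clock rate in both runs.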

P.S.: I also ran with smsp__sass_inst_executed and smsp__inst_executed, and they were equal.

As I mentioned before, SASS metrics and HW metrics are measured in separate (different) passes so that they don’t influence each other. These passes are scheduled and replayed transparently by the tool. Collection of the HW metric(s) will not change the values collected from SASS, and vice versa, the SASS collection overhead will not impact the precision of the HW metrics.

(For some HW metrics, there can be a non-deterministic impact due to cache effects when Nsight Compute’s cache-control feature is manually disabled.)
