Cycles in nsight-compute and nsight-systems

For the matrixMul code in the CUDA samples, I see different cycle counts (or elapsed times) with Nsight Compute and Nsight Systems.
1- In Nsight Compute, I used gpc__cycles_elapsed.avg and then summed over all invocations.

    Section: Command line profiler metrics
    Metric Name             Metric Unit Minimum      Maximum      Average     
    ----------------------- ----------- ------------ ------------ ------------
    gpc__cycles_elapsed.avg cycle       34854.500000 36328.500000 35268.038206

The result is 10,615,679 cycles (the average multiplied by 301 kernel calls). On a 2080 Ti running at 1.45 GHz, that is 7,321,158 ns.
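For anyone who wants to check the arithmetic, here is a minimal sketch of the conversion. The 1.45 GHz clock is my assumption and is not something the profiler reported. The variable names are only for illustration.

```python
# Values copied from the Nsight Compute output above.
avg_cycles_per_call = 35268.038206   # gpc__cycles_elapsed.avg, averaged over invocations
kernel_calls = 301

# Assumption: the GPU ran at a constant 1.45 GHz during the measurement.
assumed_clock_hz = 1.45e9

# Total cycles over all invocations, then cycles -> seconds -> nanoseconds.
total_cycles = avg_cycles_per_call * kernel_calls
total_ns = total_cycles / assumed_clock_hz * 1e9

print(f"total cycles: {int(total_cycles):,}")
print(f"total time:   {int(total_ns):,} ns")
```

If the real clock during profiling was different from 1.45 GHz, the time in ns changes proportionally.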

2- In Nsight Systems, I see the following time:

 Time (%)  Total Time (ns)  Instances  Avg (ns)  Med (ns)  Min (ns)  Max (ns)  StdDev (ns)                                Name                              
 --------  ---------------  ---------  --------  --------  --------  --------  -----------  ----------------------------------------------------------------
    100.0        8,165,424        301  27,127.7  26,946.0    26,721    35,585      1,013.3  void MatrixMulCUDA<(int)32>(float *, float *, float *, int, int)

Here the total time is about 8.2M ns.

So there are actually two different run times: 7.3M ns vs. 8.2M ns.
This is a small example, so maybe the gap can be explained by profiling overhead. However, I have seen differences in other workloads too. So I wonder whether the two tools calculate time and cycles the same way under the hood. Any thoughts on that?
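To put a number on the gap, here is a rough sketch that compares the two totals and also computes which clock frequency would make them agree. It assumes both tools measured the same execution interval, which is not verified here.

```python
ncu_total_cycles = 35268.038206 * 301   # from the Nsight Compute output
nsys_total_ns = 8_165_424               # from the Nsight Systems output
assumed_clock_ghz = 1.45                # assumed clock, not reported by either tool

# 1 GHz is one cycle per ns, so cycles / GHz gives ns.
ncu_total_ns = ncu_total_cycles / assumed_clock_ghz
relative_gap = (nsys_total_ns - ncu_total_ns) / ncu_total_ns

# Clock that would make the cycle count match the Nsight Systems time exactly,
# assuming both tools covered the same execution interval (an assumption).
implied_clock_ghz = ncu_total_cycles / nsys_total_ns

print(f"relative gap:  {relative_gap:.1%}")
print(f"implied clock: {implied_clock_ghz:.3f} GHz")
```

With these inputs the gap is a bit over 11%, and the implied clock is close to 1.30 GHz. So it may be worth checking which clock the GPU actually ran at under each tool before attributing the whole gap to overhead.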

Updating my reply with some more precise information about how the collection methods differ. You can find this info here: Profiling overhead - #5 by Greg

I’ll paste it here for ease of access:
If you are using Nsight Compute then you are serializing kernels which changes the execution on the GPU. Nsight Compute targets single kernel profiling so its job is to make the information for the kernel execution as accurate as possible by isolating the kernel. gpu__time_duration.sum introduces no measurable overhead to the kernel execution. However, gpu__time_duration is end timestamp - start timestamp. If the kernel is long enough it is likely to context switch and the duration will include time the GPU spent executing another context.
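As a toy illustration of the last point, the sketch below uses made-up timestamps (not measured data) to show how an end-minus-start duration can include time the GPU spent on another context.

```python
# Hypothetical timestamps in ns, chosen only for illustration.
start_timestamp_ns = 1_000
end_timestamp_ns = 31_000

# Hypothetical time the GPU spent executing another context in between.
other_context_ns = 4_000

# What an end-minus-start metric reports for the kernel.
reported_duration_ns = end_timestamp_ns - start_timestamp_ns

# Time actually spent on this kernel in this toy model.
kernel_only_ns = reported_duration_ns - other_context_ns

print(f"reported duration: {reported_duration_ns} ns")
print(f"kernel only:       {kernel_only_ns} ns")
```

In this model the reported duration overstates the kernel's own time by exactly the length of the context switch, which is why a long kernel can look slower than it is.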

Nsight Systems/CUPTI support the most accurate start timestamp and end timestamp that we can do on the GPU. It is recommended that you use Nsight Systems prior to using Nsight Compute to optimize individual kernels. It is also recommended that you iterate back and forth between the two tools, as you can naively optimize a kernel at the loss of concurrency between kernels, resulting in a performance regression.

Due to these differences in collection methodologies there can be variations, particularly in smaller kernels. It’s recommended to start with Nsight Systems for kernel runtime measurement and to use Nsight Compute for measuring the performance internals of the kernel.
