Hi
For the matrixMul code in the CUDA samples, I see different cycle counts (or elapsed times) with Nsight Compute and Nsight Systems.
1- In Nsight Compute, I used gpc__cycles_elapsed.avg and then summed it over all invocations.
Section: Command line profiler metrics
Metric Name Metric Unit Minimum Maximum Average
----------------------- ----------- ------------ ------------ ------------
gpc__cycles_elapsed.avg cycle 34854.500000 36328.500000 35268.038206
The result is 10,615,679 cycles across the 301 kernel calls. On a 2080 Ti running at 1.45 GHz, that is 7,321,158 ns.
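For reference, here is that arithmetic as a small Python sketch (the fixed 1.45 GHz clock is an assumption about the sustained clock, not a measured value):

# Sketch of the conversion above; assumes a fixed 1.45 GHz GPC clock and
# that all 301 launches behave like the reported average.
avg_cycles = 35268.038206      # gpc__cycles_elapsed.avg reported by ncu
launches   = 301
clock_hz   = 1.45e9            # assumed sustained clock on the 2080 Ti

total_cycles  = avg_cycles * launches            # ~10,615,679 cycles
total_time_ns = total_cycles / clock_hz * 1e9    # ~7,321,158 ns
print(f"{total_cycles:,.0f} cycles -> {total_time_ns:,.0f} ns")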
2- In Nsight Systems, I see the following timing:
Time (%) Total Time (ns) Instances Avg (ns) Med (ns) Min (ns) Max (ns) StdDev (ns) Name
-------- --------------- --------- -------- -------- -------- -------- ----------- ----------------------------------------------------------------
100.0 8,165,424 301 27,127.7 26,946.0 26,721 35,585 1,013.3 void MatrixMulCUDA<(int)32>(float *, float *, float *, int, int)
Here the total time is 8.1M ns.
So there are actually two different run times: 7.3M ns vs. 8.1M ns.
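Just to quantify the gap (a rough sketch; the per-launch figure is simply the implied difference, not a measured overhead):

ncu_ns  = 7_321_158    # derived from gpc__cycles_elapsed.avg above
nsys_ns = 8_165_424    # total from the Nsight Systems kernel summary
gap_ns  = nsys_ns - ncu_ns
print(f"gap: {gap_ns:,} ns ({gap_ns / ncu_ns:.1%}), "
      f"~{gap_ns / 301:,.0f} ns per launch")
# gap: 844,266 ns (11.5%), ~2,805 ns per launch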
This is a small example, so perhaps the gap can be explained by profiling overhead. However, I have seen similar differences in other workloads, so I wonder whether the under-the-hood mechanisms the two tools use to measure time and cycles are the same or different. Any thoughts on that?