Hi
I would like to know how “nsys profile --trace=cuda” calculates the Average value, because I am not able to reproduce it from the nv-nsight-cu-cli output.
For example, I see
execute3DconvolutionCuda_split(float*, float*, float*, float*, int, int, int, int, int, int, int, int, int), Block Size 1024, Grid Size 64, Device 0, 6 invocations
Section: Command line profiler metrics
Metric Name                             Metric Unit  Minimum     Maximum     Average
--------------------------------------  -----------  ----------  ----------  ----------
gpu__time_active.avg                    msecond      0.242144    2.102880    1.409056
execute3DconvolutionCuda_split(float*, float*, float*, float*, int, int, int, int, int, int, int, int, int), Block Size 1024, Grid Size 256, Device 0, 4 invocations
Section: Command line profiler metrics
Metric Name                             Metric Unit  Minimum     Maximum     Average
--------------------------------------  -----------  ----------  ----------  ----------
gpu__time_active.avg                    usecond      927.776000  947.072000  933.528000
and
Time(%)  Total Time (ns)  Instances  Average      Minimum  Maximum    Name
-------  ---------------  ---------  -----------  -------  ---------  ----
31.4     12,513,663       10         1,251,366.3  252,750  2,158,617  execute3DconvolutionCuda_split(float*, float*, float*, float*, int, int, int, int, int, int, int, i…
As you can see, the total number of invocations from Nsight Compute (6+4) is the same as in nsys (10). The minimum and maximum are reasonably close. To compute the average manually, I use a weighted average, which gives
Weighted average:
(6*1409 + 4*933.5)/10 = 1218.8 useconds
Even with other methods (no matter if they are meaningful or not in this example), we see:
1- Arithmetic average:
(1409+933.5)/2 = 1171.25 useconds
2- Weighted harmonic mean:
(6+4)/((6/1409)+(4/933.5)) = 1170.5 useconds
3- Weighted geometric mean:
(1409^6 * 933.5^4) ^ (1/(6+4)) = 1195 useconds
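To make these manual calculations easy to repeat, here is a minimal Python sketch that evaluates the same four means. It assumes the full-precision averages from the Nsight Compute output above (1409.056 and 933.528 useconds) rather than the rounded values, and the variable names are only illustrative.

```python
import math

# (invocations, average in useconds) for each launch configuration,
# copied from the Nsight Compute output above
groups = [(6, 1409.056), (4, 933.528)]

# nsys "Average" column, converted from ns to useconds
nsys_average_us = 1251.3663

total_n = sum(n for n, _ in groups)

# Mean weighted by the number of invocations in each group
weighted_arithmetic = sum(n * avg for n, avg in groups) / total_n
# Unweighted mean of the two group averages
plain_arithmetic = sum(avg for _, avg in groups) / len(groups)
# Weighted harmonic mean
weighted_harmonic = total_n / sum(n / avg for n, avg in groups)
# Weighted geometric mean, computed in log space to avoid huge powers
weighted_geometric = math.exp(
    sum(n * math.log(avg) for n, avg in groups) / total_n
)

candidates = {
    "weighted arithmetic": weighted_arithmetic,
    "plain arithmetic": plain_arithmetic,
    "weighted harmonic": weighted_harmonic,
    "weighted geometric": weighted_geometric,
}

for name, value in candidates.items():
    gap = nsys_average_us - value
    print(f"{name:20s} {value:9.1f} us   gap to nsys: {gap:6.1f} us")
```

Each line of output shows one candidate mean next to its distance from the nsys figure.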
However, nsys says 1251 useconds.
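The nsys figure itself can be cross-checked against the other columns in its own row. The sketch below only tests whether the reported Average equals Total Time divided by Instances, and compares the nsys total with the total implied by the Nsight Compute averages. It uses numbers copied from the output above and does not assume anything about nsys internals.

```python
# Values copied from the nsys row above (nanoseconds)
total_time_ns = 12_513_663
instances = 10
reported_average_ns = 1_251_366.3

# Plain arithmetic mean over all kernel instances
mean_ns = total_time_ns / instances
matches = abs(mean_ns - reported_average_ns) < 0.1
print(f"Total Time / Instances = {mean_ns:,.1f} ns (matches Average: {matches})")

# Total GPU time implied by the Nsight Compute averages (useconds -> ns)
ncu_total_ns = (6 * 1409.056 + 4 * 933.528) * 1000
ratio = total_time_ns / ncu_total_ns
print(f"Nsight Compute implied total = {ncu_total_ns:,.0f} ns")
print(f"nsys total / Nsight Compute total = {ratio:.4f}")
```

If the first check holds, nsys is reporting an ordinary mean over all 10 instances, and any remaining gap would come from the per-kernel durations the two tools measure rather than from the averaging method.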
Some difference is expected, but I am not sure whether a gap of this size means that nsys calculates the average in a different way.
Any thoughts on that?