Difference between nsight-compute and nsys in calculating the average value

Hi,
I would like to know how the Average value is calculated by “nsys profile --trace=cuda”, because I am not able to reproduce it with nv-nsight-cu-cli.

For example, I see

  execute3DconvolutionCuda_split(float*, float*, float*, float*, int, int, int, int, int, int, int, int, int), Block Size 1024, Grid Size 64, Device 0, 6 invocations
    Section: Command line profiler metrics
    Metric Name                            Metric Unit Minimum          Maximum           Average
    -------------------------------------- ----------- ---------------- ----------------- -----------------
    gpu__time_active.avg                   msecond     0.242144         2.102880          1.409056


  execute3DconvolutionCuda_split(float*, float*, float*, float*, int, int, int, int, int, int, int, int, int), Block Size 1024, Grid Size 256, Device 0, 4 invocations
    Section: Command line profiler metrics
    Metric Name                            Metric Unit Minimum           Maximum           Average
    -------------------------------------- ----------- ----------------- ----------------- -----------------
    gpu__time_active.avg                   usecond     927.776000        947.072000        933.528000

and

 Time(%)  Total Time (ns)  Instances    Average     Minimum    Maximum                                                   Name                      
 -------  ---------------  ---------  -----------  ---------  ---------  ----------------------------------------------------------------------------------------------------
    31.4       12,513,663         10  1,251,366.3    252,750  2,158,617  execute3DconvolutionCuda_split(float*, float*, float*, float*, int, int, int, int, int, int, int, i…

As you can see, the total number of invocations from Nsight Compute (6 + 4) matches what nsys reports (10), and the minimum and maximum values are reasonable. Computing the average manually, I use a weighted average, which gives:

Weighted average:
(6 × 1409 + 4 × 933.5) / 10 = 12188 / 10 = 1218.8 useconds

Even with other methods (whether or not they are meaningful in this example), we see the following (a short script reproducing these calculations appears after the list):

1- Arithmetic average:
(1409+933.5)/2 = 1171.25 useconds

2- Weighted harmonic mean:
(6+4)/((6/1409)+(4/933.5)) = 1170.5

3- Weighted geometric mean:
(1409^6 * 933.5^4) ^ (1/(6+4)) = 1195 useconds
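
For reference, here is a minimal Python sketch that recomputes all of these values from the two Nsight Compute group averages and invocation counts shown above, alongside the nsys Average, which equals Total Time / Instances from the nsys table (12,513,663 ns / 10):

```python
import math

# Nsight Compute groups: (invocations, average gpu__time_active.avg in useconds)
groups = [(6, 1409.056), (4, 933.528)]

n = sum(c for c, _ in groups)                                        # 10 invocations
weighted   = sum(c * a for c, a in groups) / n                       # ~1218.8 us
arithmetic = sum(a for _, a in groups) / len(groups)                 # ~1171.3 us
harmonic   = n / sum(c / a for c, a in groups)                       # ~1170.6 us
geometric  = math.exp(sum(c * math.log(a) for c, a in groups) / n)   # ~1195.7 us

# Nsight Systems: the Average column equals Total Time (ns) / Instances.
nsys_avg = 12_513_663 / 10 / 1000.0                                  # ~1251.4 us

print(weighted, arithmetic, harmonic, geometric, nsys_avg)
```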

However, nsys says 1251 useconds.
Although some difference is reasonable, I am not sure whether the remaining gap of roughly 30–80 useconds means that nsys calculates the average in another way.

Any thoughts on that?

The kernel duration in Nsight Compute is not measured using gpu__time_active.avg, but rather gpu__time_duration.sum. You can refer to the SpeedOfLight.section file, or to the corresponding entry on the Details page, for reference. You should also be aware that kernel durations are not measured identically between Nsight Compute and Nsight Systems, due to the different methodologies used to collect the data. Last but not least, Nsight Compute locks the GPU clocks, serializes kernels and API calls, and flushes all caches before kernel launches (see Kernel Profiling Guide :: Nsight Compute Documentation). As such, you shouldn’t expect the values to be the same in general.
