Trtexec: summation of all layers' profiled times is lower than total latency

Hello,

I am using the trtexec sample from TensorRT 7 on Xavier to profile MobileNetV1.

The problem I have is that the summation of averageMs over all the layers reported in the profiling output is noticeably less than the GPU compute time!

This means that the overall latency of the model is higher than the summation of the per-layer latencies (0.448451 ms > 0.38081531 ms).

So I was wondering why that is.
How is each layer's latency measured, such that the summation does not equal the overall latency?
Or is this a bug?

Thanks,

Steps to reproduce:

Command

 ./bin/trtexec --fp16 --uff=./mobilenet_v1_0p25.uff --uffInput=input_1,128,128,3 --uffNHWC --output=dense_3/Softmax --batch=1 --exportTimes=times.json --exportOutput=out.json --exportProfile=prof.json

Console Output

[05/29/2020-17:02:46] [I] Host latency
[05/29/2020-17:02:46] [I] min: 0.410889 ms (end to end 0.416748 ms)
[05/29/2020-17:02:46] [I] max: 0.775757 ms (end to end 0.789551 ms)
[05/29/2020-17:02:46] [I] mean: 0.448451 ms (end to end 0.454948 ms)
[05/29/2020-17:02:46] [I] median: 0.444458 ms (end to end 0.450684 ms)
[05/29/2020-17:02:46] [I] percentile: 0.530518 ms at 99% (end to end 0.539307 ms at 99%)
[05/29/2020-17:02:46] [I] throughput: 2148.53 qps
[05/29/2020-17:02:46] [I] walltime: 3.00066 s
[05/29/2020-17:02:46] [I] GPU Compute
[05/29/2020-17:02:46] [I] min: 0.391113 ms
[05/29/2020-17:02:46] [I] max: 0.74353 ms
[05/29/2020-17:02:46] [I] mean: 0.42534 ms
[05/29/2020-17:02:46] [I] median: 0.421387 ms
[05/29/2020-17:02:46] [I] percentile: 0.504395 ms at 99%
[05/29/2020-17:02:46] [I] total compute time: 2.74217 s
&&&& PASSED TensorRT.trtexec #  ./bin/trtexec --fp16 --uff=./mobilenet_v1_0p25.uff --uffInput=input_1,128,128,3 --uffNHWC --output=dense_3/Softmax --batch=1 --exportTimes=times.json --exportOutput=out.json --exportProfile=prof.json

Contents of prof.json

[
  { "count" : 7874 }
, { "name" : "conv1_pad/Pad", "timeMs" : 104.769, "averageMs" : 0.0133057, "percentage" : 4.01779 }
, { "name" : "(Unnamed Layer* 3) [Shuffle]", "timeMs" : 116.91, "averageMs" : 0.0148476, "percentage" : 4.48337 }
, { "name" : "conv1/convolution + conv1_relu/Relu6 input reformatter 0", "timeMs" : 60.5926, "averageMs" : 0.00769527, "percentage" : 2.32366 }
, { "name" : "conv1/convolution + conv1_relu/Relu6", "timeMs" : 155.5, "averageMs" : 0.0197486, "percentage" : 5.96326 }
, { "name" : "conv_dw_1/depthwise + conv_dw_1_relu/Relu6", "timeMs" : 68.0761, "averageMs" : 0.00864568, "percentage" : 2.61064 }
, { "name" : "conv_pw_1/convolution + conv_pw_1_relu/Relu6", "timeMs" : 121.864, "averageMs" : 0.0154768, "percentage" : 4.67335 }
, { "name" : "conv_pad_2/Pad + conv_dw_2/depthwise + conv_dw_2_relu/Relu6", "timeMs" : 73.93, "averageMs" : 0.00938913, "percentage" : 2.83513 }
, { "name" : "conv_pw_2/convolution + conv_pw_2_relu/Relu6", "timeMs" : 82.048, "averageMs" : 0.0104201, "percentage" : 3.14645 }
, { "name" : "conv_dw_3/depthwise + conv_dw_3_relu/Relu6", "timeMs" : 44.8531, "averageMs" : 0.00569635, "percentage" : 1.72007 }
, { "name" : "conv_pw_3/convolution + conv_pw_3_relu/Relu6", "timeMs" : 95.7952, "averageMs" : 0.012166, "percentage" : 3.67364 }
, { "name" : "conv_pad_4/Pad + conv_dw_4/depthwise + conv_dw_4_relu/Relu6 + conv_pw_4/convolution + conv_pw_4_relu/Relu6", "timeMs" : 155.247, "averageMs" : 0.0197164, "percentage" : 5.95354 }
, { "name" : "conv_dw_5/depthwise + conv_dw_5_relu/Relu6", "timeMs" : 92.3621, "averageMs" : 0.01173, "percentage" : 3.54198 }
, { "name" : "conv_pw_5/convolution + conv_pw_5_relu/Relu6", "timeMs" : 74.8902, "averageMs" : 0.00951108, "percentage" : 2.87196 }
, { "name" : "conv_pad_6/Pad + conv_dw_6/depthwise + conv_dw_6_relu/Relu6", "timeMs" : 48.2069, "averageMs" : 0.00612228, "percentage" : 1.84868 }
, { "name" : "conv_pw_6/convolution + conv_pw_6_relu/Relu6", "timeMs" : 74.3637, "averageMs" : 0.00944421, "percentage" : 2.85177 }
, { "name" : "conv_dw_7/depthwise + conv_dw_7_relu/Relu6 + conv_pw_7/convolution + conv_pw_7_relu/Relu6", "timeMs" : 145.312, "averageMs" : 0.0184547, "percentage" : 5.57256 }
, { "name" : "conv_dw_8/depthwise + conv_dw_8_relu/Relu6 + conv_pw_8/convolution + conv_pw_8_relu/Relu6", "timeMs" : 80.9075, "averageMs" : 0.0102753, "percentage" : 3.10271 }
, { "name" : "conv_dw_9/depthwise + conv_dw_9_relu/Relu6 + conv_pw_9/convolution + conv_pw_9_relu/Relu6", "timeMs" : 81.0611, "averageMs" : 0.0102948, "percentage" : 3.1086 }
, { "name" : "conv_dw_10/depthwise + conv_dw_10_relu/Relu6 + conv_pw_10/convolution + conv_pw_10_relu/Relu6", "timeMs" : 80.2497, "averageMs" : 0.0101917, "percentage" : 3.07749 }
, { "name" : "conv_dw_11/depthwise + conv_dw_11_relu/Relu6 + conv_pw_11/convolution + conv_pw_11_relu/Relu6", "timeMs" : 80.4744, "averageMs" : 0.0102203, "percentage" : 3.0861 }
, { "name" : "conv_pad_12/Pad + conv_dw_12/depthwise + conv_dw_12_relu/Relu6", "timeMs" : 45.9827, "averageMs" : 0.00583982, "percentage" : 1.76339 }
, { "name" : "conv_pw_12/convolution + conv_pw_12_relu/Relu6", "timeMs" : 68.9957, "averageMs" : 0.00876247, "percentage" : 2.64591 }
, { "name" : "conv_dw_13/depthwise + conv_dw_13_relu/Relu6", "timeMs" : 58.7873, "averageMs" : 0.007466, "percentage" : 2.25443 }
, { "name" : "conv_pw_13/convolution + conv_pw_13_relu/Relu6", "timeMs" : 82.5939, "averageMs" : 0.0104894, "percentage" : 3.16738 }
, { "name" : "global_average_pooling2d_1/Mean input reformatter 0", "timeMs" : 55.7543, "averageMs" : 0.00708081, "percentage" : 2.13811 }
, { "name" : "global_average_pooling2d_1/Mean", "timeMs" : 71.4661, "averageMs" : 0.00907621, "percentage" : 2.74064 }
, { "name" : "dense_1/MatMul + dense_1/BiasAdd + dense_1/Relu", "timeMs" : 131.298, "averageMs" : 0.0166749, "percentage" : 5.03512 }
, { "name" : "dense_2/MatMul + dense_2/BiasAdd + dense_2/Relu", "timeMs" : 88.5753, "averageMs" : 0.0112491, "percentage" : 3.39676 }
, { "name" : "dense_3/MatMul + dense_3/BiasAdd", "timeMs" : 92.9879, "averageMs" : 0.0118095, "percentage" : 3.56598 }
, { "name" : "dense_3/Softmax", "timeMs" : 73.7836, "averageMs" : 0.00937054, "percentage" : 2.82952 }
]

The summation of averageMs over all layers is 0.38081531 ms.
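
For reference, here is how I summed the averageMs fields from prof.json (a minimal Python sketch; the first entry only holds the iteration count and has no averageMs field, so it is skipped):

import json

# Sum the per-layer averageMs values from trtexec's --exportProfile output.
with open("prof.json") as f:
    profile = json.load(f)

# The first element only holds {"count": ...}; layer entries carry "averageMs".
layer_sum_ms = sum(entry["averageMs"] for entry in profile if "averageMs" in entry)
print(f"sum of averageMs: {layer_sum_ms:.8f} ms")  # ~0.3808 ms for the profile above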

After looking at what happens with Nsight Systems, I noticed that using --dumpProfile or --exportTimes increases the GPU compute time for two reasons:

  1. There is overhead from the cudaEventRecord calls inserted between layers (see the sketch after this list).
  2. With the profiler enabled, the kernel execution time of each layer is longer than that of the same layer without the profiler.
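
For context on where the per-layer numbers come from: as far as I can tell, trtexec attaches a profiler to the execution context when per-layer profiling is requested, and TensorRT then records and reports a time for every layer of every iteration. A rough Python equivalent of that mechanism (a sketch only, assuming the TensorRT 7 Python bindings and a hypothetical pre-built engine file mobilenet_v1_0p25.engine) looks like this:

import tensorrt as trt

class LayerTimer(trt.IProfiler):
    # Collects the per-layer times that TensorRT reports after each inference.
    def __init__(self):
        super().__init__()
        self.layer_ms = {}

    def report_layer_time(self, layer_name, ms):
        # Called once per layer per iteration; accumulating gives trtexec-style timeMs.
        self.layer_ms[layer_name] = self.layer_ms.get(layer_name, 0.0) + ms

logger = trt.Logger(trt.Logger.INFO)
runtime = trt.Runtime(logger)
with open("mobilenet_v1_0p25.engine", "rb") as f:  # hypothetical engine file name
    engine = runtime.deserialize_cuda_engine(f.read())

context = engine.create_execution_context()
context.profiler = LayerTimer()  # enabling per-layer timing for this context
# ... run inference with context.execute(...) as usual; per-layer times accumulate
# in context.profiler.layer_ms, while end-to-end latency is measured separately.

Timing every layer individually requires extra event recording and synchronization around each layer, which is consistent with observation 1 above.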

So my question now is about the second observation: why does using --dumpProfile increase the execution latency of a layer?

It seems that recording a layer's performance impacts the launch of the next layer!

https://forums.developer.nvidia.com/t/different-sm-frequency-when-using-profiler-on-trtexec/129207