trtexec: the sum of per-layer times reported by profiling is lower than the total latency

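It seems that recording each layer's performance delays the launch of the next layer, so the total latency includes gaps between layers that no per-layer time accounts for.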

https://forums.developer.nvidia.com/t/different-sm-frequency-when-using-profiler-on-trtexec/129207
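
To put a number on the gap, the per-layer times can be summed and subtracted from the total latency. The following is only a minimal sketch. It assumes the per-layer profile has already been loaded as a list of records (for example from the JSON file written by trtexec's `--exportProfile` option) and that each layer record carries an `averageMs` field; that field name and the sample numbers are assumptions for illustration, not measured results, and the helper names are hypothetical.

```python
def sum_layer_ms(profile_entries, key="averageMs"):
    """Sum per-layer times in milliseconds.

    Records without `key` (for example a leading {"count": N} record)
    are skipped. The field name is an assumption about the profile format.
    """
    return sum(entry[key] for entry in profile_entries if key in entry)


def unaccounted_ms(total_latency_ms, profile_entries, key="averageMs"):
    """Time in the total latency that no per-layer record accounts for."""
    return total_latency_ms - sum_layer_ms(profile_entries, key)


# Made-up numbers, for illustration only (not real trtexec output).
sample_profile = [
    {"count": 10},
    {"name": "conv1", "averageMs": 0.5},
    {"name": "relu1", "averageMs": 0.25},
    {"name": "fc", "averageMs": 0.75},
]

layers_total = sum_layer_ms(sample_profile)
gap = unaccounted_ms(2.0, sample_profile)
print(f"layers: {layers_total} ms, unaccounted: {gap} ms")
```

A positive gap is consistent with the observation above: time spent between layer launches shows up in the end-to-end latency but not in any single layer's entry.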