I am using Nsight Systems/Compute to analyze trtexec runs (from the TensorRT samples directory) on a Jetson Xavier, for MobileNetV1 in INT8 precision.
The problem I am facing is that when I pass --dumpProfile, the latencies of all layers increase.
For instance, Nsight Systems shows the first two layers running in:
without --dumpProfile: 195 us, 126 us
with --dumpProfile: 202 us, 182 us
and so on for the rest of the layers…
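For reference, the two runs were along these lines (the model file name is a placeholder, not from my actual setup):

```shell
# Lock the clocks first so both runs see the same frequencies
# (jetson_clocks is the standard tool for this on Jetson).
sudo jetson_clocks

# Baseline run: INT8, no per-layer profiling.
# mobilenet_v1.onnx is a placeholder model file.
./trtexec --onnx=mobilenet_v1.onnx --int8

# Same run with per-layer timing enabled.
./trtexec --onnx=mobilenet_v1.onnx --int8 --dumpProfile
```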
So I looked into Nsight Compute to get a better understanding of why they differ. I found that although the GPU frequency is the same in both cases (I ran jetson_clocks), the SM frequency is higher when --dumpProfile is not used.
So my question is: why does --dumpProfile slow down kernel execution?
Since TensorRT processes the layers sequentially, I assumed that the sum of all per-layer latencies would equal the GPU compute time measured without --dumpProfile. Unfortunately, what I actually see is that the sum of the layer latencies is larger than that GPU compute time.
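The check I am doing can be sketched as follows; all numbers here are hypothetical placeholders, not my actual measurements:

```python
# Sanity check for the assumption that per-layer latencies reported
# by --dumpProfile should add up to the end-to-end GPU compute time.

# Hypothetical per-layer times (us) from a --dumpProfile run.
layer_latencies_us = [202, 182, 150, 140]

# Hypothetical end-to-end GPU compute time (us) without --dumpProfile.
gpu_compute_time_us = 600

profiled_total_us = sum(layer_latencies_us)
overhead_us = profiled_total_us - gpu_compute_time_us

print(f"sum of profiled layers:  {profiled_total_us} us")
print(f"unprofiled GPU compute:  {gpu_compute_time_us} us")
print(f"apparent profiling cost: {overhead_us} us")
```

With these placeholder numbers the profiled layers sum to more than the unprofiled compute time, which is exactly the discrepancy I observe.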