Different SM frequency when using profiler on trtexec

Hello,

I am using Nsight Systems and Nsight Compute to profile trtexec (from the TensorRT samples directory) running MobileNetV1 in INT8 precision on a Jetson Xavier.

The problem I am facing is that when I use --dumpProfile, the latencies of all layers increase.

For instance, as measured with Nsight Systems, the first and second layers take:
without dumpProfile: 195us, 126us
with dumpProfile: 202us, 182us
and so on for the rest of the layers…
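To put the two measurements above in relative terms, here is a small sketch that computes the per-layer increase. It uses only the four durations quoted above; the function name `overhead` is just an illustrative helper.

```python
# Per-layer kernel durations in microseconds, as quoted above (layers 1 and 2).
without_profile_us = [195, 126]
with_profile_us = [202, 182]


def overhead(base_us, profiled_us):
    """Return (absolute increase in us, relative increase in percent) per layer."""
    return [(p - b, 100.0 * (p - b) / b) for b, p in zip(base_us, profiled_us)]


for layer, (delta_us, pct) in enumerate(
        overhead(without_profile_us, with_profile_us), start=1):
    print(f"layer {layer}: +{delta_us} us ({pct:.1f}%)")
```

The increase is 7 us (about 3.6%) for the first layer and 56 us (about 44.4%) for the second, so the slowdown is far from uniform across layers.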

So I looked at the kernels in Nsight Compute to better understand why they differ. I found that although the GPU frequency is the same in both cases (I used jetson_clocks), the SM frequency is higher when dumpProfile is not used.

So my question is: Why does dumpProfile slow down the kernel execution?

Since TensorRT processes the layers in sequence, I assumed that the sum of all layer latencies would equal the GPU compute time measured without dumpProfile. Instead, the sum of the per-layer latencies is larger than the GPU compute time measured without dumpProfile.

Hi,

The profiler can be found in our TensorRT sample:

/usr/src/tensorrt/samples/common/common.h

The added latency is small and is expected.
Recording the execution time of each layer introduces some latency in launching the next layer.
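To illustrate the kind of effect described here, the following is a toy timeline model, not TensorRT code. It assumes that without per-layer timing the CPU queues all launches asynchronously, and that with per-layer timing the CPU waits for each layer to finish before launching the next one. Both the serialization assumption and the 10 us launch cost are hypothetical. The model only captures launch gaps between layers; it does not model any change in SM clock or in the kernel durations themselves.

```python
def total_time_us(kernel_us, launch_us, serialized):
    """Toy timeline: time at which the last kernel finishes.

    kernel_us  -- per-layer GPU kernel durations in microseconds
    launch_us  -- hypothetical CPU cost to launch one layer
    serialized -- if True, the CPU waits for each layer to finish before
                  launching the next (assumed effect of per-layer timing)
    """
    cpu = 0.0  # time at which the CPU is free to issue the next launch
    gpu = 0.0  # time at which the GPU finishes the previous kernel
    for k in kernel_us:
        cpu += launch_us           # CPU issues the launch
        start = max(cpu, gpu)      # kernel starts once launched and GPU is idle
        gpu = start + k
        if serialized:
            cpu = gpu              # CPU blocks until this layer is done
    return gpu


kernels = [195, 126]  # un-profiled durations of layers 1 and 2 from the question
print(total_time_us(kernels, launch_us=10, serialized=False))
print(total_time_us(kernels, launch_us=10, serialized=True))
```

In this model the asynchronous case pays the launch cost once, because later launches overlap with GPU work, while the serialized case pays it once per layer. With many short layers, that per-layer gap adds up.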

Thanks.

Hello AastaLLL,

Thanks for your message.

“Recording the execution time of each layer introduces some latency in launching the next layer.”

That is exactly my question!

Can you explain in more detail why recording the layer latency delays the launch of the next layer? What are the architectural/software implications?

Thanks,

Following up on my previous question: any help is greatly appreciated, and I am eager to learn why this happens. Thanks!