I am using Nsight Systems/Compute to analyze trtexec runs (from the TensorRT samples directory) on a Jetson Xavier, for MobileNetV1 in INT8 precision.
The problem I am facing is that when I pass --dumpProfile, the latencies of all layers increase.
For instance, Nsight Systems shows the first two layers running in:
without --dumpProfile: 195 us, 126 us
with --dumpProfile: 202 us, 182 us
and so on for the rest of the layers…
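For reference, the two runs were along these lines (the model file name is a placeholder, not from my actual setup):

```shell
# Lock the clocks first so both runs see the same frequencies
# (jetson_clocks is the standard tool for this on Jetson).
sudo jetson_clocks

# Baseline run: INT8, no per-layer profiling.
# mobilenet_v1.onnx is a placeholder model file.
./trtexec --onnx=mobilenet_v1.onnx --int8

# Same run with per-layer timing enabled.
./trtexec --onnx=mobilenet_v1.onnx --int8 --dumpProfile
```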
So I looked into Nsight Compute to get a better understanding of why they differ. I found that although the GPU frequency is the same in both cases (I ran jetson_clocks), the SM frequency is higher when --dumpProfile is not used.
So my question is: why does --dumpProfile slow down kernel execution?
Since TensorRT processes the layers sequentially, I assumed that the sum of all per-layer latencies would equal the GPU compute time measured without --dumpProfile. Unfortunately, what I actually see is that the sum of the layer latencies is larger than that GPU compute time.
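The check I am doing can be sketched as follows; all numbers here are hypothetical placeholders, not my actual measurements:

```python
# Sanity check for the assumption that per-layer latencies reported
# by --dumpProfile should add up to the end-to-end GPU compute time.

# Hypothetical per-layer times (us) from a --dumpProfile run.
layer_latencies_us = [202, 182, 150, 140]

# Hypothetical end-to-end GPU compute time (us) without --dumpProfile.
gpu_compute_time_us = 600

profiled_total_us = sum(layer_latencies_us)
overhead_us = profiled_total_us - gpu_compute_time_us

print(f"sum of profiled layers:  {profiled_total_us} us")
print(f"unprofiled GPU compute:  {gpu_compute_time_us} us")
print(f"apparent profiling cost: {overhead_us} us")
```

With these placeholder numbers the profiled layers sum to more than the unprofiled compute time, which is exactly the discrepancy I observe.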